[06:05:44] FIRING: LiftWingServiceErrorRate: ...
[06:05:49] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-quality-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:10:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:55:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[07:00:44] RESOLVED: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:12:43] Hello!
[08:39:31] good morning
[08:42:04] I'm taking a look at the alerts above
[08:49:44] FIRING: LiftWingServiceErrorRate: ...
[08:49:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-quality-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:54:44] RESOLVED: LiftWingServiceErrorRate: ...
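[Editor's note] The LiftWingServiceErrorRate alert above fires when too many responses fall outside the "expected" status codes (2xx, 3xx, and 400). The real rule is a Prometheus expression over Istio metrics; the sketch below is only an illustrative reimplementation of that condition, and the 5% threshold and sample counts are assumptions, not values from the actual alert.

```python
from collections import Counter

def error_rate(status_counts: Counter) -> float:
    """Fraction of responses that are neither 2xx/3xx nor 400.

    Illustrative mirror of the LiftWingServiceErrorRate condition:
    2xx, 3xx and 400 count as expected; everything else (5xx, 429, ...)
    counts toward the error rate.
    """
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    expected = sum(
        n for code, n in status_counts.items()
        if 200 <= code < 400 or code == 400
    )
    return (total - expected) / total

# Invented numbers, not taken from this incident.
counts = Counter({200: 900, 400: 20, 503: 80})
rate = error_rate(counts)
THRESHOLD = 0.05  # assumed threshold, not the real rule's value
print(f"error rate: {rate:.2%}, firing: {rate > THRESHOLD}")
```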
[08:54:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-quality-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:55:11] georgekyz: do you want to go through the alerts together?
[08:55:40] yeah sure
[08:56:31] pinging you in 3'
[08:56:43] ok
[09:34:25] 10Lift-Wing, 06Machine-Learning-Team, 07OKR-Work, 13Patch-For-Review: Update the article-country isvc to use Wikilinks for predictions - https://phabricator.wikimedia.org/T385970#10566952 (10dcausse) Unfortunately I had to pause the backfill, we've burnt quite some budget of our update lag SLO (https://gra...
[09:35:02] we're looking at the Istio access logs dashboard but it seems broken (can't see the requests) - https://logstash.wikimedia.org/goto/fabe47bbf64a4f898f7aadcadbf06fe1
[09:35:28] trying to figure out why it is broken but if anyone has an idea help is welcome
[09:38:15] Morning!
[09:38:20] having a look as well
[09:39:34] we are here if you want to join https://meet.google.com/tew-iwdw-fqn
[09:39:42] thanks for looking into it
[09:47:59] I think Logstash is having an issue. I have poked o11y about it
[09:54:02] we are also seeing something else that is weird (apart from the increased latencies :P ). Although there are 3 pods deployed in the revision-models ns in ml-serve-eqiad, we only see 1 of them in grafana https://grafana.wikimedia.org/goto/dOdYMBcHR?orgId=1
[10:02:33] I see only one pod in serve-codfw/revision-models
[10:02:42] (on the deployment host)
[10:03:02] yes codfw has 1 but eqiad has 3
[10:03:14] The grafana link you sent was for codfw
[10:03:22] yes sry my bad
[10:03:32] and the eqiad dashboard shows three pods :)
[10:04:00] thanks!
my mind was stuck thinking I had the right filters
[10:04:21] it happens. At least you're not doing what I have done in the past: restart the pods on the wrong DC :D
[10:06:50] :D
[10:07:10] istio dashboard working now :)
[10:07:22] Logstash should work again (thanks to Luca and Filippo)
[10:07:37] thanks a million!
[10:08:16] The traffic spike around 09:42 is mostly WME, as far as I can tell.
[10:20:45] yes. all traffic goes to the reference-quality model server (reference-risk and reference-need)
[10:21:27] aiko: we are seeing very high latencies https://grafana.wikimedia.org/goto/QmnH4fcHg?orgId=1
[10:21:27] and high cpu usage on these pods https://grafana.wikimedia.org/goto/FgQHVf5HR?orgId=1
[10:24:36] and it causes throttling both in the preprocess and the predict stage. Anything else we could look into apart from increasing cpu limits?
[11:19:03] * klausman lunch
[11:48:06] 10Lift-Wing, 06Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10567209 (10isarantopoulos) I've managed to build the base vllm image on ml-lab and it is 34GB. Compressing the image brings it down to 7.0GB. I'm proceeding to also build the final image which...
[11:48:26] finally built the base vllm image on ml-lab! 🎉
[12:00:21] 10Lift-Wing, 06Machine-Learning-Team, 10EditCheck: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10567272 (10isarantopoulos)
[12:00:52] georgekyz: --^ I've updated the task description to define the steps/patches we need to make
[12:04:36] isaranto: thnx a lot I will check it
[12:26:20] 10Lift-Wing, 06Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10567375 (10isarantopoulos) I've replicated building locally on ml-lab both images using the Dockerfiles. The final image is 35.7GB and 7.6GB compressed (as expected cause this is also the upstrea...
[12:52:43] isaranto: maybe increase maxReplicas for reference-quality model server?
would it help?
[13:02:14] you're right, it might help. Going to do that and then see about increasing cpu
[13:03:04] it would be good if we figure out what is causing this throttling (mwapi? sqlite?) to see which would be the best way to handle horizontal vs vertical scaling
[13:05:50] increasing it to 5 to start with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1121362
[13:07:42] aiko: ok if I merge the above?
[13:08:51] thank you!
[13:10:57] deployed!
[13:13:04] it seems we'd better also limit the knative autoscaling targets to something lower than 20 rps. revertrisk is much more lightweight as a service and has 15
[13:13:35] the traffic at the moment is a bit less than 20 rps https://grafana.wikimedia.org/goto/B5fZ0BcHg?orgId=1
[13:14:04] which means that probably autoscaling won't kick in.
[13:14:10] * isaranto going for lunch and will check again
[13:14:29] isaranto: grafana shows high latencies in the preprocess but not the predict stage for reference-risk, so it would probably be mwapi, not sqlite, causing the throttling
[13:14:54] I also saw it in predict from the logs, but only after throttling had already happened
[13:15:10] check here for example https://logstash.wikimedia.org/goto/7050fe1163e34736ab4302cc5b563bee
[13:17:48] I didn't remember that sqlite is used in predict, thanks for pointing it out!
[13:18:07] can we tell from the logs whether the request is calling reference-need or reference-risk?
[13:23:29] because for reference-need it looks like both preprocess and predict have high latencies, while for reference-risk only preprocess shows high latencies
[13:25:17] yeah sqlite is used in reference-risk's prediction.
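[Editor's note] The exchange above hinges on how Knative's RPS-based autoscaler sizes a deployment: roughly, enough pods that each stays at or below the per-pod target, clamped to min/maxReplicas. That is why ~19 rps of traffic against a 20 rps target never triggers a scale-out. The sketch below is a simplification of the real autoscaler (which also averages over windows and applies panic-mode logic); maxReplicas=5 matches the patch discussed above, the rest is assumed for illustration.

```python
import math

def desired_replicas(observed_rps: float, target_rps: float,
                     min_replicas: int = 1, max_replicas: int = 5) -> int:
    """Rough model of Knative's RPS autoscaler: ceil(rps / per-pod
    target), clamped to the configured replica bounds."""
    raw = math.ceil(observed_rps / target_rps)
    return max(min_replicas, min(max_replicas, raw))

# With the default target of 20 rps, ~19 rps of traffic never scales out:
print(desired_replicas(19, target_rps=20))   # 1
# Lowering the target to 15 (as revertrisk uses) would add a pod:
print(desired_replicas(19, target_rps=15))   # 2
```

This is also why lowering the autoscaling target (rather than only raising maxReplicas) was the lever being discussed: with sub-target traffic, a higher replica cap alone changes nothing.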
there is no model, just fetching data from the db
[13:25:43] https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/main/knowledge_integrity/models/reference_risk/model.py#L163
[13:37:05] we set the knative autoscaling target to 3 here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/ml-services/revision-models/values.yaml#59
[13:39:24] it should override the 20 at line 50?
[13:48:44] FIRING: LiftWingServiceErrorRate: ...
[13:48:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-quality-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[13:48:45] isaranto: if you have time, could you please print the layers of the docker image that you built in the task?
[13:49:58] yes, I also have the full build logs saved
[13:50:27] I am asking since I'd like to see if we can break down the layers into max 4GB each
[13:50:57] for example, if you RUN something with && and you pip install pytorch, vllm etc., maybe doing multiple RUNs could help (since you'd create more layers)
[13:51:23] ack
[13:51:23] having 34G of image is something really sad, but it is mostly a rant at the ecosystem, not at you :)
[13:51:49] shall I append these to the task or in a paste?
[13:52:07] in the task is fine, is it something 1000 lines long?
[13:52:14] otherwise a paste
[13:53:10] no it is ~20-30
[13:54:25] then it should be fine
[13:58:59] 10Lift-Wing, 06Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10567735 (10isarantopoulos) Below we can see the uncompressed layer sizes for both images.
going to get you the full created_by column as well but I have to modify the output a bit to paste it her...
[14:12:59] 10Lift-Wing, 06Machine-Learning-Team, 07OKR-Work, 13Patch-For-Review: Update the article-country isvc to use Wikilinks for predictions - https://phabricator.wikimedia.org/T385970#10567758 (10Isaac) Thanks for the update -- given that work is still in progress on the keyword and we've paused the use of cirr...
[15:57:58] 10Lift-Wing, 06Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10568210 (10isarantopoulos) These are the base image layers with the full instructions for each one. It seems that there is a 29GB layer that we could break into smaller ones. ` IMAGE... [16:23:05] 10Lift-Wing, 06Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10568321 (10isarantopoulos) Looking at the large layer and the things it installs, we focus on `rocm-dev` and `rocm-libs`, which are meta packages and contain the following: ` apt-cache depends rocm-...
[16:25:51] now there is no traffic on reference-quality so we won't know; however, I'll make a patch to reduce the autoscaling limits
[16:25:56] *triggers not limits
[16:30:06] actually there is just one pod serving 500s
[16:30:15] :(
[16:34:31] hmm there is no traffic at the moment. last request was made @ 14:50 UTC.
However I do see 503 error messages on the istio dashboard and can't understand what this is https://grafana.wikimedia.org/goto/49txXf5HR?orgId=1
[16:35:09] I also saw 500s from the MWAPI, but it was just before the mtg
[16:35:39] yes these were the ones I saw coming from mwapi
[17:02:02] (03PS1) 10Gkyziridis: inference-services: add peacock dummy model service [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100)
[17:02:17] * isaranto afk
[17:02:48] (03PS2) 10Gkyziridis: inference-services: add peacock dummy model service [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100)
[17:48:44] FIRING: LiftWingServiceErrorRate: ...
[17:48:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-quality-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[17:49:32] I'll add a silence for that ^^^
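[Editor's note] Earlier in the log, the suggestion for the ~34GB vllm image was to split one big `RUN` into several, since each `RUN` instruction produces one layer and the goal was to keep every layer under ~4GB. The sketch below shows that budget check over `docker history`-style output (size in bytes, tab, creating instruction). The sample rows are invented for illustration, not the actual image's layers.

```python
# Flag image layers above a size budget. Input mimics
# `docker history --human=false --format '{{.Size}}\t{{.CreatedBy}}'`.
MAX_LAYER = 4 * 1024**3  # the ~4GB per-layer budget mentioned in the log

# Invented sample, loosely shaped like the rocm/vllm image discussion.
sample_history = """\
29000000000\tRUN apt-get install -y rocm-dev rocm-libs
1200000000\tRUN pip install vllm
500000000\tCOPY model_server /srv/app
"""

def oversized_layers(history: str, budget: int = MAX_LAYER):
    """Yield (size_bytes, instruction) for layers exceeding the budget."""
    for line in history.splitlines():
        size, created_by = line.split("\t", 1)
        if int(size) > budget:
            yield int(size), created_by

for size, instr in oversized_layers(sample_history):
    print(f"{size / 1024**3:.1f} GiB  {instr}")
```

Since each `RUN` is one layer, splitting the single huge `apt-get` instruction into several `RUN`s (e.g. one per ROCm meta-package, as the `apt-cache depends` breakdown in the task suggests) would spread that size across multiple layers, at the cost of slightly more apt metadata per layer.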