[07:20:44] FIRING: LiftWingServiceErrorRate: ...
[07:20:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=ptwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[07:25:44] RESOLVED: LiftWingServiceErrorRate: ...
[07:25:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=ptwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:10:32] Hello!
[08:23:07] did anybody look at the above alert?
[08:44:33] kevinbazira: o/ since you're looking into the optimum benchmark, could you see if you can run a benchmark with a similar configuration as llmperf and produce the input/output heatmaps that we want to have?
[08:46:21] I mean after you work on simplifying the way to run it on ml-lab
[08:46:40] that would make more sense since we would be able to iterate faster
[09:26:40] isaranto: o/ okok I'll look into producing the heatmaps too.
[09:27:29] thank you. I will look into the optimum configuration as well later today
[09:32:34] ack!
[10:24:24] Morning!
[10:25:18] I've taken a quick look at the Istio and kserve logs. Istio sees a bunch of 502s from o-legacy, but there are no elevated error counts in the backing service itself. Bursty traffic though. So it looks like the "service is stuck" problem
[10:35:11] Ack thanks for looking into it!
[10:56:25] (Abandoned) Nik Gkountas: store in diskcache the process id of the worker that updates the cache [research/recommendation-api] - https://gerrit.wikimedia.org/r/1102875 (owner: Nik Gkountas)
[10:56:57] (CR) Nik Gkountas: [C:+2] Run cache updater task in all workers [research/recommendation-api] - https://gerrit.wikimedia.org/r/1102926 (owner: Sbisson)
[10:58:25] (CR) CI reject: [V:-1] Run cache updater task in all workers [research/recommendation-api] - https://gerrit.wikimedia.org/r/1102926 (owner: Sbisson)
[11:16:03] * klausman lunch
[11:48:12] (CR) KartikMistry: "recheck" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1102926 (owner: Sbisson)
[11:51:14] (CR) KartikMistry: [C:+2] Run cache updater task in all workers [research/recommendation-api] - https://gerrit.wikimedia.org/r/1102926 (owner: Sbisson)
[11:51:53] (Merged) jenkins-bot: Run cache updater task in all workers [research/recommendation-api] - https://gerrit.wikimedia.org/r/1102926 (owner: Sbisson)
[14:47:29] Lift-Wing, Machine-Learning-Team, OKR-Work, Patch-For-Review: Request to host article-country model on Lift Wing - https://phabricator.wikimedia.org/T371897#10403210 (Isaac) So exciting to see -- thanks @kevinbazira ! Sounds like we can also now move the model card from Proposed to Production :)...
[15:26:27] (PS1) Sbisson: Smoke test to check deployments [research/recommendation-api] - https://gerrit.wikimedia.org/r/1103355
[15:27:34] (PS2) Sbisson: Smoke test to check deployments [research/recommendation-api] - https://gerrit.wikimedia.org/r/1103355
[16:11:12] (PS3) Sbisson: Smoke test to check deployments [research/recommendation-api] - https://gerrit.wikimedia.org/r/1103355
[17:48:32] Machine-Learning-Team, ORES, FY2024-25 WE4.2, Moderator-Tools-Team (Kanban), Spike: [SPIKE] Investigate how to install ORES in idwiki [8HRS] - https://phabricator.wikimedia.org/T374077#10403667 (Kgraessle) Open→Resolved a:Kgraessle
[17:51:17] Going afk, have a nice weekend everyone!
[18:10:34] o/ here is how we can easily run the HF optimum benchmark on ml-lab:
[18:10:34] https://gitlab.wikimedia.org/repos/machine-learning/huggingface-optimum-benchmark-automation
[18:10:35] try it and let me know whether it works for you :)
[18:11:26] also going afk! 👋
[22:28:13] (CR) CI reject: [V:-1] build: Updating nanoid to 3.3.8 [extensions/ORES] (REL1_41) - https://gerrit.wikimedia.org/r/1103508 (owner: Libraryupgrader)
[22:36:03] (CR) CI reject: [V:-1] build: Updating nanoid to 3.3.8 [extensions/ORES] (REL1_42) - https://gerrit.wikimedia.org/r/1103509 (owner: Libraryupgrader)
[22:39:31] (CR) CI reject: [V:-1] build: Updating nanoid to 3.3.8 [extensions/ORES] (REL1_43) - https://gerrit.wikimedia.org/r/1103510 (owner: Libraryupgrader)