[00:37:59] FIRING: LiftWingServiceErrorRate: ...
[00:37:59] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[04:37:59] FIRING: LiftWingServiceErrorRate: ...
[04:37:59] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:37:06] good morning.
[07:03:59] good morning :)
[07:30:35] Machine-Learning-Team, Essential-Work: Upgrade revscoring model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400350#11063507 (OKarakaya-WMF) I could build the project minimum with `scikit-learn==1.2.0` However, this version raises following error in inference time. ` r...
[07:33:43] Machine-Learning-Team: Error in revscoring-editquality-damaging - itwiki-damaging-predictor-default - https://phabricator.wikimedia.org/T401109#11063525 (Joe) > We have 5M successful responses in the same period of time. Sorry I hadn't realized the number of requests was this large, it reframes the problem a...
[07:36:18] Machine-Learning-Team, Essential-Work, Patch-For-Review: Upgrade reability model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400352#11063544 (OKarakaya-WMF) As discussed with @BWojtowicz-WMF , old catboost version does not work after the upgrades. Therefore, we will...
[07:45:05] Machine-Learning-Team: Error in revscoring-editquality-damaging - itwiki-damaging-predictor-default - https://phabricator.wikimedia.org/T401109#11063591 (OKarakaya-WMF) cool, indeed I have an untested branch for 503 re-tries: https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-se...
[08:37:59] FIRING: LiftWingServiceErrorRate: ...
[08:37:59] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:10:37] Machine-Learning-Team, Essential-Work, Patch-For-Review: Upgrade revertrisk model server from the debian bullseye base image to bookworm. - https://phabricator.wikimedia.org/T400266#11063965 (gkyziridis) Open→Resolved
[09:22:14] Lift-Wing, Machine-Learning-Team, EditCheck, SRE-SLO, Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11064019 (gkyziridis) Hey @elukey thnx for sharing this issue. I have a question: Is this issue blocking the A/B testi...
[09:47:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:06:48] prod issue about revertrisk: https://wikimedia.slack.com/archives/G01A0FNPLG4/p1754474765246089
[10:07:04] this is about the alarm above
[10:18:48] hello @bartosz, @georgekyz, was this deployed in the last hour? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1175888
[10:19:20] ozge_: yeao
[10:19:22] yeap*
[10:20:23] we started to get the error I shared in Slack. looks like it started in the last hour
[10:20:29] should we revert the change?
[10:29:24] I am not sure, I am looking into what could have happened
[10:29:40] the deployment was only for the language-agnostic model
[10:38:00] I will open a patch for that right now. What I see is the following:
[10:38:00] 1. There are two versions for both revertrisk-language-agnostic and for revertrisk-multilingual
[10:38:00] 2. I am opening a patch for adding the latest image for both revertrisk-language-agnostic and revertrisk-language-agnostic-pre-save
[10:38:00] 3. For the multilingual one, revertrisk-multilingual has an older image version than revertrisk-multilingual-pre-save; the pre-save one has the same image as in staging, while revertrisk-multilingual has an older one.
[10:40:17] it turned out to be related to this issue: https://phabricator.wikimedia.org/T399437#11031178. https://wikimedia.slack.com/archives/G01A0FNPLG4/p1754476681938279?thread_ts=1754474765.246089&cid=G01A0FNPLG4
[10:41:09] It seems we fixed this issue for the pre-save one, but not for the regular one 🙈
[10:41:17] interesting that it comes up only today tho
[10:42:30] that was confusing. and it started just after the latest release. I suppose it was not deployed without git.
[10:43:32] I'll create an MR
[10:44:20] Check those two files:
[10:44:21] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/ml-services/revertrisk/values-ml-staging-codfw.yaml#4
[10:44:21] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/ml-services/revertrisk/values.yaml#83
[10:45:18] What I will do is the following: I will add exactly the image that we are using on staging to both versions (pre-save and normal one) for both multilingual and language-agnostic
[10:49:27] patch ready please review: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1176199
[10:49:38] ozge_: bartosz: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1176199
[10:51:15] ozge_: merge it ??
[10:53:26] approved! thanks
[10:54:26] thank you both folks! I am merging it right now
[10:54:31] I will proceed with the deployment
[10:58:18] 🙌
[11:00:38] Deployed!
[11:00:57] 🤘
[11:01:05] Machine-Learning-Team: Numpy is not available in revertrisk_multilingual - https://phabricator.wikimedia.org/T401305 (OKarakaya-WMF) NEW
[11:02:15] cool, I'll monitor for a while. I created an issue to keep some information
[11:06:18] Machine-Learning-Team: Numpy is not available in revertrisk_multilingual - https://phabricator.wikimedia.org/T401305#11064494 (OKarakaya-WMF) after investigating with @BWojtowicz-WMF and @gkyziridis , we found out that this was related to a previous issue and we got the same error before: https://phabricator...
[11:07:41] thnx @ozge_
[11:07:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[11:07:49] crap...
[11:08:17] ah this is not related to revertrisk thankfully
[11:09:02] it's a known issue: https://phabricator.wikimedia.org/T401109
[11:13:20] alright, at least we do not have any incidents for revertrisk after the last deployment, right?
[11:39:04] https://usercontent.irccloud-cdn.com/file/R6nxeaSX/image.png
[11:39:22] awesome, it has stopped raising errors
[12:19:24] Machine-Learning-Team: Error in revscoring-editquality-damaging - itwiki-damaging-predictor-default - https://phabricator.wikimedia.org/T401109#11064781 (elukey) >>! In T401109#11063591, @OKarakaya-WMF wrote: > cool, indeed I have an untested branch for 503 re-tries: > https://gerrit.wikimedia.org/r/plugins/...
[12:20:35] Lift-Wing, Machine-Learning-Team, EditCheck, SRE-SLO, Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11064788 (elukey) Hey @gkyziridis, nono this is something related to the SLO itself, we'll need to review the targets...
[12:23:01] really nice work folks!
[15:07:59] FIRING: LiftWingServiceErrorRate: ...
[15:07:59] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:32:44] RESOLVED: LiftWingServiceErrorRate: ...
[15:32:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[16:08:44] FIRING: LiftWingServiceErrorRate: ...
[16:08:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[20:09:31] FIRING: LiftWingServiceErrorRate: ...
[20:09:31] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate