[01:38:59] FIRING: LiftWingServiceErrorRate: ...
[01:38:59] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[03:11:23] (PS2) AikoChou: events: construct new prediction classification event independently [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1191305 (https://phabricator.wikimedia.org/T405067)
[03:14:19] (CR) AikoChou: events: construct new prediction classification event independently (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1191305 (https://phabricator.wikimedia.org/T405067) (owner: AikoChou)
[03:21:42] (CR) AikoChou: "Thanks for working on this! I left a few comments :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1191038 (https://phabricator.wikimedia.org/T371021) (owner: Bartosz Wójtowicz)
[04:35:02] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11223140 (achou) Based on the discussion with @Michael, we're thinking to add...
[05:38:59] FIRING: LiftWingServiceErrorRate: ...
[05:38:59] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[05:48:44] RESOLVED: LiftWingServiceErrorRate: ...
[05:48:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:07:15] good morning!
[06:19:44] FIRING: LiftWingServiceErrorRate: ...
[06:19:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:22:52] --^ this has occurred again and there is a task for it: https://phabricator.wikimedia.org/T403709
[06:23:06] but it has no extra info -- what was the root cause etc
[06:23:37] if it were the increased preprocessing times we should enable multiprocessing
[06:28:03] ack. looking ...
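[Editor's note: for context, the alert above flags requests whose response code is not 2xx, 3xx or 400. A minimal sketch of how one might reproduce that check by hand against Prometheus, assuming the standard Istio `istio_requests_total` metric and a placeholder Prometheus URL; the authoritative alert expression lives in the Wikimedia alerting config, not here:]

    # Placeholder Prometheus endpoint for the k8s-mlserve cluster; the real URL differs.
    PROM="http://prometheus.example:9090"
    # Fraction of requests to the itwiki-damaging backend that returned neither
    # 2xx, 3xx nor 400 over the last 5 minutes.
    curl -sG "${PROM}/api/v1/query" --data-urlencode 'query=
      sum(rate(istio_requests_total{destination_service=~"itwiki-damaging-predictor-default.*", response_code!~"2..|3..|400"}[5m]))
      /
      sum(rate(istio_requests_total{destination_service=~"itwiki-damaging-predictor-default.*"}[5m]))' | jq .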
[07:03:10] good morning
[07:40:20] indeed that's a recurring error reported in both:
[07:40:20] https://phabricator.wikimedia.org/T401109
[07:40:20] https://phabricator.wikimedia.org/T403709
[07:45:04] logstash logs: https://logstash.wikimedia.org/goto/b2c2b2f7df3f7b6cf9988a363d954c17
[07:45:04] problem shown in the logs: `ERROR:root:An error has occurred while fetching feature values from the MW API`
[07:54:45] Machine-Learning-Team: Increased 5xx error rate in revscoring itwiki damaging - https://phabricator.wikimedia.org/T403709#11223334 (kevinbazira) This issue is recurring, today (29/09/2025) an alert for the itwiki-damaging-predictor service was triggered with the following information: ` itwiki-damaging-predi...
[08:14:27] kevinbazira: the issue seems to be confined to codfw this time, eqiad looks good afaics
[08:14:37] there were a lot of scale ups/downs for itwiki
[08:14:38] https://grafana.wikimedia.org/d/c6GYmqdnz/knative-serving?orgId=1&from=now-6h&to=now&timezone=utc&var-cluster=D-2kXvZnk&var-knative_namespace=knative-serving&var-revisions_namespace=revscoring-editquality-damaging&viewPanel=panel-24
[08:16:30] it would be interesting to understand if the timeouts were a problem on the MediaWiki side or a side effect of something else (like a general slowdown)
[08:16:36] yes, it's within codfw. likely because of the recent data center switchover, traffic is hitting codfw as the primary DC
[08:18:33] sure
[08:18:36] https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=D-2kXvZnk&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default-00029&from=now-6h&to=now&timezone=utc&var-response_code=$__all&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99
[08:18:47] there seem to be two relevant things:
[08:19:15] 1) there is constant traffic with return code "0", which in Istio terms means the client giving up after a while.. not sure if it is expected
[08:19:39] 2) there was a rise in HTTP 200 traffic, which could explain the scale up and down
[08:20:03] so the timeouts may have happened while scale ups happened
[08:21:17] what I mean is - let's try to figure out what happened before the timeouts, not stopping only at those errors etc..
[08:23:58] yep, on it. I am digging through the logs: https://phabricator.wikimedia.org/P83478
[08:24:03] ozge@deploy2002:~$ kubectl logs itwiki-damaging-predictor-default-00029-deployment-7cc74b6ww7mf | grep "POST /v1/models/itwiki-damaging%3Apredict HTTP/1.1"
[08:24:03] ozge@deploy2002:~$ kubectl get pods | grep itwiki
[08:24:03] itwiki-damaging-predictor-default-00029-deployment-7cc74b6dmlvt   3/3   Running   0   12m
[08:24:04] itwiki-damaging-predictor-default-00029-deployment-7cc74b6ww7mf   3/3   Running   0   5d2h
[08:24:04] I think this happened before. The old pod has never returned a successful response. Our guess was that the sidecar didn't start properly.
[08:24:32] @kevinbazira it would be great to validate though
[08:27:30] ozge_: this may be a consequence of scaling pods up and down in a relatively short time window. From the above graph it seems that we jumped from 1 to 2 (and 3) several times, only to then kill the pods again
[08:38:30] ozge_: Hey alles goed? (all good?) I remember you mentioned something last week about the ml-pipelines CI/CD. Is it possible to leave a comment in the phab task whenever you find some time? The task is here: https://phabricator.wikimedia.org/T404717#11209119
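[Editor's note: re the "return code 0" traffic elukey mentions above — a minimal sketch for telling client disconnects apart from upstream timeouts using the istio-proxy access logs, assuming they are JSON and include Envoy's standard response_flags field (DC = downstream client closed the connection, UT = upstream request timeout); pod and namespace names are the ones from this log:]

    NS=revscoring-editquality-damaging
    POD=itwiki-damaging-predictor-default-00029-deployment-7cc74b6ww7mf
    # Tally response codes together with Envoy response flags: a pile of "0 DC" means
    # clients giving up, while "UT" would point at the kserve container timing out.
    kubectl -n "$NS" logs "$POD" -c istio-proxy \
      | jq -r '"\(.response_code) \(.response_flags)"' \
      | sort | uniq -c | sort -rn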
[08:39:12] ozge_: I tried to test it as much as I could, but I am not sure if this works exactly as we want for each of the models. Please, whenever you have time, cast an eye over the ticket and leave a comment. I can continue testing/working on it.
[08:52:09] @georgekyz dankuwel! even kijken :) 👀 (thank you! let me have a look)
[08:53:16] 👍
[09:14:05] (PS5) Bartosz Wójtowicz: articletopic: Add `page_id` parameter to the articletopic model. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1191038 (https://phabricator.wikimedia.org/T371021)
[09:19:42] (CR) Bartosz Wójtowicz: articletopic: Add `page_id` parameter to the articletopic model. (4 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1191038 (https://phabricator.wikimedia.org/T371021) (owner: Bartosz Wójtowicz)
[09:45:06] elukey: o/ re: timeouts may have happened while scale ups happened
[09:45:06] based on the info you shared, I looked at the KServe container memory usage and it didn't exceed the 2GB limit: https://grafana.wikimedia.org/goto/JtgWnp3NR?orgId=1
[09:45:06] now I am wondering why the scaling events happened.
[09:47:53] scaling for knative usually happens based on the rps being handled
[09:48:22] we set some specific values for each isvc, and knative usually takes action when ~70% of the limit set is reached
[09:49:44] RESOLVED: LiftWingServiceErrorRate: ...
[09:49:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:50:58] Machine-Learning-Team: Add support for K8s 1.23 on Trixie - https://phabricator.wikimedia.org/T405891 (elukey) NEW
[09:52:36] Machine-Learning-Team: Add support for K8s 1.23 on Trixie - https://phabricator.wikimedia.org/T405891#11223939 (elukey) These should be the packages to copy over to trixie-wikimedia: ` elukey@ml-serve1009:~$ dpkg -l | egrep 'kube|istio|cni' ii calico-cni 3.23.3-1...
[09:53:51] kevinbazira: you can find the following annotations in deployment-charts:
[09:53:53] autoscaling.knative.dev/metric: "rps"
[09:53:53] autoscaling.knative.dev/target: "10"
[09:57:25] great, thank you. I've seen them: https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/ml-services/revscoring-editquality-damaging/values.yaml#L49-L50
[10:14:24] @georgekyz I think it looks awesome. I only had a small change: https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/commit/a21f5c3c92e11384e7ad7145e465fb041e90079e https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/commit/c661b3605da12d84a2dab2633aab83a9edbb2a22 . These two pipelines were being triggered on the first push to the branch.
[10:17:33] ozge_: Thank you so much for your time mate! Have you already merged those changes?
[10:18:50] indeed, unfortunately, for testing
[10:19:15] perfeeeect!
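[Editor's note: a minimal sketch for checking the knative autoscaling settings mentioned above directly on the cluster, assuming the standard Knative Serving objects; the revision name is the one from this incident:]

    NS=revscoring-editquality-damaging
    # The rps metric and target discussed above end up as annotations on the Knative revision.
    kubectl -n "$NS" get revision itwiki-damaging-predictor-default-00029 -o yaml \
      | grep -E 'autoscaling.knative.dev/(metric|target)'
    # Knative's PodAutoscaler shows the scale it derived from those targets (desired vs actual).
    kubectl -n "$NS" get podautoscalers.autoscaling.internal.knative.dev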
[10:21:44] FIRING: LiftWingServiceErrorRate: ...
[10:21:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:24:06] ozge_: it is the same issue, checked the logs :(
[10:24:16] kevinbazira: only one pod is emitting those errors, the other one is fine
[10:24:30] same thing that ozge_ found the last time this happened
[10:25:20] so do we restart this pod?
[10:25:25] thank you @elukey for confirming
[10:26:25] errors are from: `itwiki-damaging-predictor-default-00029-deployment-7cc74b6ww7mf`
[10:26:37] I think we can take a look at how the sidecar creates this connection and whether it is possible to fail or retry. Need to find where it's implemented though.
[10:27:25] the funny thing is this one
[10:27:28] root@deploy1003:~# kubectl logs itwiki-damaging-predictor-default-00029-deployment-7cc74b6jllbq -n revscoring-editquality-damaging istio-proxy | tail -n 1000 | jq '.response_code' | uniq
[10:27:28] 200
[10:27:51] ah ok wrong pod
[10:27:53] lemme recheck
[10:28:32] ok yes makes sense
[10:28:35] root@deploy1003:~# kubectl logs itwiki-damaging-predictor-default-00029-deployment-7cc74b6ww7mf -n revscoring-editquality-damaging istio-proxy | tail -n 1000 | jq '.response_code' | sort | uniq -c
[10:28:35] 517 0
[10:28:35] 483 200
[10:31:00] so are there actually some successful responses from the sidecar, or are they health checks?
[10:33:47] from a quick glance, the latter (/metrics too etc..)
[10:37:39] Machine-Learning-Team: Increased 5xx error rate in revscoring itwiki damaging - https://phabricator.wikimedia.org/T403709#11224066 (elukey) It seems to be again one old pod (out of two available) blackholing some traffic. This is what a non-health-check/metrics log looks like in istio-proxy: ` { "response...
[10:37:46] added some examples --^
[10:37:59] clients reach 60s and then they give up
[10:38:51] I am also seeing a lot of these
[10:38:52] 2025-09-29 10:24:55.083 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 300.6564943790436
[10:41:06] If this is related to service scaling, could it be that the pod signals readiness too early? In that case it already gets requests before it can serve them, and they get stuck in limbo until the client disconnects
[10:41:51] the pod has been up for ~5 days afaics
[10:42:43] ok, that kills that hypothesis :)
[10:44:04] I think at this point we should save the logs and kill the pod
[10:44:17] klausman: do you have time to do it?
[10:44:22] task is https://phabricator.wikimedia.org/T403709
[10:44:52] Will do it in a bit, I'm in the middle of a fish curry :)
[10:45:08] lemme do it then
[10:45:15] ty!
[10:49:03] Machine-Learning-Team: Increased 5xx error rate in revscoring itwiki damaging - https://phabricator.wikimedia.org/T403709#11224112 (elukey) Saved all the logs on `deploy1003:/home/elukey/T403709` and killed the old pod.
[10:49:40] kevinbazira: pod killed, we should see recovery.. I added all the logs to my home dir on deploy1003 (see task for more details)
[10:50:33] thanks a lot elukey! going to continue monitoring this service
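[Editor's note: a sketch of the kind of commands behind "saved all the logs and killed the old pod" above; the dump directory used on deploy1003 is documented in the task, the one here is illustrative:]

    NS=revscoring-editquality-damaging
    POD=itwiki-damaging-predictor-default-00029-deployment-7cc74b6ww7mf
    OUT="$HOME/T403709"   # illustrative dump directory
    mkdir -p "$OUT"
    # Keep the logs of every container in the pod before deleting it, so the evidence survives.
    for c in $(kubectl -n "$NS" get pod "$POD" -o jsonpath='{.spec.containers[*].name}'); do
      kubectl -n "$NS" logs "$POD" -c "$c" > "$OUT/${POD}-${c}.log"
    done
    # Deleting the pod lets the ReplicaSet behind the Knative revision start a fresh one.
    kubectl -n "$NS" delete pod "$POD"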
[10:51:44] RESOLVED: LiftWingServiceErrorRate: ...
[10:51:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:54:54] goood
[10:56:25] molte grazie! (many thanks!)
[10:57:00] ahhaha perfect italian Kevin
[10:57:25] thanks to google translate :)
[13:30:15] Machine-Learning-Team: Add support for K8s 1.23 on Trixie - https://phabricator.wikimedia.org/T405891#11224551 (elukey) Packages copied to the new components in trixie-wikimedia, the next step is to test a kubernetes worker :)
[16:18:13] (CR) Ottomata: events: construct new prediction classification event independently (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1191305 (https://phabricator.wikimedia.org/T405067) (owner: AikoChou)
[21:36:00] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11227024 (Eevans) >>! In T401021#11223140, @achou wrote: > > [ ... ] > > Updat...
[23:40:33] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review: Data Persistence Design Review: Article topic model caching - https://phabricator.wikimedia.org/T402984#11227502 (Eevans) >>! In T402984#11192876, @BWojtowicz-WMF wrote: > **Why do we need Cache** > > Machine Learning Team deci...