[04:36:13] (PS1) Kevin Bazira: article-country: add support for wikilink-related predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125661 (https://phabricator.wikimedia.org/T385970)
[04:42:39] (CR) Kevin Bazira: "for more context, key requirements for this task were discussed in https://phabricator.wikimedia.org/P73436 with Isaac from the Research T" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125661 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[08:04:12] Good morning!
[08:10:03] isaranto: o/
[08:10:07] not urgent but I noticed https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DHelmReleaseBadStatus
[08:21:06] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10617190 (elukey) Open→Stalled This task is stalled until T387854 is completed.
[08:23:52] o/ elukey. Thanks for bringing this up. I'll take a look in a bit
[08:24:22] Buongiorno! (Good morning!)
[08:29:58] Kalimera! (Good morning!)
[08:39:36] Good morning all
[09:22:35] \o George
[09:22:44] I just re-deployed in that ns and the alert went away
[09:39:51] Lift-Wing, Machine-Learning-Team, EditCheck: Investigate options for providing beta cluster / patchdemo access to liftwing staging - https://phabricator.wikimedia.org/T388269#10617536 (isarantopoulos)
[10:03:21] Lift-Wing, Machine-Learning-Team, EditCheck: Investigate options for providing beta cluster / patchdemo access to liftwing staging - https://phabricator.wikimedia.org/T388269#10617652 (isarantopoulos) Beta cluster is deployed on cloud VPS and I think the same holds true for patchdemo. Access to LiftW...
[10:05:48] isaranto: What is ns :P ??
[10:06:26] 🤭
[10:10:18] ah sorry, ns is an abbreviation for `namespace`
[10:12:59] ah alright
[10:14:01] Ilias is too cloud-native at this stage
[10:15:00] lol
[10:15:17] I apologize :D
[10:15:49] I am trying to use cloud/infra jargon to give the impression that I know stuff :P
[10:20:01] 🤣
[11:13:37] artificial-intelligence, Machine-Learning-Team, Edit-Review-Improvements-RC-Page, editquality-modeling, and 3 others: Enable ORES in RecentChanges for Hindi Wikipedia - https://phabricator.wikimedia.org/T303293#10617871 (Michael) Recent Changes has been moved to the Moderator Tools Team.
[11:59:47] (PS1) Gkyziridis: inference-services: Develop loading peacock model logic. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100)
[12:01:07] (CR) CI reject: [V:-1] inference-services: Develop loading peacock model logic. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[12:06:32] (PS2) Gkyziridis: inference-services: Develop loading peacock model logic. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100)
[12:07:06] (CR) Gkyziridis: inference-services: Develop loading peacock model logic. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[14:25:50] (PS1) Ilias Sarantopoulos: reference-quality: revert multiple workers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126061 (https://phabricator.wikimedia.org/T387019)
[14:35:25] Machine-Learning-Team, collaboration-services, Discovery-Search (2025.03.01 - 2025.03.21), Wikipedia-iOS-App-Backlog (iOS Release FY2024-25): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#10619055 (Jelto) I reached out to the machine learning...
[14:40:05] (PS2) Ilias Sarantopoulos: reference-quality: revert multiple workers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126061 (https://phabricator.wikimedia.org/T387019)
[14:47:00] (PS1) Ilias Sarantopoulos: reference-quality: set reference-need batch size through env var [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126070 (https://phabricator.wikimedia.org/T387019)
[14:47:41] (CR) CI reject: [V:-1] reference-quality: set reference-need batch size through env var [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126070 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[14:50:42] (CR) Ilias Sarantopoulos: "Needs to be merged after https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/merge_requests/53" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126070 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[15:00:12] klausman: o/ any idea about https://phabricator.wikimedia.org/T388269
[15:00:14] ?
[15:00:26] looking...
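The 14:47 patch above moves the reference-need batch size into an environment variable. As a minimal sketch of that pattern: the variable name, default, and helper below are invented for illustration and are not the actual change from the Gerrit patch.

```python
import os


def get_batch_size(env_var: str = "REFERENCE_NEED_BATCH_SIZE", default: int = 16) -> int:
    """Read a batch size from the environment, falling back to a default.

    Both the variable name and the default are illustrative guesses,
    not the values used in the real inference-services change.
    """
    raw = os.environ.get(env_var)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{env_var} must be an integer, got {raw!r}")
    if value < 1:
        raise ValueError(f"{env_var} must be >= 1, got {value}")
    return value
```

Validating at startup like this fails fast on a bad Helm value instead of producing confusing behavior at predict time.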
[15:00:41] I remember we have mentioned this in the past but I don't recall how difficult it is
[15:02:33] The only way in from cloudvps is through the API GW, which is currently not set up for staging at all. It would not be impossible to change that, but remember that anything accessible from cloudvps is accessible to everyone.
[15:08:45] ok, just as I remembered/feared then... :D
[15:28:46] Machine-Learning-Team, EditCheck, Editing-team, VisualEditor: Evaluate efficacy of Peacock Check model output - https://phabricator.wikimedia.org/T384651#10619408 (ppelberg)
[16:21:24] (PS1) Ilias Sarantopoulos: reference-quality: allow to deploy models separately [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126094 (https://phabricator.wikimedia.org/T387019)
[16:21:45] I set the above as WIP, need to test it first
[16:37:39] artificial-intelligence, research-ideas: Modeling political orientation of editors - https://phabricator.wikimedia.org/T166288#10619823 (Isaac) Open→Declined I came across it in a research backlog and there's been some research over the past few years that I think helps with understanding in...
[16:38:43] * isaranto afk!
[17:23:22] heads-up, the API gateway is paging for lw_inference_reference_need_cluster
[17:25:04] also getting timeouts generally for many liftwing services
[17:25:50] just reference_need and reference_risk, to clarify
[17:28:32] preprocess_ms: 196025.131702423, explain_ms: 0, predict_ms: 395.878791809, postprocess_ms: 0.00786781
[17:28:57] this is a known issue sigh
[17:30:05] it seems it's the CPU-bound code that freezes the ioloop again
[17:30:28] isaranto: around?
or aiko or klausman
[17:32:52] confirmed, see
[17:32:52] https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=thanos&var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-pod=reference-quality-predictor-00011-deployment-6cddf4585b-6j5fb&var-container=All&from=now-3h&to=now
[17:34:32] and it seems all WME
[17:34:40] o/ I am here now
[17:34:57] https://logstash.wikimedia.org/goto/e539b0adffd339dd1ccc79b5f1d11767
[17:35:07] thank you both. Unfortunately this is a known issue and we've been trying to fix it.
[17:35:17] It is different than the one we had with revscoring services
[17:35:44] should we ask WME to stop? At the moment the pods are really melting :D
[17:36:01] I guess it is not super easy to go multi-process right?
[17:36:20] it has been like this for days + I tested sth on Friday which didn't help
[17:37:21] elukey: no it isn't the same issue we had. iiuc this is coming from the predict step
[17:37:32] for now I'll merge this and revert my previous change https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1126061
[17:37:51] isaranto: the preprocess_ms is very high as well, this is why I thought it was similar
[17:37:54] I tried multiprocessing on the kserve level by adding a uvicorn worker.
[17:37:57] way more than predict_ms
[17:38:00] ack, makes sense
[17:38:56] hmm I guess it wouldn't hurt to try multiprocess for preprocess as well.
[17:39:19] it could alleviate the issue a bit, if it is cpu-bound as well
[17:39:28] no idea what those model servers do
[17:39:51] I need to go now, but I can check later if needed
[17:40:15] the SREs have acked the incident but it paged, and the alert for the api-gateway is still ongoing
[17:40:27] I'll drop a note to sync with you here
[17:41:38] (ttl!)
[17:41:44] FIRING: LiftWingServiceErrorRate: ...
LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-quality-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[17:43:05] I have quite some work to do on this, so it won't be solved now. For now I will revert the change
[17:43:35] (CR) Ilias Sarantopoulos: [C:+2] "merging to take care of alerts" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126061 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[17:44:06] this serves 2 models from the same pod so it also has to do with that. I am working on separating these services
[17:44:20] (Merged) jenkins-bot: reference-quality: revert multiple workers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126061 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[17:46:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[17:50:45] It looks like we started getting more traffic 40 mins ago, which is causing problems https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-quality-predictor.%2A
[17:56:27] Sorry, was out for dinner.
ping me if anything needs doing
[17:57:04] that is true, but I'll revert the change in any case as it didn't bring good results
[17:58:43] I just redeployed eqiad
[17:58:51] I will have to go in a bit
[18:08:43] the new deployment is up but the pods are stuck
[18:11:24] it is getting tons of errors from mwapi
[18:11:44] I do wonder if we hit a ratelimit or sth there
[18:12:47] hm I'm not aware
[18:12:57] unfortunately I have to go.
[18:13:18] klausman: could you bump the minreplicas to 4 or 8 on eqiad?
[18:13:36] I don't think it will solve it but ...
[18:13:37] will do. do you want me to do it all the way from dep-charts or live?
[18:21:58] I've done it live, 8 replicas running, min set to 4
[21:46:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[23:11:08] Machine-Learning-Team, EditCheck, Editing-team, VisualEditor: Evaluate efficacy of Peacock Check model output (internal review) - https://phabricator.wikimedia.org/T384651#10621517 (ppelberg)
[23:11:14] Machine-Learning-Team, EditCheck, Editing-team, VisualEditor: Evaluate efficacy of Peacock Check model output (internal review) - https://phabricator.wikimedia.org/T384651#10621519 (ppelberg)
[23:19:20] Machine-Learning-Team, EditCheck, Editing-team, VisualEditor: Evaluate efficacy of Peacock Check model output (internal review) - https://phabricator.wikimedia.org/T384651#10621526 (ppelberg)
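For reference, the recurring problem in this log (a CPU-bound preprocess step starving the asyncio ioloop, so health checks and other requests time out) is commonly mitigated by offloading the heavy step to a process pool instead of adding uvicorn workers. The sketch below shows that general technique only; all function names and the toy feature extraction are illustrative, not the actual inference-services code.

```python
import asyncio
import multiprocessing
from concurrent.futures import ProcessPoolExecutor
from typing import Optional

# One pool per server process, reused across requests. The "fork"
# context keeps this sketch self-contained on Unix; a real server
# would manage pool lifecycle and platform differences more carefully.
_POOL: Optional[ProcessPoolExecutor] = None


def _get_pool() -> ProcessPoolExecutor:
    global _POOL
    if _POOL is None:
        _POOL = ProcessPoolExecutor(
            max_workers=2, mp_context=multiprocessing.get_context("fork")
        )
    return _POOL


def cpu_bound_preprocess(text: str) -> dict:
    # Stand-in for the real CPU-heavy feature extraction. It must be a
    # top-level function so it can be pickled for the worker process.
    return {"length": len(text), "words": len(text.split())}


async def preprocess(payload: str) -> dict:
    # Run the heavy work in another process so the event loop stays
    # free to serve other requests (and health checks) meanwhile.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_get_pool(), cpu_bound_preprocess, payload)


if __name__ == "__main__":
    print(asyncio.run(preprocess("a small test revision")))
```

Unlike extra uvicorn workers, this keeps a single event loop responsive while the CPU work runs elsewhere; the trade-off is pickling overhead on arguments and results.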