[05:43:27] Machine-Learning-Team: Article Summary Generation and Evaluation Pipeline using vLLM image - https://phabricator.wikimedia.org/T395246#10878312 (kevinbazira) >>! In T395246#10876040, @isarantopoulos wrote: > I see some of the filenames are enclosed in either single or double quotes, while the [[ https://gerr...
[06:29:55] mooorning!
[06:44:13] georgekyz: I have changed the wikis we are going to rollout in the second batch and created this task https://phabricator.wikimedia.org/T395823
[06:44:30] I've updated your patch to reflect these changes https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1152682
[06:45:41] the only reason for the change was to sort them alphabetically. I was confused all the time when I was trying to remember in which batch each wiki belongs to and I think it would also confuse us during deployments
[06:55:31] Good morning
[06:59:29] good morning!
[06:59:39] good morning
[07:01:56] isaranto: Thnx for the change
[07:24:09] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Host an OpenVINO model in LiftWing - https://phabricator.wikimedia.org/T395012#10878406 (santhosh) Thanks @kevinbazira and @isarantopoulos for these details. Very useful information. Kevin, What you tried with openvino is low level openvino API. T...
[07:31:06] georgekyz: we can proceed with deploying the extension for this batch without the UI this week
[07:32:05] could you review the changes I made in your patch? (thresholds etc)
[07:32:23] isaranto: that's perfect, do you have any preference regarding the date and time ?
[07:36:04] any morning UTC backport window is ok with me. We could include one more person from the team to help with QA since it is 8 wikis we're talking about
[07:48:36] shall we plan for Thursday morning?
[07:50:32] yeap sure
[07:51:06] I will update the patch with the corresponding changes for the specific wikis and I will schedule it
[08:00:08] ok!
georgekyz the patch should be ok I did all the changes (unless I missed sth)
[08:03:30] alrighty
[08:05:44] FIRING: LiftWingServiceErrorRate: ...
[08:05:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:09:39] isaranto: scheduled
[08:10:34] \o/
[08:10:44] RESOLVED: LiftWingServiceErrorRate: ...
[08:10:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:13:35] bartosz: I invited you for the backport deployment process on Thursday, it would be nice to discuss and follow the process together as a nice onboarding task as well :P
[08:14:50] georgekyz: thank you, I'm very happy to join!
[08:23:06] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 4 others: [batch #2] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395823#10878496 (isarantopoulos)
[08:57:11] o/ klausman: is the PSS rollout on prod finished happily by now?
I was thinking about doing some prod deployments after lunch
[08:58:26] Yes, since Luca used the quiet day last week to do the last recycles, we're all good
[09:00:20] o/
[09:00:41] bartosz: please let me know if you encounter any weirdness, there shouldn't be any but.. :)
[09:04:37] sounds great, will let you know how it goes!
[10:19:41] * klausman lunch
[10:28:07] Machine-Learning-Team: Article Summary Generation and Evaluation Pipeline using vLLM image - https://phabricator.wikimedia.org/T395246#10878919 (kevinbazira) I have added an optional summary evaluation step to the [[ https://gitlab.wikimedia.org/repos/machine-learning/exploratory-notebook/-/blob/main/simple-...
[10:43:44] FIRING: LiftWingServiceErrorRate: ...
[10:43:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=zhwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:53:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:58:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[11:22:31] Ruh roh..
[11:25:09] we seem to be getting a lot of 500s for revertrisk and zhwiki-damaging
[11:25:09] https://grafana.wikimedia.org/goto/mpElACBHg?orgId=1
[11:25:15] in codfw
[11:25:39] It seems we're hitting 500 error code due to failure in fetching revisions
[11:26:15] `ERROR:root:An error has occurred while fetching revisions: [1293736750] (). Reason: 502, message='Bad Gateway', url=URL('http://mw-api-int-ro.discovery.wmnet:4680/w/api.php?action=query&formatversion=2&prop=revisions&revids=1293736750&rvprop=ids%7Ccontent%7Ccomment%7Ctimestamp%7Csize%7Cuserid%7Ctags&rvslots=main&format=json'`
[11:27:39] and it started at 10am UTC today https://grafana.wikimedia.org/goto/dJy60CBHR?orgId=1
[11:30:05] going to a meeting -- be back later
[11:30:51] klausman: could you help figure out if this is a networking issue (e.g. related to istio) or sth else?
[11:32:50] Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778#10879202 (isarantopoulos) p:Triage→High
[11:52:51] Looking
[11:55:09] https://phabricator.wikimedia.org/P76928 seems to indicate an APIGW or service-behind-APIGW problem
[12:01:37] seems to be hitting the mw-api directly though
[12:01:57] mostly WME afaics, getting the 500s
[12:02:09] I tried one of the reported-bad queries with curl from the deployment host and it seems to work
[12:05:33] klausman: I added a snippet to your paste, it is related to envoy on rr pods calling the mw api
[12:05:51] nothing suspicious except "route_name":"block_all",
[12:06:10] now IIUC it may be a domain that we haven't mapped/allowed in the service entry
[12:06:37] hmm, that's a good point
[12:06:52] I also added a working curl call that I adapted from a query that 500'd
[12:07:11] yeah see ,"authority":".wikipedia.org
[12:07:32] on a working log we have stuff like "authority":"ru.wikipedia.org"
[12:08:11] but a leading dot makes no sense, is that coming from the client?
[12:08:40] IIRC we set it when fetching the features, but it may come from something that the client sets in the json payload
[12:09:06] if you grep for "response_code":500 in istio-proxy you'll see that it is mostly WME calling us
[12:09:12] but we don't see the json payload
[12:09:24] Yeah, unfortunately, we don't know what the queries are
[12:09:32] ah yes in the kserve logs
[12:09:33] INFO:root:Unsupported lang: .
[12:10:08] so it seems a bug on our side, we should probably return 400s
[12:10:16] So this sounds to me as if we're either getting malformed queries, or we are not supporting something we should
[12:10:58] exactly yes, same feeling
[12:11:30] I bet they are not setting the "lang" field in json
[12:11:49] Or it's empty. Both of which we should throw 400s back for
[12:12:04] yep
[12:12:04] Machine-Learning-Team, Add-Link, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10879311 (OKarakaya-WMF) I've reproduced the results for `uzwiki` by using [akhatun/research-mwaddlink](https://gitlab.wikimedia.org...
[12:12:40] It's odd that we log unsupported lang, but then continue to make a call to MWAPI. Or maybe it's because we do it in parallel with async?
[12:13:35] no we don't, I think it is a check missing when reviewing the inputs
[12:14:04] I found some code, hang on
[12:14:19] https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/src/models/revert_risk_model/model_server/base_model.py#71 The code that generates the log message is here
[12:14:42] and it's called here https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/src/models/revert_risk_model/model_server/base_model.py#125
[12:14:54] But aside from logging the message, we do nothing with the information?!
[12:15:40] isaranto aiko: ^^^ Is this code working as intended?
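The failure mode pieced together above (an empty or missing `lang` that is logged but not rejected, followed by a call to a host with a leading dot) can be sketched as follows. This is a minimal illustration, not the inference-services code: `SUPPORTED_LANGS`, `check_supported_lang`, `build_authority`, `handle` and the `{lang}.wikipedia.org` host format are all assumptions made for the example.

```python
import logging

# Hypothetical subset of supported languages, for illustration only.
SUPPORTED_LANGS = {"en", "ru", "zh"}


def check_supported_lang(lang: str) -> None:
    """Log-only check: reports an unsupported lang but never stops the request."""
    if lang not in SUPPORTED_LANGS:
        logging.info("Unsupported lang: %s.", lang)


def build_authority(lang: str) -> str:
    """Build the host for the MediaWiki API call (assumed host format)."""
    return f"{lang}.wikipedia.org"


def handle(payload: dict) -> str:
    """Process a request payload and return the host that would be called."""
    lang = payload.get("lang", "")
    check_supported_lang(lang)    # logs, but does not raise
    return build_authority(lang)  # still runs when lang is empty
```

With `{"lang": "ru"}` this returns `ru.wikipedia.org`; with `{"lang": ""}` or `{}` it returns `.wikipedia.org`, which matches the leading-dot authority seen in the envoy log.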
[12:16:12] it is a bug, we are missing an InvalidInput
[12:16:48] You mean like on L.80?
[12:17:04] yep
[12:17:09] I can make a patch
[12:17:55] to me check_wiki_suffix and check_supported_wikis could be merged into one, raising the error etc..
[12:18:05] o/ back
[12:18:21] because then we do rev = await get_revision(session, rev_id, lang) without checking lang
[12:18:27] no bueno
[12:19:28] hmm I don't recall what should be the best approach here since we don't want to throw an InvalidInput error (400) in case there is an unsupported language as we want to use some default values
[12:19:49] the thing is that it needs to be a proper wiki language, so there should be some additional check
[12:21:03] For the moment, I think we can still raise an InvalidInput if it is empty
[12:21:18] here is the related change https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1023809 which indeed does miss the random language parameter
[12:21:42] ack
[12:21:46] agree I mean
[12:21:59] kevinbazira: could you take care of this alert?
[12:22:18] isaranto: sure sure
[12:22:28] thanks for investigating klausman and elukey !
[12:22:33] Feel free to send reviews my way
[12:22:44] thank you kevin 🙏
[12:33:31] I've added a silence for the alert
[12:35:55] thank you for troubleshooting this klausman and elukey. I've pushed a fix here: https://gerrit.wikimedia.org/r/1153139
[12:35:57] please review whenever you get a minute. thanks!
[12:38:40] on it
[12:40:30] +1 from me as well
[12:42:17] I came too late to the party with my -1
[12:42:29] as Ilias mentioned, we don't want to throw an InvalidInput error for unsupported language as we want to use some default values.
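The policy agreed on above (reject an empty `lang` with a 400-style error, but let a non-empty unsupported language through so default feature values can be used) could look roughly like this. It is a sketch under assumptions: `InvalidInput` here is a local stand-in class rather than the server's real error type, and `SUPPORTED_LANGS` and `validate_lang` are hypothetical names.

```python
class InvalidInput(ValueError):
    """Local stand-in for the model server's 400-style error (assumption)."""


# Hypothetical subset of languages the model was trained on.
SUPPORTED_LANGS = {"en", "ru", "zh"}


def validate_lang(lang):
    """Return (lang, use_defaults).

    Raises InvalidInput for a missing, non-string or blank lang. A non-empty
    lang that the model does not support is accepted, and the caller is told
    to fall back to default feature values.
    """
    if not isinstance(lang, str) or not lang.strip():
        raise InvalidInput("lang must be a non-empty string")
    lang = lang.strip()
    return lang, lang not in SUPPORTED_LANGS
```

Keeping the two cases separate is the point of the -1 on the first patch: raising for every unsupported language would break the default-value behaviour, while raising only for empty input does not. Checking the value against a canonical list of wikis, as suggested for the long term, would be a third, stricter layer.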
[12:43:10] (CR) Ilias Sarantopoulos: [C:-1] "for the languages that were initially unsupported by the model we want to use the default feature values," [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1153139 (owner: Kevin Bazira)
[12:43:11] we want to throw error for empty lang for now right?
[12:43:20] At a minimum, yes
[12:43:35] a long-term solution for this would be checking the languages against canonical-data.wikis https://gitlab.wikimedia.org/repos/movement-insights/canonical-data/-/blob/main/wiki/README.md
[12:44:46] to check if it is a proper wiki language
[12:44:48] isaranto: I think your -1 preempted the merge
[12:44:53] if we merge the above we might break existing functionality. So a quick fix would be to check for empty strings and then implement what aiko suggests above --^
[12:47:55] okok fixing ...
[12:48:29] yeah because that was the functionality enterprise asked
[12:53:44] FIRING: LiftWingServiceErrorRate: ...
[12:53:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-risk-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[13:05:36] https://phabricator.wikimedia.org/P76947 is logged for this error ^^^^
[13:06:26] (CR) Dreamy Jazz: Add revertrisk_score AbuseFilter variable (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/1152270 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[13:08:32] (CR) Dreamy Jazz: Add revertrisk_score AbuseFilter variable (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/1152270 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[13:09:22] Machine-Learning-Team, MediaWiki-extensions-ORES,
MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban), Patch-For-Review: [batch #1] Enable revertrisk filters in simplewiki & trwiki - https://phabricator.wikimedia.org/T395668#10879410 (isarantopoulos)
[13:15:04] thank you for providing the context aiko and isaranto. here is the correct patch: https://gerrit.wikimedia.org/r/1153143
[13:15:04] please take a look whenever you get a minute. thanks!
[13:17:52] Added a full backtrace to https://phabricator.wikimedia.org/P76947 --- I am not sure if this is related to the earlier error (leaning towards no)
[13:20:45] ty!
[13:35:23] (CR) AikoChou: [C:+1] "LGTM! Thanks for taking care of this :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1153143 (owner: Kevin Bazira)
[13:35:54] klausman: I re-deployed the reference-risk model with new docker image around 1h ago, this might be related
[13:36:37] thanks for the review Aiko. isaranto: should we proceed with a merge?
[13:37:16] bartosz: the IDNA error makes me wonder if this is new behavior that used to be quiet
[13:37:48] I'm in a meeting
[13:37:56] okok
[13:38:04] but I think we can just change a check that we have in the function check_input params
[13:38:50] kevinbazira: I think we also need to add the check to batch_model.py > https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/src/models/revert_risk_model/model_server/batch_model.py#153
[13:38:51] from `if value is None:` to `if not value:` cause the first one wouldn't capture an empty string
[13:39:22] RRLA uses batch_model.py
[13:39:30] that would be line 18 in preprocess_utils.py and it would fix things
[13:39:31] (PS7) Máté Szabó: Add revertrisk_score AbuseFilter variable [extensions/ORES] - https://gerrit.wikimedia.org/r/1152270 (https://phabricator.wikimedia.org/T364705)
[13:39:40] (CR) Máté Szabó: Add revertrisk_score AbuseFilter variable (2 comments) [extensions/ORES] -
https://gerrit.wikimedia.org/r/1152270 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[13:39:57] (CR) AikoChou: RRLA: Raise invalid input error for empty langs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1153143 (owner: Kevin Bazira)
[13:41:20] (CR) AikoChou: "ahhh wait" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1153143 (owner: Kevin Bazira)
[13:41:31] `if not value` also captures empty string ""
[13:42:05] yes but now we are using `if value is None` which doesn't
[13:42:06] klausman: Could it be that it's the same error in fetching revision, but IDNA cannot print such long URL in the error message?
[13:42:36] No idea, I am not familiar with what IDNA is trying to do there
[13:44:18] isaranto: yep, you're right! we can just change the preprocess_utils
[13:44:41] on it
[13:48:50] klausman: my suspicion would be on this open cpython issue: https://github.com/python/cpython/issues/77139. not sure why would we see it only now tho
[13:50:05] done. a couple of model-servers depend on `python.preprocess_utils` so a number of tests are running
[13:51:09] bartosz: maybe this bug was introduced with the recent Python interpreter updates?
[13:52:14] mh, come to think of it, this may be the same problem as the earlier error.
[13:52:26] IDNA complains about too long a label or _an empty one_
[13:52:55] A label in this context is a segment of a domain name, e.g. "www" and "org" in www.wikimedia.org
[13:53:29] if the language we get from the client is empty, we might construct a domain name like .wikimedia.org, which has an empty label at the start.
[13:54:33] And they're happening at the same time since some WMDE client has started using these services, and is not providing a language
[13:55:25] I see, this makes more sense now
[13:56:02] I am not 100% sure this is what's happening, but it seems likely.
[13:56:13] (CR) Ilias Sarantopoulos: [C:+1] "LGTM, Thank you!"
[machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1153143 (owner: Kevin Bazira)
[14:03:44] RESOLVED: LiftWingServiceErrorRate: ...
[14:03:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-risk-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[14:30:41] Machine-Learning-Team, Goal: Q4 24-25 Goal: Operational Excellence - LiftWing Platform Updates & Improvements - https://phabricator.wikimedia.org/T391943#10879698 (isarantopoulos)
[14:40:11] Machine-Learning-Team, Goal: Q4 24-25 Goal: Operational Excellence - LiftWing Platform Updates & Improvements - https://phabricator.wikimedia.org/T391943#10879781 (isarantopoulos)
[14:40:47] Machine-Learning-Team, Goal: Q4 24-25 Goal: Operational Excellence - LiftWing Platform Updates & Improvements - https://phabricator.wikimedia.org/T391943#10879786 (isarantopoulos)
[14:42:03] Machine-Learning-Team, Goal: Q4 24-25 Goal: Operational Excellence - LiftWing Platform Updates & Improvements - https://phabricator.wikimedia.org/T391943#10879791 (isarantopoulos)
[14:42:34] Machine-Learning-Team, Goal: Q4 24-25 Goal: Operational Excellence - LiftWing Platform Updates & Improvements - https://phabricator.wikimedia.org/T391943#10879792 (isarantopoulos)
[15:11:13] (CR) CI reject: [V:-1] LiftWingService: Add tests [extensions/ORES] - https://gerrit.wikimedia.org/r/1152267 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[15:11:40] (CR) Dreamy Jazz: "recheck" [extensions/ORES] - https://gerrit.wikimedia.org/r/1152267 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[15:11:55] (PS4) Kevin Bazira: RRLA: Raise invalid
input error for empty langs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1153143
[15:12:00] (CR) Dreamy Jazz: [C:+2] "Tests pass -> LGTM" [extensions/ORES] - https://gerrit.wikimedia.org/r/1152267 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[15:12:28] (CR) Dreamy Jazz: LiftWingService: Add tests [extensions/ORES] - https://gerrit.wikimedia.org/r/1152267 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[15:12:41] (CR) Dreamy Jazz: [C:+2] LiftWingService: Add tests [extensions/ORES] - https://gerrit.wikimedia.org/r/1152267 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[15:18:24] (CR) Ilias Sarantopoulos: "it seems like we have removed pytest-asyncio in https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1147008" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1153143 (owner: Kevin Bazira)
[15:18:45] (CR) Ilias Sarantopoulos: [C:+1] RRLA: Raise invalid input error for empty langs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1153143 (owner: Kevin Bazira)
[15:22:02] (CR) Kevin Bazira: "thank you Aiko and Ilias for the reviews!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1153143 (owner: Kevin Bazira)
[15:23:55] kevinbazira: thank you so much for fixing this!
[15:24:24] thanks for the reviews Ilias! going to merge ...
[15:25:32] (CR) Kevin Bazira: [C:+2] "yep, that's a mystery. going to merge ..."
[machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1153143 (owner: Kevin Bazira)
[15:29:01] (Merged) jenkins-bot: LiftWingService: Add tests [extensions/ORES] - https://gerrit.wikimedia.org/r/1152267 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[15:29:02] (Merged) jenkins-bot: LiftWingService: Unify request creation [extensions/ORES] - https://gerrit.wikimedia.org/r/1152268 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[15:35:14] (Merged) jenkins-bot: RRLA: Raise invalid input error for empty langs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1153143 (owner: Kevin Bazira)
[15:54:26] (CR) Kevin Bazira: [C:+2] "thank you for the context. this has been fixed in: https://gerrit.wikimedia.org/r/1153143" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1153139 (owner: Kevin Bazira)
[15:57:06] * isaranto afk!
[17:09:14] Machine-Learning-Team, Goal: Q4 24-25 Goal: Productionize peacock detection model - https://phabricator.wikimedia.org/T391940#10880817 (achou)
[17:11:22] ---^ added more info in the task description
[17:19:25] (PS1) Nik Gkountas: Add "page-collection-groups" endpoint [research/recommendation-api] - https://gerrit.wikimedia.org/r/1153311 (https://phabricator.wikimedia.org/T374695)
[17:19:26] (PS1) Nik Gkountas: Add support for fetching collection group recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1153312 (https://phabricator.wikimedia.org/T374695)
[17:20:48] (CR) CI reject: [V:-1] Add support for fetching collection group recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1153312 (https://phabricator.wikimedia.org/T374695) (owner: Nik Gkountas)
[17:24:00] (PS2) Nik Gkountas: Add support for fetching collection group recommendations [research/recommendation-api] -
https://gerrit.wikimedia.org/r/1153312 (https://phabricator.wikimedia.org/T374695)
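Two points from the 13:38 to 13:53 exchange are easy to check in plain Python: `if value is None` does not catch an empty string while `if not value` does, and a host name that starts with a dot has an empty first label, which Python's built-in `idna` codec refuses to encode. The sketch below uses only the standard library; the helper names are hypothetical and none of this is taken from `preprocess_utils.py`.

```python
def is_missing_strict(value) -> bool:
    """The original style of check: only None counts as missing."""
    return value is None


def is_missing_falsy(value) -> bool:
    """The proposed style of check: None and "" both count as missing."""
    return not value


def idna_encodable(host: str) -> bool:
    """Report whether the built-in idna codec accepts this host name."""
    try:
        host.encode("idna")
        return True
    except UnicodeError:
        # Raised for an empty label or an overly long one.
        return False
```

One caveat with `not value`: it also treats `0`, `False` and empty containers as missing, so it suits string fields like `lang` better than numeric ones.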