[02:15:54] (03CR) 10Kevin Bazira: [C:03+2] "Thank you for working on this, Andrew. LGTM!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1189552 (https://phabricator.wikimedia.org/T403664) (owner: 10Ottomata)
[02:20:14] (03Merged) 10jenkins-bot: Bump version of mediawiki/page/prediction_classification_change event schema [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1189552 (https://phabricator.wikimedia.org/T403664) (owner: 10Ottomata)
[03:20:59] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[04:41:38] o/ ottomata thanks for the patch, we'll deploy this now
[04:41:55] thanks Kevin for the review!
[05:03:42] folks I'm merging this patch to make the 500s go away. https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1189626
[05:04:41] I'll open follow-up tasks for this so that we can discuss, as it isn't a desired behavior and we should treat it differently
[05:08:40] isaranto: o/ ack
[05:10:14] kevinbazira: ah sorry, wasn't sure if anybody was around so I self-reviewed the image versions
[05:10:23] np :)
[05:14:18] I've deployed the 3 services that use eventstreams and all 500s seem to have stopped, so alerts will also be resolved soon
[05:14:34] https://grafana.wikimedia.org/goto/8isb3bjHR?orgId=1
[05:15:44] RESOLVED: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[05:17:00] I've double-checked them all on the istio dashboards and everything is fine
[05:17:53] an interesting thing was that in revertrisk we didn't get an alert, as the rate of 500s wasn't above the threshold. I don't recall what that threshold is but I'll look into it later
[05:18:21] aaannd good morning :D
[05:18:36] * isaranto afk be back in 1h
[05:22:02] Maybe a bit more, but hope things are fine!
[05:32:01] yep, things seem fine. thank you for deploying this change!
[06:52:45] 06Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778#11196382 (10elukey) >>! In T394778#11193036, @isarantopoulos wrote: > Following up on this task as it is quite important for the proper utilization of the new GPU hosts. >>>! In T394778#...
[07:00:12] back!
[07:03:30] 06Machine-Learning-Team, 06Data-Persistence, 10Data-Persistence-Design-Review, 06Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11196403 (10dcausse) >>! In T401021#11195972, @KStoller-WMF wrote: >>>! In T4010...
[07:22:10] 06Machine-Learning-Team, 06Data-Persistence, 10Data-Persistence-Design-Review, 06Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11196439 (10achou) >>! In T401021#11196403, @dcausse wrote: > I guess it all dep...
[07:37:42] morning :)
[07:41:58] isaranto, kevinbazira: thanks for taking care of the fix for page_prediction events!
[07:48:50] It's a pleasure!
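For context on the failure mode above: Event Platform events embed the schema version they conform to, and EventGate validates every event against exactly that schema, so a stream schema bump has to be matched by the service emitting the events. A minimal sketch of that mechanism; the payload fields and stream name below are illustrative assumptions, not the real prediction_classification_change event:

```python
# Hypothetical event shape, not the actual schema. The point is the
# $schema field: EventGate resolves this URI in the schema repository
# and validates the event against that exact version, so a mismatch
# between the declared version and the payload (e.g. after a schema
# bump) surfaces as validation errors and the 5xx responses seen in
# the incident above.
event = {
    "$schema": "/mediawiki/page/prediction_classification_change/1.1.0",
    "meta": {"stream": "mediawiki.page_prediction_classification_change"},
    "prediction": {"model_name": "example-model", "probability": 0.97},
}
```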
[07:49:28] it was great that Andrew found it and fixed it, helped a ton
[08:11:28] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[08:14:02] (03CR) 10Bartosz Wójtowicz: outlink-topic-model: Merge transformer and predictor pods. (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1187739 (https://phabricator.wikimedia.org/T404294) (owner: 10Bartosz Wójtowicz)
[08:35:43] 10Lift-Wing, 06Machine-Learning-Team: Event streams schema change - https://phabricator.wikimedia.org/T405067 (10isarantopoulos) 03NEW
[08:36:29] 10Lift-Wing, 06Machine-Learning-Team: prediction_classification_change stream schema change causes model server failures - https://phabricator.wikimedia.org/T405067#11196573 (10isarantopoulos)
[08:37:25] 10Lift-Wing, 06Machine-Learning-Team: prediction_classification_change stream schema change causes model server failures - https://phabricator.wikimedia.org/T405067#11196574 (10isarantopoulos) The above incident has already been resolved by the following patches: https://gerrit.wikimedia.org/r/c/machinelearnin...
[08:37:41] I created the above task that has all the information
[08:52:57] Morning!
[08:54:18] So I thought about the admin_ng discrepancies some more, and on second thought, I think keeping the higher limit ranges in staging is the better approach. When moving from staging to prod, we need to take a look at resource consumption anyway (due to different load), so adjusting from the looser limits in staging to the tighter ones in prod needs to be done anyway.
[08:58:53] the risk is that any load test or similar done in staging is not applicable to production due to various constraints, and/or that some subtle issues are not caught when testing in staging but only in prod
[09:01:35] but let's see how it goes
[09:01:51] Now I'm on the fence again :D
[09:02:25] 10Lift-Wing, 06Machine-Learning-Team: prediction_classification_change stream schema change causes model server failures - https://phabricator.wikimedia.org/T405067#11196646 (10isarantopoulos) Following up on this, also the ValidationErrors on EventGate have disappeared after the deployment (source: [[ https://g...
[09:02:48] nah we can try this road, if we see that it leads to some troubles we can adapt
[09:03:34] at this point the missing bit is what we want to do after a model transitioned to prod
[09:03:54] because at that point it got the resource limits - should they also be applied to staging at that point?
[09:04:31] (in theory this will be done before going to prod in the future, but we have various services now in staging and I don't know if all of them have the proper restrictions)
[09:05:48] I think that the prod limits, once deemed appropriate, should be applied to staging as well, unless there is a pressing (and documented!) reason to have different ones.
[09:06:43] (also, I'll be applying the admin_ng changes for staging in a moment)
[09:09:05] makes sense, yes. Thinking out loud - prod has some base limits, and staging has other ones (more lenient and permissive, etc.). I think, but I am not 100% sure, that staging gets prod's limits (like in this case) only if we specifically override/set the limits in prod's yaml file
[09:13:32] it makes sense to have some higher limits in staging to be able to experiment with various resource configurations & run load tests. This will give us information about what kind of resources we are going to use in prod
[09:22:32] I did a quick test from the deployment server, and some services would have gotten smaller limit ranges. I'll pastebin the diff in a sec
[09:23:33] https://phabricator.wikimedia.org/P83440
[09:24:07] lines 74-77 are a c&p error on my part
[09:27:29] and fixed.
[09:44:52] elukey: you think it's ok to push the noop bits to prod on a Friday?
[09:48:15] (03CR) 10Gkyziridis: [C:03+1] "LGTM!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1187739 (https://phabricator.wikimedia.org/T404294) (owner: 10Bartosz Wójtowicz)
[09:49:11] bartosz: +1'd the patch. Thnx for working on that one!
[09:50:14] klausman: yep
[09:50:36] but you can do it on Monday as well
[09:50:37] no rush
[09:56:13] georgekyz: thanks a lot!
[10:01:28] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[10:01:57] elukey: I'd rather get those ^^^ silenced :)
[10:03:45] yep np :)
[10:06:28] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[10:21:49] dropping here something interesting on data modelling https://wikitech.wikimedia.org/wiki/Data_Platform/Data_modeling_guidelines#WMF-specific_Conventions
[10:40:30] (03CR) 10Bartosz Wójtowicz: [C:03+2] "Going forward with this one!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1187739 (https://phabricator.wikimedia.org/T404294) (owner: 10Bartosz Wójtowicz)
[10:49:52] (03Merged) 10jenkins-bot: outlink-topic-model: Merge transformer and predictor pods. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1187739 (https://phabricator.wikimedia.org/T404294) (owner: 10Bartosz Wójtowicz)
[11:01:28] FIRING: [4x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[11:02:36] Welp, the _dashboard_ (alerts.wm.o) says they stopped firing
[11:02:47] * klausman lunch
[11:03:34] nice, thanks Tobias!
[11:06:28] RESOLVED: [4x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[11:07:11] there we go :)
[11:18:59] After merging my patch, the `revertrisk-wikidata` and `nsfw` postmerge image build pipelines failed with: `3.022 E: The repository 'http://mirrors.wikimedia.org/debian bullseye-backports Release' does not have a Release file.` as those are the only 2 remaining model pipelines we have in our repo
[11:19:35] I think we expect this as those are non-production models, but should we somehow prevent this from happening by e.g. cleaning up those models and/or their CI pipelines?
[11:20:22] ^^ meant that those are the 2 remaining bullseye models in our repo
[11:26:47] we should definitely remove the nsfw pipeline as we don't use it. We could also remove the rrwikidata one, and add it again when we actually productionize the latest model
[11:28:17] I'm not sure what the errors actually refer to though
[11:31:56] 06Machine-Learning-Team, 05Goal: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review - https://phabricator.wikimedia.org/T392833#11197024 (10BWojtowicz-WMF) **Weekly Report** Summary of progress: 1. The cache design has been posted to review for the Data Persistenc...
[11:37:42] 10Lift-Wing, 06Machine-Learning-Team: Remove old nsfw model from inference-services repo - https://phabricator.wikimedia.org/T405083 (10isarantopoulos) 03NEW
[11:43:29] (03PS1) 10Ilias Sarantopoulos: nsfw: remove blubber images and code [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1189846 (https://phabricator.wikimedia.org/T405083)
[11:43:51] 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: Remove old nsfw model from inference-services repo - https://phabricator.wikimedia.org/T405083#11197060 (10isarantopoulos) a:03isarantopoulos
[11:44:59] 10Lift-Wing, 06Machine-Learning-Team: prediction_classification_change stream schema change causes model server failures - https://phabricator.wikimedia.org/T405067#11197062 (10isarantopoulos) a:03isarantopoulos
[11:45:58] I'm looking for a small review to deploy the articletopic model on staging using only the predictor pod 🥺 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1189839
[11:47:10] let's go!
[11:47:15] +1 from me
[11:47:18] (03CR) 10Bartosz Wójtowicz: [C:03+1] "LGTM, thank you for tackling this so swiftly <3" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1189846 (https://phabricator.wikimedia.org/T405083) (owner: 10Ilias Sarantopoulos)
[11:47:26] I also added some patches for the nsfw model
[11:49:30] thank you! already +1'd the patch removing the nsfw model :D
[11:51:34] I'll wait for the integration/config changes to be merged first https://gerrit.wikimedia.org/r/c/integration/config/+/1189843
[12:00:26] It seems that my previous patch was not enough to remove the transformer from staging, as we still inherit its deployment configuration from the main (production) values file, where we want to keep it for now. I'm wondering if this would make sure we remove this pod from staging: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1189850
[12:21:57] well it seems like it works!
[12:22:01] https://integration.wikimedia.org/ci/job/helm-lint/27189/console
[12:26:29] bartosz: I left a comment regarding the autoscaling config. Other than that LGTM
[12:27:09] isaranto: very cool, forgot that helm lint shows a nice diff like this! already pushed a patchset removing the unnecessary line in the autoscaling config
[12:29:20] +1
[12:33:09] thank you! merging it
[12:39:34] hmm, an interesting thing happened after syncing - it deployed a new version of the predictor and terminated the old predictor, but the old transformer is still running
[12:42:40] however, the new deployment seems to work! I can query the model and all processing is done via the newly-deployed predictor
[12:44:20] should we terminate the old transformer deployment manually in this case?
[12:52:29] yep. klausman could you do that?
[12:52:44] delete the pod in the articletopic-outlink namespace in ml-staging-codfw
[12:53:23] I think we need to go for removing the full deployment `outlink-topic-model-transformer-default-00030-deployment` 🥺
[12:53:43] yeah I also see a revision for that, outlink-topic-model-transformer-default-00030
[12:53:44] hmm
[12:56:53] yeah the transformer component is in the isvc definition. So all we did was create a new revision for the predictor, but the previous revision of the transformer is also there
[12:57:21] (03PS1) 10Sbisson: Downgrade unknow prefix from error to debug [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1189865 (https://phabricator.wikimedia.org/T404976)
[13:07:01] isaranto: Do you know if it's possible to remove one of the existing components by changing the helm configuration and running `helm sync`? Or will it require we re-deploy the isvc altogether?
[13:09:36] we'd have to alter the inferenceservice object that is created. Since the requests are routed through the predictor I wouldn't do anything else, and then we can just remove all the transformer config from the values.yaml and deploy on prod on Monday
[13:12:06] I see! Will run the load tests on staging and we can deploy on prod on Monday
[13:12:35] though I don't see why the same issue won't occur in prod if we delete all transformer config from values.yaml
[13:13:06] Or do you mean that it will, and we will just alter the isvc config manually?
[13:16:36] it won't occur because it will be deleted, so there will be no transformer object. Right now the transformer is declared in values.yaml, so iiuc by providing a null value we remove the pod but not the transformer config that is defined in the inferenceservice.
[13:17:18] this whole confusion is caused because the inference services declared in the yaml are actually a dict, so the result is the merged dict of values.yaml and values_staging.yaml
[13:19:49] hmm I think I see, so if there were a way to delete the `transformer` key when merging the dicts instead of overwriting its value to `null`, the config in staging would also not include the transformer anymore
[13:20:06] thank you for elaborating!
[13:22:05] isaranto: sorry, got lost in YAML. What needs terminating? outlink-topic-model-transformer-default-00030-deployment?
[13:22:40] (03PS1) 10Sbisson: Warn when a page collection contains no valid links [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1189872 (https://phabricator.wikimedia.org/T404976)
[13:22:46] klausman: perhaps nothing, I think we can leave that for Monday
[13:23:32] I mean, I can do it now. It's not prod
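To make the merge semantics discussed above concrete: a minimal sketch, assuming a naive recursive merge like the one performed when layering values.yaml and values_staging.yaml (the keys below are illustrative, not the actual chart structure):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value  # a None value overwrites, but the key survives
    return merged

# values.yaml (prod) declares both components; values_staging.yaml tries to
# drop the transformer by nulling it out.
prod = {"predictor": {"replicas": 1}, "transformer": {"replicas": 1}}
staging_override = {"transformer": None}

print(deep_merge(prod, staging_override))
# {'predictor': {'replicas': 1}, 'transformer': None}
```

The `transformer` key is still present, just null, so what gets rendered into the isvc depends on how the templates guard on it; deleting the key from values.yaml itself is the only way to make the component truly absent.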
[13:23:42] bartosz: if we changed from dict to list/array I think we'd fix this, but then we'd have to declare all the values in the staging.yaml
[13:24:50] klausman: bartosz now that I've rethought this, I believe it is best to leave it and see if the transformer will go away when we totally remove everything from values.yaml. I created a patch and set it as WIP https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1189871
[13:24:59] wdyt?
[13:25:18] sgtm
[13:25:52] oh, btw, the ores-legacy app would have an update if synced, since there's a new version of the Python webapp.
[13:25:57] perhaps it won't, so we'll know that we'd have to remove it manually from prod as well
[13:26:07] yeah, testing that is a good idea
[13:30:13] 06Machine-Learning-Team, 07Essential-Work: Enable alerts for outdated admin_ng charts for ml-team - https://phabricator.wikimedia.org/T403047#11197482 (10klausman) 05Open→03Resolved p:05Triage→03Low This has been deployed and confirmed working.
[13:32:29] isaranto: thank you for creating the patch! I agree that it'll be nice to check if it goes away when we change the main values as well. Let's deploy and test on Monday
[13:53:41] 06Machine-Learning-Team, 06Data-Persistence, 10Data-Persistence-Design-Review: Data Persistence Design Review: Article topic model caching - https://phabricator.wikimedia.org/T402984#11197557 (10isarantopoulos) >>! In T402984#11193553, @Ottomata wrote: > Suggestion to standardize wiki differentiation on `wik...
[13:59:53] 06Machine-Learning-Team, 05Goal, 13Patch-For-Review: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11197581 (10OKarakaya-WMF) ###__**Reporting (19/09/2025)**__ **Progress update on the hypothesis for the week, including if...
[14:40:24] 06Machine-Learning-Team, 06Data-Persistence, 10Data-Persistence-Design-Review, 06Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11197656 (10achou) >>! In T401021#11190788, @Ottomata wrote: > - `wiki` - I pre...
[15:15:21] 06Machine-Learning-Team, 05Goal, 07OKR-Work: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11197715 (10achou) **Weekly Report** Progress update on the hypo...
[15:51:21] isaranto, klausman - FYI https://gerrit.wikimedia.org/r/c/operations/puppet/+/1189886
[15:51:45] so the SRE team is restoring a limit for docker layers, that is 4.5GB compressed
[15:52:04] the main reason is that the swift backend, used to store the binaries, doesn't support anything bigger than 5GB
[15:52:35] we are testing the new Ceph backend, apus, that should circumvent the problem, but for now 5GB is a hard limit
[15:52:43] let's discuss options on Monday
[15:52:51] Okkk thanks for mentioning it
[15:53:07] Going afk just now, but I'll review later
[18:44:22] * isaranto sighs
[18:44:42] I hope we don't face issues with the newer versions of the rocm/vllm images
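Since the 4.5GB cap comes from Swift's 5GB object limit, one way to see whether an image (for example a large ROCm/vLLM base) would trip it is to inspect the compressed layer sizes in its registry manifest. A hypothetical helper, assuming anonymous access to a Registry v2 endpoint; the registry host and image name below are placeholders, and a real registry may also require an auth token:

```python
import requests

LAYER_CAP = int(4.5 * 1024**3)  # proposed per-layer cap, in bytes

def oversized_layers(registry: str, image: str, tag: str) -> list[dict]:
    """Return manifest layers whose compressed size exceeds LAYER_CAP."""
    # Registry v2 manifests report each layer's compressed blob size.
    resp = requests.get(
        f"https://{registry}/v2/{image}/manifests/{tag}",
        headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"},
        timeout=10,
    )
    resp.raise_for_status()
    return [l for l in resp.json().get("layers", []) if l["size"] > LAYER_CAP]

# Placeholder registry and image name, not a real WMF endpoint.
for layer in oversized_layers("registry.example.org", "ml/vllm-base", "latest"):
    print(layer["digest"], layer["size"])
```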