[07:10:03] Lift-Wing, Machine-Learning-Team, OKR-Work: Create event stream for article-country model-server hosted on LiftWing - https://phabricator.wikimedia.org/T382295#10527740 (kevinbazira) Thanks @Ottomata and @dcausse for the confirmation. The article-country model-server that supports both streams has be...
[07:54:18] Lift-Wing, Machine-Learning-Team, OKR-Work: Create event stream for article-country model-server hosted on LiftWing - https://phabricator.wikimedia.org/T382295#10527773 (kevinbazira)
[07:57:51] Lift-Wing, Machine-Learning-Team, OKR-Work: Stop publishing events without article-country predictions - https://phabricator.wikimedia.org/T385771 (kevinbazira) NEW
[08:31:50] hello folks!
[08:44:33] (PS1) Kevin Bazira: article-country: don't publish events when prediction results are empty [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1117840 (https://phabricator.wikimedia.org/T385771)
[09:00:43] Lift-Wing, Machine-Learning-Team, OKR-Work: Create event stream for article-country model-server hosted on LiftWing - https://phabricator.wikimedia.org/T382295#10527889 (Ottomata) It's beautiful! https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:eventstreams,mediawiki.article...
[09:08:56] good morning
[09:10:39] Kalimera!
[09:40:03] Buongiorno!
[10:02:55] I'm building the rocm base image and my laptop is almost frozen
[10:13:53] :(
[10:16:15] Last night the simulation ran correctly and produced some results which look ok (at least no padding words).
[10:16:15] I can now load the model and generate the same results. I also found a better way to filter the training dataset.
[10:16:15] Check this paste if you have time: https://phabricator.wikimedia.org/P73142#293796
[10:26:00] it seems that the losses are too high
[10:27:01] did you figure out the issue with the padding tokens?
[10:33:22] does anyone know the maximum cpu/memory we can allocate in a namespace? I tried to deploy ref-quality in staging with more cpus and memory, but there was no response after sync
[10:33:35] I'm wondering if this might be because it exceeds the resource limitations
[10:37:44] aiko: check kubectl get events for the namespace, if you crossed the limits you'll see an error there
[10:39:02] isaranto: Nope, I do not understand... I can now load the quantized model and reproduce exactly the same results. I could not find what happened yesterday
[10:39:47] elukey: thanksss!
[10:40:04] indeed it crossed the limits
[10:40:10] "maximum cpu usage per Container is 8, but limit is 12, maximum cpu usage per Pod is 10, but limit is 15"
[10:44:37] the limits are under deployment-charts' admin_ng; the values.yaml file for ml-serve should list all the namespaces and their config
[10:44:45] if there is no config for a namespace, it gets the default
[10:44:52] that is defined in admin_ng for all clusters
[10:44:59] I need to go but ping me later if you have troubles
[10:45:29] I can check with Aiko, Luca, thanks for the help!
[10:47:49] aiko: you can check kubectl get resourcequota to see the namespace quotas
[10:51:40] Apart from the resource quotas, the limitranges are important. There is a specific limit per pod and per container. The defaults I see don't specify cpu, so iiuc it just gets the default, which is 8
[10:51:47] ```
[10:51:47] maximum cpu usage per Container is 8, but limit is 12, maximum cpu usage per Pod is 10, but limit is 15
[10:51:47] ```
[10:52:55] isaranto: right!
[10:52:57] So if we need to bump this up, we'll need to add a limitrange entry under the revision-models key in the yaml, similar to what we have for article-descriptions here https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/values/ml-serve.yaml#225
[10:54:30] so there are 4 things to take into consideration:
[10:54:31] 1. cpu per container (limitrange)
[10:54:31] 2. cpu per pod (limitrange)
[10:54:31] 3. total cpu of namespace (resourcequota)
[10:54:31] 4. total cpu of node
[10:54:50] and, as is reasonable, 4 > 3 > 2 > 1
[10:55:21] ty!! I checked the resourcequota, revision-models ns has 90 cpus and 100Gi mem. definitely enough. so we need to change limitranges for container and pod
[10:55:41] I'll file a patch for that!
[10:56:08] ok, I can review anytime. However we'll need Tobias to deploy it in the admin_ng ns
[10:57:38] ack
[11:08:43] Morning!
[11:09:00] I can also assist with the limits (and deploying them)
[11:20:56] the patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1117873
[11:21:10] Looking
[11:22:49] So the intent is to give the rev-models NS the same amount of resources as article-desc?
[11:27:31] aiko: I've added a c&p of what the production diff would look like to the Gerrit review
[11:29:14] o/ Tobias
[11:33:45] yeah I think 16/24 are reasonable numbers as we ask 12 for ref-quality
[11:34:00] for now
[11:34:45] roger
[11:35:09] I can +2 for merge unless Ilias wants to add anything
[11:41:29] Alright, proceeding
[11:46:04] aiko: Isn't 44 Gi per pod too much?
[11:48:52] maybe.. for mem I just followed what article-desc has
[11:50:04] That one was specific because the model was big
[11:50:05] ahh it's merged
[11:50:38] I can file another patch to change that
[11:50:40] No worries, we can check memory usage and put something more reasonable
[11:51:10] ok!
[11:51:22] ack, will push to the clusters in a moment
[11:54:05] ok, staging done
[11:54:24] aiko: did only staging bump into the limits so far, or prod as well?
[11:54:53] (or put another way: should I also push to prod immediately?)
[12:08:29] klausman: yep, I haven't deployed to prod, but there'd be the same issue there
[12:08:33] I think sth like 10Gi per pod would be enough, but we can wait to see the resource usage first
[12:09:15] the resourcequota in the ns is 100Gi so 44Gi per pod wouldn't work if we have 3 replicas
[12:09:40] it is not likely to happen, but it is good to have limitranges that make some sense
[12:16:19] thanks for taking care of that Aiko & Tobias
[12:22:43] ack, pushed to all three clusters
[12:22:56] * klausman lunch (but I'm in shouting distance)
[12:58:27] * isaranto lunch!
[14:51:08] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Quantize aya-expanse-32B with GPTQ (GPTQModel) - https://phabricator.wikimedia.org/T384734#10528898 (isarantopoulos) Building a wheel from source with the latest release (v.1.7.4) is successful. However building it from the tip of the main branch fa...
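
For reference, the four levels discussed above can be checked directly on the cluster. The sketch below reads the ResourceQuota (level 3) and LimitRange (levels 1 and 2) objects of the revision-models namespace with the official `kubernetes` Python client instead of kubectl; the kubeconfig handling and output format are illustrative assumptions, not part of the team's tooling.

```python
# Sketch: read the ResourceQuota and LimitRange objects discussed above with
# the official `kubernetes` Python client (kubectl equivalents:
# `kubectl get resourcequota` and `kubectl describe limitrange`).
from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig/context for the target cluster
v1 = client.CoreV1Api()
namespace = "revision-models"

# Level 3: total cpu/memory the namespace may claim.
for rq in v1.list_namespaced_resource_quota(namespace).items:
    print("resourcequota", rq.metadata.name, "hard:", rq.status.hard, "used:", rq.status.used)

# Levels 1 and 2: per-container and per-pod ceilings.
for lr in v1.list_namespaced_limit_range(namespace).items:
    for item in lr.spec.limits:
        print("limitrange", lr.metadata.name, item.type, "max:", item.max)
```
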
[14:52:50] forgot to update you folks - ml-staging-codfw is running with the new knative, injecting security settings into (most of) the containers of pods being created
[14:53:10] it shouldn't cause any trouble; in case it does, please ping me or report the issue in https://phabricator.wikimedia.org/T369493
[14:53:30] sadly the work is not yet complete, I need to find a way to inject the security context also into the istio containers
[14:53:50] once done, we'll be ready to upgrade to the new k8s version anytime
[14:59:50] ack, thanks Luca!
[15:57:20] good morning all
[16:08:17] o/ whenever you get a minute please review: https://gerrit.wikimedia.org/r/1117840
[16:11:37] o/ Chris
[16:12:21] (CR) Ilias Sarantopoulos: [C:+1] article-country: don't publish events when prediction results are empty [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1117840 (https://phabricator.wikimedia.org/T385771) (owner: Kevin Bazira)
[16:16:50] (CR) Kevin Bazira: [C:+2] "thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1117840 (https://phabricator.wikimedia.org/T385771) (owner: Kevin Bazira)
[16:17:36] (Merged) jenkins-bot: article-country: don't publish events when prediction results are empty [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1117840 (https://phabricator.wikimedia.org/T385771) (owner: Kevin Bazira)
[17:07:27] all right I think I found a way to fix the last containers
[17:07:29] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1117939
[17:07:51] worked on ml-staging enforcing the PSS restricted profile
[17:10:02] I am off till next week, and then there is the SRE summit, so I'll merge it sometime next week :)
[17:10:05] o/
[17:17:31] \o
[17:35:33] \o
[17:35:55] * isaranto afk
[22:30:27] Machine-Learning-Team, Data-Engineering, Event-Platform: Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399#10530563 (Ladsgroup) Thanks! That can be quite useful. I might try it out at the hackathon
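
The article-country change reviewed and merged above (https://gerrit.wikimedia.org/r/1117840) skips event production when the model returns no predictions. Below is a minimal sketch of that idea, assuming a hypothetical `send_event` hook and a `countries` field in the prediction payload; neither name is taken from the actual inference-services code.

```python
# Illustrative sketch only, not the actual inference-services change:
# skip event production when the article-country model returns nothing.
import logging
from typing import Any, Callable

def publish_prediction_event(
    prediction: dict[str, Any],
    send_event: Callable[[dict[str, Any]], None],  # hypothetical event-producer hook
) -> None:
    # Field name "countries" is an assumption for illustration.
    if not prediction.get("countries"):
        logging.debug("Empty article-country prediction, not publishing an event")
        return
    send_event(prediction)
```

The "PSS restricted profile" being enforced on ml-staging refers to the Kubernetes restricted Pod Security Standard, which, among other things, requires containers to run as non-root, drop all capabilities, disallow privilege escalation, and use the RuntimeDefault (or Localhost) seccomp profile. A sketch of a compliant container securityContext, built with the kubernetes Python client models, follows; it mirrors the general policy requirements, not necessarily the exact settings the knative/istio patches above inject.

```python
# Sketch of a container securityContext satisfying the "restricted"
# Pod Security Standard, built with the kubernetes client models; it mirrors
# the policy requirements, not the exact values the knative patch injects.
from kubernetes import client

restricted_security_context = client.V1SecurityContext(
    run_as_non_root=True,
    allow_privilege_escalation=False,
    capabilities=client.V1Capabilities(drop=["ALL"]),
    seccomp_profile=client.V1SeccompProfile(type="RuntimeDefault"),
)

if __name__ == "__main__":
    print(restricted_security_context)
```
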
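A quick usage note on the two sketches above: the first would typically be called from a model server's postprocess/produce path, and the second only describes what a compliant container spec looks like; the actual enforcement on ml-staging happens through the admission/injection machinery referenced in T369493, not through client code.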