[07:10:03] Lift-Wing, Machine-Learning-Team, OKR-Work: Create event stream for article-country model-server hosted on LiftWing - https://phabricator.wikimedia.org/T382295#10527740 (kevinbazira) Thanks @Ottomata and @dcausse for the confirmation. The article-country model-server that supports both streams has be...
[07:54:18] Lift-Wing, Machine-Learning-Team, OKR-Work: Create event stream for article-country model-server hosted on LiftWing - https://phabricator.wikimedia.org/T382295#10527773 (kevinbazira)
[07:57:51] Lift-Wing, Machine-Learning-Team, OKR-Work: Stop publishing events without article-country predictions - https://phabricator.wikimedia.org/T385771 (kevinbazira) NEW
[08:31:50] hello folks!
[08:44:33] (PS1) Kevin Bazira: article-country: don't publish events when prediction results are empty [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1117840 (https://phabricator.wikimedia.org/T385771)
[09:00:43] Lift-Wing, Machine-Learning-Team, OKR-Work: Create event stream for article-country model-server hosted on LiftWing - https://phabricator.wikimedia.org/T382295#10527889 (Ottomata) It's beautiful! https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:eventstreams,mediawiki.article...
[09:08:56] good morning
[09:10:39] Kalimera!
[09:40:03] Buongiorno!
[10:02:55] I'm building the rocm base image and my laptop is almost frozen
[10:13:53] :(
[10:16:15] Last night the simulation ran correctly and produced some results which look ok (at least no padding words).
[10:16:15] I can now load the model and generate the same results. I also found a better way to filter the training dataset.
[10:16:15] Check this paste if you have time: https://phabricator.wikimedia.org/P73142#293796
[10:26:00] it seems that the losses are too high
[10:27:01] did you figure out the issue with the padding tokens?
[10:33:22] does anyone know the maximum cpu/memory we can allocate in a namespace? I tried to deploy ref-quality in staging with more cpus and memory, but there was no response after sync
[10:33:35] I'm wondering if this might be because it exceeds the resource limitations
[10:37:44] aiko: check kubectl get events for the namespace, if you crossed the limits you'll see an error there
[10:39:02] isaranto: Nope, I do not understand... I can now load the quantized model and reproduce exactly the same results. I could not find what happened yesterday
[10:39:47] elukey: thanksss!
[10:40:04] indeed it crossed the limits
[10:40:10] "maximum cpu usage per Container is 8, but limit is 12, maximum cpu usage per Pod is 10, but limit is 15"
[10:44:37] the limits are under deployment-charts' admin_ng; the values.yaml file for ml-serve should list all the namespaces and their config
[10:44:45] if there is no config for a namespace, it gets the default
[10:44:52] that is defined in admin_ng for all clusters
[10:44:59] I need to go but ping me later if you have troubles
[10:45:29] I can check with Aiko, Luca, thanks for the help!
[10:47:49] aiko: you can check kubectl get resourcequota to see the namespace quotas
[10:51:40] Apart from the resource quotas, the limitranges are important. There is a specific limit per pod and per container. The defaults I see don't specify cpu, so iiuc it just gets the default, which is 8
[10:51:47] ```
[10:51:47] maximum cpu usage per Container is 8, but limit is 12, maximum cpu usage per Pod is 10, but limit is 15
[10:51:47] ```
[10:52:55] isaranto: right!
[10:52:57] So if we need to bump this up, we'll need to add a limitrange entry under the revision-models key in the yaml, similar to what we have for article-descriptions here https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/values/ml-serve.yaml#225
[10:54:30] so there are 4 things to take into consideration:
[10:54:31] 1. cpu per container (limitrange)
[10:54:31] 2. cpu per pod (limitrange)
[10:54:31] 3. total cpu of namespace (resourcequota)
[10:54:31] 4. total cpu of node
[10:54:50] and, as is reasonable, 4 > 3 > 2 > 1
[10:55:21] ty!! I checked the resourcequota, revision-models ns has 90 cpus and 100Gi mem. definitely enough. so we need to change limitranges for container and pod
[10:55:41] I'll file a patch for that!
[10:56:08] ok, I can review anytime. However we'll need Tobias to deploy it in the admin_ng ns
[10:57:38] ack
[11:08:43] Morning!
[11:09:00] I can also assist with the limits (and deploying them)
[11:20:56] the patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1117873
[11:21:10] Looking
[11:22:49] So the intent is to give the rev-models NS the same amount of resources as article-desc?
[11:27:31] aiko: I've added a c&p of what the production diff would look like to the Gerrit review
[11:29:14] o/ Tobias
[11:33:45] yeah I think 16/24 are reasonable numbers as we ask 12 for ref-quality
[11:34:00] for now
[11:34:45] roger
[11:35:09] I can +2 for merge unless Ilias wants to add anything
[11:41:29] Alright, proceeding
[11:46:04] aiko: Isn't 44 Gi per pod too much?
[11:48:52] maybe.. for mem I just followed what article-desc has
[11:50:04] That one was specific because the model was big
[11:50:05] ahh it's merged
[11:50:38] I can file another patch to change that
[11:50:40] No worries, we can check memory usage and put something more reasonable
[11:51:10] ok!
[11:51:22] ack, will push to the clusters in a moment
[11:54:05] ok, staging done
[11:54:24] aiko: did only staging bump into the limits so far, or prod as well?
[11:54:53] (or put another way: should I also push to prod immediately?)
[12:08:29] klausman: yep, I haven't deployed to prod, but there'd be the same issue there
[12:08:33] I think sth like 10Gi per pod would be enough, but we can wait to see the resource usage first
[12:09:15] the resourcequota in the ns is 100Gi so 44Gi per pod wouldn't work if we have 3 replicas
[12:09:40] it is not likely to happen, but it is good to have limitranges that make some sense
[12:16:19] thanks for taking care of that Aiko & Tobias
[12:22:43] ack, pushed to all three clusters
[12:22:56] * klausman lunch (but I'm in shouting distance)
[12:58:27] * isaranto lunch!
[14:51:08] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Quantize aya-expanse-32B with GPTQ (GPTQModel) - https://phabricator.wikimedia.org/T384734#10528898 (isarantopoulos) Building a wheel from source with the latest release (v.1.7.4) is successful. However building it from the tip of the main branch fa...
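
For reference, the four levels discussed above can be checked directly on the cluster. The sketch below reads the ResourceQuota (level 3) and LimitRange (levels 1 and 2) objects of the revision-models namespace with the official `kubernetes` Python client instead of kubectl; the kubeconfig handling and output format are illustrative assumptions, not part of the team's tooling.

```python
# Sketch: read the ResourceQuota and LimitRange objects discussed above with
# the official `kubernetes` Python client (kubectl equivalents:
# `kubectl get resourcequota` and `kubectl describe limitrange`).
from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig/context for the target cluster
v1 = client.CoreV1Api()
namespace = "revision-models"

# Level 3: total cpu/memory the namespace may claim.
for rq in v1.list_namespaced_resource_quota(namespace).items:
    print("resourcequota", rq.metadata.name, "hard:", rq.status.hard, "used:", rq.status.used)

# Levels 1 and 2: per-container and per-pod ceilings.
for lr in v1.list_namespaced_limit_range(namespace).items:
    for item in lr.spec.limits:
        print("limitrange", lr.metadata.name, item.type, "max:", item.max)
```
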
[14:52:50] forgot to update you folks - ml-staging-codfw is running with the new knative, injecting security settings into (most of) the containers of pods being created
[14:53:10] it shouldn't cause any trouble; in case it does, please ping me or report the issue in https://phabricator.wikimedia.org/T369493
[14:53:30] sadly the work is not yet complete, I need to find a way to inject the security context also into the istio containers
[14:53:50] once done, we'll be ready to upgrade to the new k8s version anytime
[14:59:50] ack, thanks Luca!
[15:57:20] good morning all
[16:08:17] o/ whenever you get a minute please review: https://gerrit.wikimedia.org/r/1117840
[16:11:37] o/ Chris
[16:12:21] (CR) Ilias Sarantopoulos: [C:+1] article-country: don't publish events when prediction results are empty [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1117840 (https://phabricator.wikimedia.org/T385771) (owner: Kevin Bazira)
[16:16:50] (CR) Kevin Bazira: [C:+2] "thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1117840 (https://phabricator.wikimedia.org/T385771) (owner: Kevin Bazira)
[16:17:36] (Merged) jenkins-bot: article-country: don't publish events when prediction results are empty [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1117840 (https://phabricator.wikimedia.org/T385771) (owner: Kevin Bazira)
[17:07:27] all right I think I found a way to fix the last containers
[17:07:29] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1117939
[17:07:51] worked on ml-staging enforcing the PSS restricted profile
[17:10:02] I am off till next week, and then there is the SRE summit, so I'll merge it sometime next week :)
[17:10:05] o/
[17:17:31] \o
[17:35:33] \o
[17:35:55] * isaranto afk
[22:30:27] Machine-Learning-Team, Data-Engineering, Event-Platform: Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399#10530563 (Ladsgroup) Thanks! That can be quite useful. I might try it out at the hackathon
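
The article-country change reviewed and merged above (https://gerrit.wikimedia.org/r/1117840) skips event production when the model returns no predictions. Below is a minimal sketch of that idea, assuming a hypothetical `send_event` hook and a `countries` field in the prediction payload; neither name is taken from the actual inference-services code.

```python
# Illustrative sketch only, not the actual inference-services change:
# skip event production when the article-country model returns nothing.
import logging
from typing import Any, Callable

def publish_prediction_event(
    prediction: dict[str, Any],
    send_event: Callable[[dict[str, Any]], None],  # hypothetical event-producer hook
) -> None:
    # Field name "countries" is an assumption for illustration.
    if not prediction.get("countries"):
        logging.debug("Empty article-country prediction, not publishing an event")
        return
    send_event(prediction)
```

The "PSS restricted profile" being enforced on ml-staging refers to the Kubernetes restricted Pod Security Standard, which, among other things, requires containers to run as non-root, drop all capabilities, disallow privilege escalation, and use the RuntimeDefault (or Localhost) seccomp profile. A sketch of a compliant container securityContext, built with the kubernetes Python client models, follows; it mirrors the general policy requirements, not necessarily the exact settings the knative/istio patches above inject.

```python
# Sketch of a container securityContext satisfying the "restricted"
# Pod Security Standard, built with the kubernetes client models; it mirrors
# the policy requirements, not the exact values the knative patch injects.
from kubernetes import client

restricted_security_context = client.V1SecurityContext(
    run_as_non_root=True,
    allow_privilege_escalation=False,
    capabilities=client.V1Capabilities(drop=["ALL"]),
    seccomp_profile=client.V1SeccompProfile(type="RuntimeDefault"),
)

if __name__ == "__main__":
    print(restricted_security_context)
```
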
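A quick usage note on the two sketches above: the first would typically be called from a model server's postprocess/produce path, and the second only describes what a compliant container spec looks like; the actual enforcement on ml-staging happens through the admission/injection machinery referenced in T369493, not through client code.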