[06:06:50] Updating recommendation-api in staging..
[06:16:45] ack!
[06:16:53] morning morning ...
[08:11:03] 10Lift-Wing, 06Machine-Learning-Team, 07OKR-Work, 13Patch-For-Review: Create event stream for article-country model-server hosted on LiftWing - https://phabricator.wikimedia.org/T382295#10519909 (10dcausse) https://en.wikipedia.org/wiki/Eskimo_potato?action=cirrusDump >>! In T382295#10516508, @kevinbazira...
[08:28:36] good morning!
[08:38:02] good morning folks
[09:21:20] hey folks! morning!
[09:21:30] I have a very nice task to show you https://phabricator.wikimedia.org/T385531 :D
[09:22:07] have we checked the size of the pytorch docker layers recently?
[10:07:03] Morning!
[10:07:44] I think Ilias recently worked on the pytorch images. He also mentioned that AMD is planning on providing much smaller images in the future (something like 20G -> 7G, but don't quote me on that)
[10:09:12] o/
[10:09:57] looking..
[10:13:27] ack :) klausman: when you have a moment, can you check https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1116826 ?
[10:17:15] so I updated the amd-pytorch25 image and I didn't check the compressed image size
[10:17:22] elukey: looking
[10:17:29] sorry about that. looking now and will report in the task
[10:19:35] isaranto: nah it is fine, it is probably a little bit bigger
[10:20:07] but I was able to publish it, so I suspect that when pytorch2.5 is uploaded, if anything else tries to push at the same time, we fill up the nginx reserved size
[10:20:32] klausman: thanks
[10:20:47] deploying to staging, let's see if it works
[10:20:54] :+1:
[10:46:05] for some reason I don't see the new sec defaults popping up
[10:46:46] turns out the new docker image was 4.1GB compressed :( https://phabricator.wikimedia.org/T385531#10520305
[10:47:18] I don't know if any layer exceeded the 4GB limit though
[10:48:07] elukey: as in helmfile diff?
[10:48:21] nono, in actual pod settings
[10:48:21] Or kubectl show?
[10:48:29] the deploy went fine
[10:48:37] so there may be something that I am not doing right
[10:49:01] I haven't updated the net-istio stuff, something that could be missing
[10:50:49] isaranto: https://docker-registry.wikimedia.org/amd-pytorch25/tags/ was uploaded though
[10:50:58] so possibly it is really really close
[10:51:32] depending on how big the php-fpm image is, not much more may be needed to end up over the shm limit in total
[10:51:48] we could temporarily avoid rebuilding the pytorch images
[10:52:19] 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: Quantize aya-expanse-32B with GPTQ (GPTQModel) - https://phabricator.wikimedia.org/T384734#10520352 (10gkyziridis) The issue was that the [[ https://github.com/ModelCloud/GPTQModel | GPTQModel repo ]] was and it could not work with Rocm driver in our...
[10:53:04] elukey: I don't see the latest tag on https://docker-registry.wikimedia.org/amd-pytorch25/tags/ so I don't think it has been updated
[10:53:10] *uploaded
[10:56:07] isaranto: ahhh 6.2!
[10:56:28] yeah then let's revert for the moment
[10:57:03] okk
[10:57:07] * isaranto sigh
[10:57:33] I saw a new promising thing from AMD https://phabricator.wikimedia.org/T385173
[10:58:08] although, based on Ubuntu, the new image seems to be 7.6GB. BUT the image was removed from Docker Hub a couple of days later
[10:58:32] I did pull it, so I have it locally to investigate a bit
[10:58:47] nice!
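For context on the layer-size discussion above: a minimal sketch of how one could check the compressed per-layer sizes of a tag against the 4GB limit being discussed, using the standard Docker Registry v2 HTTP API. The registry URL and image name appear in the log; the tag, the assumption that anonymous reads are allowed, and the exact limit value are placeholders/assumptions, not taken from the log.

```python
# Sketch only: list compressed layer sizes for a registry tag and flag any
# layer above an assumed per-layer limit. Uses the Docker Registry v2 API.
import requests

REGISTRY = "https://docker-registry.wikimedia.org"
IMAGE = "amd-pytorch25"   # image name from the log
TAG = "latest"            # placeholder tag, not from the log
LAYER_LIMIT_GIB = 4       # assumed per-layer limit under discussion

resp = requests.get(
    f"{REGISTRY}/v2/{IMAGE}/manifests/{TAG}",
    headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"},
    timeout=30,
)
resp.raise_for_status()
layers = resp.json().get("layers", [])

total_gib = 0.0
for layer in layers:
    size_gib = layer["size"] / 1024**3  # "size" is the compressed blob size
    total_gib += size_gib
    flag = "  <-- over the per-layer limit" if size_gib > LAYER_LIMIT_GIB else ""
    print(f"{layer['digest']}  {size_gib:.2f} GiB{flag}")

print(f"total compressed size: {total_gib:.2f} GiB")
```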
[10:59:10] one of these days I hope
[11:18:55] isaranto: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1117158
[11:20:06] docker-registry.discovery.wmnet/amd-pytorch25 2.5.1rocm6.2-1-20250202 bd89642271bc 23 hours ago 18.6GB
[11:20:31] uff rebase issue
[11:22:08] https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1117158
[11:22:14] basically two changelog reverts
[11:23:01] isaranto: --^ do you mind double-checking just in case?
[11:23:32] in a meeting currently, will check in 40'!
[11:30:52] np, already merged, I am trying to unblock the image builds now
[11:30:55] should be good
[12:15:16] * klausman lunch
[12:38:49] ack
[12:38:58] I see the new tag 2.5.1rocm6.1-1-20250126
[12:40:00] 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: Quantize aya-expanse-32B with GPTQ (GPTQModel) - https://phabricator.wikimedia.org/T384734#10520781 (10isarantopoulos) p:05Triage→03Medium a:03gkyziridis
[12:40:22] 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: Quantize aya-expanse-32B with GPTQ (GPTQModel) - https://phabricator.wikimedia.org/T384734#10520807 (10isarantopoulos) a:05gkyziridis→03isarantopoulos
[13:14:43] I am trying to run a simulation with different batch sizes and I am receiving `torch.OutOfMemoryError` https://phabricator.wikimedia.org/P73142#293343. Has anybody encountered the same issue?
[13:27:13] you'd get this error when there is no more VRAM left on the GPU. Try running `sudo nvtop` while you run this to monitor the GPU resources
[13:27:31] alright thnx
[13:28:06] to see if anyone is using it as well
[13:33:31] there seems to be a lot of memory reserved but unallocated for torch. It is odd since the only thing I see changing is the batch size from 4->6
[13:33:49] ah, the bits also change to 2, but that means it would be even smaller in memory
[13:34:36] 10Lift-Wing, 06Machine-Learning-Team: Create SLO dashboard for article-country model - https://phabricator.wikimedia.org/T384935#10520984 (10isarantopoulos)
[13:38:12] isaranto: exactly...
[13:38:41] there is probably something strange because now it takes a long time to load the dataset
[13:57:43] since we see high losses during the process, I would try to see if quantization with 8 bits (everything else as is) results in lower errors
[13:58:02] and then also with a bigger part of the dataset
[14:28:48] isaranto: I initially tried using a bigger part of the dataset but received errors. I also tried to use bigger group_sizes, but same thing. I will try again
[14:46:31] 06Machine-Learning-Team: Adding uv as a package manager on Lift Wing/blubber - https://phabricator.wikimedia.org/T384584#10521242 (10gkyziridis) Hey @dduvall thank you for checking this out. I cannot use the syntax: `# syntax = docker-registry.wikimedia.org/repos/releng/blubber/buildkit:v1.0.1` in my blubber.y...
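Regarding the `torch.OutOfMemoryError` and the "memory reserved but unallocated" observation in the exchange above: a minimal sketch of how one could inspect PyTorch's caching-allocator stats around an OOM. On ROCm builds the `torch.cuda` namespace is still the one to use. The workload below is a placeholder; the paste P73142 is not reproduced here, so this is not the actual quantization run.

```python
# Sketch only: compare allocated vs reserved GPU memory while chasing an OOM.
# "reserved" includes cached blocks held by the allocator but not in use.
import torch

def report(tag: str) -> None:
    alloc = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"[{tag}] allocated={alloc:.2f} GiB reserved={reserved:.2f} GiB")

report("before")
try:
    # placeholder workload; in the real case this would be the batched
    # quantization step that triggers the error
    x = torch.randn(8192, 8192, device="cuda")
    y = x @ x
    report("after matmul")
except torch.OutOfMemoryError:
    report("at OOM")
    # detailed breakdown of reserved vs allocated blocks, useful when the
    # reserved-but-unallocated gap looks suspiciously large
    print(torch.cuda.memory_summary())
    # releasing cached (reserved-but-unallocated) blocks sometimes helps
    torch.cuda.empty_cache()
```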
[15:23:36] isaranto, klausman o/ - I filed https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1117207
[15:24:07] really sorry for my previous attempt, I am not really sure how I didn't spot the patch command horror that I added
[15:24:10] ack, will review once the meeting is over
[15:24:16] anyway, I also added other patches :D
[15:24:24] IN THEORY this should be enough
[15:24:36] following the maze of knative commits is hard
[15:49:24] 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: [onboarding] Update revertrisk to kserve 0.14.1 - https://phabricator.wikimedia.org/T383119#10521521 (10isarantopoulos) p:05Triage→03Low
[16:01:22] 10Lift-Wing, 06Machine-Learning-Team: Create SLO dashboard for article-country model - https://phabricator.wikimedia.org/T384935#10521611 (10kevinbazira) Here are the article-country load test results: | 1 | Type | Name | Request Count | Failure Count | Median Response Time | Av...
[17:25:12] going afk folks, have a nice evening/rest of day o/
[17:39:51] 06Machine-Learning-Team: Adding uv as a package manager on Lift Wing/blubber - https://phabricator.wikimedia.org/T384584#10522050 (10dduvall) >>! In T384584#10521242, @gkyziridis wrote: > Hey @dduvall thank you for checking this out. > I cannot use the syntax: `# syntax = docker-registry.wikimedia.org/repos/re...
[18:07:12] night ilias!
[18:18:45] hey folks, ml-staging is currently having issues (revscoring-editquality-damaging) since the knative testing did not go as expected; something is still off
[18:18:50] will restart tomorrow!
[18:19:31] no, sorry, pods are up now, buuut knative is not really working as I expected, sigh
[18:19:34] anywayyy o/