[06:06:50] Updating recommendation-api in staging..
[06:16:45] ack!
[06:16:53] morning morning ...
[08:11:03] 10Lift-Wing, 06Machine-Learning-Team, 07OKR-Work, 13Patch-For-Review: Create event stream for article-country model-server hosted on LiftWing - https://phabricator.wikimedia.org/T382295#10519909 (10dcausse) https://en.wikipedia.org/wiki/Eskimo_potato?action=cirrusDump >>! In T382295#10516508, @kevinbazira...
[08:28:36] good morning!
[08:38:02] good morning folks
[09:21:20] hey folks! morning!
[09:21:30] I have a very nice task to show you https://phabricator.wikimedia.org/T385531 :D
[09:22:07] have we checked the size of the pytorch docker layers recently?
[10:07:03] Morning!
[10:07:44] I think Ilias recently worked on the pytorch images. He also mentioned that AMD is planning on providing much smaller images in the future (something like 20G -> 7G, but don't quote me on that)
[10:09:12] o/
[10:09:57] looking..
[10:13:27] ack :) klausman: when you have a moment, can you check https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1116826 ?
[10:17:15] so I updated the amd-pytorch25 image and I didn't check the compressed image size
[10:17:22] elukey: looking
[10:17:29] sorry about that. looking now and will report in the task
[10:19:35] isaranto: nah it is fine, it is probably a little bit bigger
[10:20:07] but I was able to publish it, so I suspect that when pytorch2.5 is uploaded, if anything else tries to push at the same time, we fill up the nginx reserved size
[10:20:32] klausman: thanks
[10:20:47] deploying to staging, let's see if it works
[10:20:54] :+1:
[10:46:05] for some reason I don't see the new sec defaults popping up
[10:46:46] turns out the new docker image was 4.1GB compressed :( https://phabricator.wikimedia.org/T385531#10520305
[10:47:18] I don't know if any layer exceeded the 4GB limit though
[10:48:07] elukey: as in helmfile diff?
[10:48:21] nono, in actual pod settings
[10:48:21] Or kubectl show?
[10:48:29] the deploy went fine
[10:48:37] so there may be something that I am not doing right
[10:49:01] I haven't updated the net-istio stuff, something that could be missing
[10:50:49] isaranto: https://docker-registry.wikimedia.org/amd-pytorch25/tags/ was uploaded though
[10:50:58] so possibly it is really really close
[10:51:32] depending on how big the php-fpm image is, not much more may be needed to end up over the shm limit in total
[10:51:48] we could temporarily avoid rebuilding the pytorch images
[10:52:19] 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: Quantize aya-expanse-32B with GPTQ (GPTQModel) - https://phabricator.wikimedia.org/T384734#10520352 (10gkyziridis) The issue was that the [[ https://github.com/ModelCloud/GPTQModel | GPTQModel repo ]] was and it could not work with Rocm driver in our...
[10:53:04] elukey: I don't see the latest tag on https://docker-registry.wikimedia.org/amd-pytorch25/tags/ so I don't think it has been updated
[10:53:10] *uploaded
[10:56:07] isaranto: ahhh 6.2!
[10:56:28] yeah then let's revert for the moment
[10:57:03] okk
[10:57:07] * isaranto sigh
[10:57:33] I saw a new promising thing from AMD https://phabricator.wikimedia.org/T385173
[10:58:08] although, based on Ubuntu, the new image seems to be 7.6GB. BUT the image was removed from Docker Hub a couple of days later
[10:58:32] I did pull it, so I have it locally to investigate a bit
[10:58:47] nice!
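For context on the layer-size discussion above: a minimal sketch of how one could check the compressed per-layer sizes of a tag against the 4GB limit being discussed, using the standard Docker Registry v2 HTTP API. The registry URL and image name appear in the log; the tag, the assumption that anonymous reads are allowed, and the exact limit value are placeholders/assumptions, not taken from the log.

```python
# Sketch only: list compressed layer sizes for a registry tag and flag any
# layer above an assumed per-layer limit. Uses the Docker Registry v2 API.
import requests

REGISTRY = "https://docker-registry.wikimedia.org"
IMAGE = "amd-pytorch25"   # image name from the log
TAG = "latest"            # placeholder tag, not from the log
LAYER_LIMIT_GIB = 4       # assumed per-layer limit under discussion

resp = requests.get(
    f"{REGISTRY}/v2/{IMAGE}/manifests/{TAG}",
    headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"},
    timeout=30,
)
resp.raise_for_status()
layers = resp.json().get("layers", [])

total_gib = 0.0
for layer in layers:
    size_gib = layer["size"] / 1024**3  # "size" is the compressed blob size
    total_gib += size_gib
    flag = "  <-- over the per-layer limit" if size_gib > LAYER_LIMIT_GIB else ""
    print(f"{layer['digest']}  {size_gib:.2f} GiB{flag}")

print(f"total compressed size: {total_gib:.2f} GiB")
```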
[10:59:10] one of these days I hope
[11:18:55] isaranto: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1117158
[11:20:06] docker-registry.discovery.wmnet/amd-pytorch25 2.5.1rocm6.2-1-20250202 bd89642271bc 23 hours ago 18.6GB
[11:20:31] uff rebase issue
[11:22:08] https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1117158
[11:22:14] basically two changelog reverts
[11:23:01] isaranto: --^ do you mind double-checking just in case?
[11:23:32] in a meeting currently, will check in 40'!
[11:30:52] np, already merged, I am trying to unblock the image builds now
[11:30:55] should be good
[12:15:16] * klausman lunch
[12:38:49] ack
[12:38:58] I see the new tag 2.5.1rocm6.1-1-20250126
[12:40:00] 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: Quantize aya-expanse-32B with GPTQ (GPTQModel) - https://phabricator.wikimedia.org/T384734#10520781 (10isarantopoulos) p:05Triage→03Medium a:03gkyziridis
[12:40:22] 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: Quantize aya-expanse-32B with GPTQ (GPTQModel) - https://phabricator.wikimedia.org/T384734#10520807 (10isarantopoulos) a:05gkyziridis→03isarantopoulos
[13:14:43] I am trying to run a simulation with different batch sizes and I am receiving `torch.OutOfMemoryError` https://phabricator.wikimedia.org/P73142#293343. Has anybody encountered the same issue?
[13:27:13] you'd get this error when there is no more VRAM left on the GPU. Try running `sudo nvtop` while you run this to monitor the GPU resources
[13:27:31] alright thnx
[13:28:06] to see if anyone is using it as well
[13:33:31] there seems to be a lot of memory reserved but unallocated for torch. It is odd since the only thing I see changing is the batch size from 4->6
[13:33:49] ah, the bits also change to 2, but that means it would be even smaller in memory
[13:34:36] 10Lift-Wing, 06Machine-Learning-Team: Create SLO dashboard for article-country model - https://phabricator.wikimedia.org/T384935#10520984 (10isarantopoulos)
[13:38:12] isaranto: exactly...
[13:38:41] there is probably something strange because now it takes a long time to load the dataset
[13:57:43] since we see high losses during the process, I would try to see if quantization with 8 bits (everything else as is) results in lower errors
[13:58:02] and then also with a bigger part of the dataset
[14:28:48] isaranto: I initially tried using a bigger part of the dataset but received errors. I also tried to use bigger group_sizes, but same thing. I will try again
[14:46:31] 06Machine-Learning-Team: Adding uv as a package manager on Lift Wing/blubber - https://phabricator.wikimedia.org/T384584#10521242 (10gkyziridis) Hey @dduvall thank you for checking this out. I cannot use the syntax: `# syntax = docker-registry.wikimedia.org/repos/releng/blubber/buildkit:v1.0.1` in my blubber.y...
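Regarding the `torch.OutOfMemoryError` and the "memory reserved but unallocated" observation in the exchange above: a minimal sketch of how one could inspect PyTorch's caching-allocator stats around an OOM. On ROCm builds the `torch.cuda` namespace is still the one to use. The workload below is a placeholder; the paste P73142 is not reproduced here, so this is not the actual quantization run.

```python
# Sketch only: compare allocated vs reserved GPU memory while chasing an OOM.
# "reserved" includes cached blocks held by the allocator but not in use.
import torch

def report(tag: str) -> None:
    alloc = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"[{tag}] allocated={alloc:.2f} GiB reserved={reserved:.2f} GiB")

report("before")
try:
    # placeholder workload; in the real case this would be the batched
    # quantization step that triggers the error
    x = torch.randn(8192, 8192, device="cuda")
    y = x @ x
    report("after matmul")
except torch.OutOfMemoryError:
    report("at OOM")
    # detailed breakdown of reserved vs allocated blocks, useful when the
    # reserved-but-unallocated gap looks suspiciously large
    print(torch.cuda.memory_summary())
    # releasing cached (reserved-but-unallocated) blocks sometimes helps
    torch.cuda.empty_cache()
```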
[15:23:36] isaranto, klausman o/ - I filed https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1117207
[15:24:07] really sorry for my previous attempt, I am not really sure how I didn't spot the patch command horror that I added
[15:24:10] ack, will review once the meeting is over
[15:24:16] anyway, I also added other patches :D
[15:24:24] IN THEORY this should be enough
[15:24:36] following the maze of knative commits is hard
[15:49:24] 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: [onboarding] Update revertrisk to kserve 0.14.1 - https://phabricator.wikimedia.org/T383119#10521521 (10isarantopoulos) p:05Triage→03Low
[16:01:22] 10Lift-Wing, 06Machine-Learning-Team: Create SLO dashboard for article-country model - https://phabricator.wikimedia.org/T384935#10521611 (10kevinbazira) Here are the article-country load test results: | 1 | Type | Name | Request Count | Failure Count | Median Response Time | Av...
[17:25:12] going afk folks, have a nice evening/rest of day o/
[17:39:51] 06Machine-Learning-Team: Adding uv as a package manager on Lift Wing/blubber - https://phabricator.wikimedia.org/T384584#10522050 (10dduvall) >>! In T384584#10521242, @gkyziridis wrote: > Hey @dduvall thank you for checking this out. > I cannot use the syntax: `# syntax = docker-registry.wikimedia.org/repos/re...
[18:07:12] night ilias!
[18:18:45] hey folks, ml-staging is currently having issues (revscoring-editquality-damaging) since the knative testing did not go as expected; something is still off
[18:18:50] will restart tomorrow!
[18:19:31] no, sorry, pods are up now, buuut knative is not really working as I expected, sigh
[18:19:34] anywayyy o/