[07:20:12] (PS1) Kevin Bazira: article-country: use weighted tags stream instead of SQLite db [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1119370 (https://phabricator.wikimedia.org/T385970)
[07:28:17] (CR) Kevin Bazira: "For more context, this patch is an equivalent of the `title_to_links()` from the Research team's article-country prototype:" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1119370 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[07:51:50] Good morning folks o/
[08:12:10] Good morning!
[08:47:03] morning :)
[08:56:22] \o
[10:10:55] klausman: I have trouble connecting to docker hub from ml-lab
[10:12:43] it seems that there is a connection issue, you can replicate it by running `docker run hello-world`.
[10:12:43] I think I'd try 2 things: restarting the docker service and setting the proxy details in /etc/docker/docker.json
[10:13:53] I tried to create a config file to do this, but I don't have permission either to create /etc/docker/docker.json or to restart the docker service
[10:13:58] this is what I want to set https://docs.docker.com/engine/cli/proxy/#configure-the-docker-client
[10:30:26] Machine-Learning-Team: [LLM] ML-lab benchmarking - https://phabricator.wikimedia.org/T382343#10548614 (gkyziridis) Unfortunately [[ https://github.com/huggingface/optimum-benchmark/tree/e0b65f8587af7a3174a7414ab6972a55c4d97a28 | optimum-benchmark ]] does not support `GPTQModel` yet. Even if I installed the `...
[10:38:35] Machine-Learning-Team: [LLM] ML-lab benchmarking - https://phabricator.wikimedia.org/T382343#10548638 (isarantopoulos) Thanks for pointing that out. I opened an [[ https://github.com/huggingface/optimum-benchmark/issues/315 | issue about it on GH ]]
[10:48:22] isaranto: looking
[10:48:31] Thanks!
[10:51:43] isaranto: updated the config and restarted dockerd, can you test?
[10:55:52] I'm still getting the same issue.
I tried logging in and out of the machine but the same thing happens.
[10:56:57] mh, I fixed some extra `;`, it might work now
[10:57:43] Lift-Wing, Machine-Learning-Team, OKR-Work, Patch-For-Review: Update the article-country isvc to use Wikilinks for predictions - https://phabricator.wikimedia.org/T385970#10548654 (Gehel) It looks like you are planning on using cirrusdoc directly, which isn't supposed to be a stable interface. I'...
[10:58:24] thank you for the help! I still get the same issue. Does it work for you?
[10:58:47] do you have an example cmdline?
[11:01:17] ah, the hello-world one I can try.
[11:01:27] yeah, confirmed not working. let me do some digging
[11:08:06] It looks like this version of docker does not support proxy config :(
[11:11:38] Lift-Wing, Machine-Learning-Team, EditCheck: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10548738 (isarantopoulos) I'd like to start the conversation and propose input/output schemas for the service. We'd like to define a schema that would also be future pr...
[11:12:58] isaranto: fixed it! Our version of dockerd doesn't allow for proxy conf in /etc, only via env vars
[11:12:59] aha, interesting
[11:13:15] I just tried hello-world and it worked
[11:13:39] same here!
[11:13:46] Deleted the image so you can verify pulling works for you, too
[11:14:31] yes, it works Tobias, now I am downloading the image I need. Thanks a lot <3
[11:14:35] how did u fix it?
[11:16:51] You can set the env for dockerd in /etc/default/docker
[11:18:25] ack, thanks again!
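A minimal sketch of what the 11:12 to 11:16 fix amounts to: since this dockerd version only takes its proxy from environment variables, the settings go into /etc/default/docker as KEY=value lines. The proxy URL and NO_PROXY entries are placeholders, not the real ml-lab values, and it is an assumption that the docker service on the host reads this file (for example through a systemd `EnvironmentFile=` line), so check the unit first.

```python
"""Render KEY=value proxy settings for dockerd (hypothetical helper)."""

# Variable names dockerd's Go HTTP client reads from its environment.
PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY")


def render_dockerd_env(proxy_url: str, no_proxy: list[str]) -> str:
    """Return env-file text with one KEY=value pair per line.

    Plain KEY=value lines (no `export`) can be sourced by a shell and
    also parsed by a systemd EnvironmentFile= directive.
    """
    values = {
        "HTTP_PROXY": proxy_url,
        "HTTPS_PROXY": proxy_url,
        "NO_PROXY": ",".join(no_proxy),
    }
    return "".join(f"{key}={values[key]}\n" for key in PROXY_VARS)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(render_dockerd_env("http://proxy.example.org:8080",
                             ["localhost", "127.0.0.1"]), end="")
```

Writing the output to /etc/default/docker needs root, and dockerd has to be restarted afterwards for the variables to take effect.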
[11:20:08] np :)
[11:23:02] well it seems like I opened a can of worms for ya
[11:23:22] now it seems that docker doesn't have any disk space left
[11:23:26] `failed to register layer: ApplyLayer exit status 1 stdout: stderr: write /opt/rocm-6.3.1/lib/hipblaslt/library/TensileLibrary_I8I8_I8I8_A_SAV_UA_Type_I8I8_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_CU80_gfx942.co: no space left on device`
[11:24:32] ah, that's an easy fix
[11:26:04] try again.
[11:28:11] I moved the docker image dir to Ceph, that should be plenty of space ;)
[11:28:31] oh great! I'm running it again
[11:33:42] oops, got the same issue
[11:34:15] I don't know if I have to log out and back in to pick up the changes. I did this inside a tmux session
[11:34:28] going afk for lunch and an errand, will try again later
[11:35:46] mmmh, I wonder how it can run out of disk space. I'll have lunch too, we can tackle things afterwards
[12:04:49] o/ I've put together a DFD to visualize the data flow in the article-country model-server: https://phabricator.wikimedia.org/F58394877
[12:04:49] the goal is to make it easier for everyone, especially those who haven't worked on a particular model-server, to get a quick grasp of how it works.
[12:04:49] let me know if you find this kind of diagram useful!
[12:31:38] awesome Kevin! will review later. We should add that to the task for now and then to documentation later
[13:04:33] I tried downloading the rocm/vllm image and got the same error while writing the same file
[13:04:33] ```
[13:04:33] failed to register layer: ApplyLayer exit status 1 stdout: stderr: write /opt/rocm-6.3.1/lib/hipblaslt/library/TensileLibrary_I8I8_I8I8_A_SAV_UA_Type_I8I8_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_CU80_gfx942.co: no space left on device
[13:04:33] ```
[13:21:13] This is weird. it seems like the container itself is running out of disk space?
[13:22:36] Which it really shouldn't, given the amount of disk space docker has for that purpose.
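When a pull fails with `no space left on device`, a quick first check is which host filesystem is actually short of space. The sketch below reports free space per path with Python's `shutil.disk_usage`. The path list is an assumption: `/var/lib/docker` is only Docker's default data root, and here the image dir had already been moved to Ceph. As suspected at 13:21, a size limit inside Docker itself would not show up in this report.

```python
import shutil


def free_space_report(paths: list[str]) -> dict[str, float]:
    """Return {path: free GiB} for each path that exists on this host."""
    report = {}
    for path in paths:
        try:
            usage = shutil.disk_usage(path)
        except FileNotFoundError:
            continue  # skip paths that do not exist here
        report[path] = usage.free / 2**30
    return report


if __name__ == "__main__":
    # Hypothetical path list; adjust to the host's actual layout.
    for path, gib in free_space_report(["/", "/home", "/tmp", "/var/lib/docker"]).items():
        print(f"{path}: {gib:.1f} GiB free")
```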
[13:23:50] isaranto: can you pastebin the command(s) you'
[13:23:55] re running?
[13:26:29] yes, just a sec
[13:26:55] I have a suspicion that something is built in your homedir, which is of course much smaller
[13:29:33] here it is https://phabricator.wikimedia.org/P73461
[13:40:00] Ok, I arranged things so that /home is now on Ceph and I am trying your commands one at a time. Nothing bad happening so far. I do recommend logging out and back in so you're in the right place working-directory-wise
[13:45:03] Yeah, there is something weird going on. the docker build command failed exactly as it did for you, but none of the local filesystems is anywhere near full
[13:51:32] Machine-Learning-Team: [LLM] ML-lab benchmarking - https://phabricator.wikimedia.org/T382343#10549151 (gkyziridis) >>! In T382343#10419920, @achou wrote: > **llmperf** > > Below are the results from llmperf benchmarking for aya-expanse-8b and 32b original and quantized models (AWQ, Bitsandbytes, and GPTQ):...
[13:58:44] isaranto: fixed it! So when pulling and unpacking an image, Docker uses a sandbox filesystem. By default it's capped at 20G. I have increased the limit to 100G. the docker pull now works, I suspect the build command will as well
[14:00:13] Folks, I am trying to reproduce the results in https://phabricator.wikimedia.org/T382343 using llmperf but I cannot. Has anybody used it recently
[14:00:16] ?
[14:11:10] klausman: thnx and interesting find, wasn't aware
[14:11:30] Neither was I until I started digging into the dockerd code :)
[14:12:21] The functionality is sorta documented, but going from "Docker pull fails with ENOSPC" to "I need this cmdline flag" is not easy since there are no keywords to tie them together
[14:19:14] georgekyz: o/ I used conda at that time. let me find if I have some notes for that
[14:19:34] aiko: thnx
[14:23:11] aiko: I wanted to mention this about peacock detection in our meeting earlier and I forgot https://phabricator.wikimedia.org/T386100#10548737.
It is not urgent ofc, just wanted to bring it up
[15:09:01] Machine-Learning-Team: [LLM] ML-lab benchmarking - https://phabricator.wikimedia.org/T382343#10549516 (achou) > - What virtual environment manager you used to built the env (I see `uv.lock`) > - Which torch version you used? I created a conda env with python 3.11, and used `export PYTHONPATH="$(conda info...
[15:59:15] isaranto: ack!
[16:01:04] Machine-Learning-Team: [LLM] ML-lab benchmarking - https://phabricator.wikimedia.org/T382343#10549946 (gkyziridis) Thank you for your help @AikoChou. I think there is an issue with `typing_extensions.cpython-311.pyc`. The `llmperf` tries to install `typing-extensions>=4.12.2` but we are using `typing_exten...
[16:11:07] Machine-Learning-Team: [LLM] ML-lab benchmarking - https://phabricator.wikimedia.org/T382343#10550000 (achou) > The `llmperf` tries to install `typing-extensions>=4.12.2` but we are using `typing_extensions-4.9.0` in the `PYTHONPATH` for rocm. Ah right, I ran into the same issue before. I solved it by using...
[16:38:26] klausman: I have some issues now in the docker build which again seem to be connectivity issues. For the time being, though, I am working with/exploring the upstream-provided docker image, so it isn't urgent; we can check it tomorrow if you have some time. I put the info in the same paste https://phabricator.wikimedia.org/P73461#294513
[16:39:10] Machine-Learning-Team: [LLM] ML-lab benchmarking - https://phabricator.wikimedia.org/T382343#10550178 (MunizaA) Hi @gkyziridis, apologies if the documentation doesn't make it clear but the path that you're passing to `--config-dir` needs to exist and contain experiment config files. If you've already cloned...
[16:39:23] I suspect the proxy env vars need to be set inside the container as well
[16:42:23] ah yes, I'll try that, dumb of me not to think of it
[16:42:57] nah, it's one of those non-obvious things that happen when you have layers upon layers
[16:53:26] * isaranto afk!
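The 16:01 and 16:11 comments describe a version clash: llmperf wants `typing-extensions>=4.12.2`, but the ROCm `PYTHONPATH` provides 4.9.0. Below is a hypothetical helper that checks which version the interpreter actually resolves, using the stdlib `importlib.metadata`. Comparing the raw strings would be wrong here (`"4.9.0" > "4.12.2"` is true as strings), so the sketch compares numeric release tuples. It deliberately ignores pre-release suffixes; `packaging.version.Version` is the more complete option if that package is available.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(ver: str) -> tuple[int, ...]:
    """Turn '4.12.2' into (4, 12, 2); suffixes like 'rc1' are dropped (simplification)."""
    parts = []
    for piece in ver.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Numeric comparison, padding so that (4, 12) equals (4, 12, 0)."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_distribution(dist: str, minimum: str) -> str:
    """Report the version found first on sys.path (PYTHONPATH entries included)."""
    try:
        installed = version(dist)
    except PackageNotFoundError:
        return f"{dist}: not installed"
    status = "OK" if meets_minimum(installed, minimum) else "TOO OLD"
    return f"{dist} {installed} (need >= {minimum}): {status}"


if __name__ == "__main__":
    print(check_distribution("typing_extensions", "4.12.2"))
```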
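Following the 16:39 suspicion, one way to get the proxy into the build containers is Docker's predefined proxy build args (`HTTP_PROXY`, `HTTPS_PROXY`, `NO_PROXY` and their lowercase forms). `docker build --build-arg` accepts these without a matching `ARG` line in the Dockerfile. The sketch below only assembles the command line from the current environment, and the image tag and build context are hypothetical. The Docker docs page linked at 10:13:58 describes an alternative: proxy settings in the client's `~/.docker/config.json` are injected into new builds and containers automatically.

```python
import os
import shlex
from collections.abc import Mapping
from typing import Optional

# Docker treats these names as predefined build args.
PROXY_BUILD_ARGS = ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY",
                    "http_proxy", "https_proxy", "no_proxy")


def docker_build_cmd(context: str, tag: str,
                     env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Build a `docker build` argv that forwards any proxy vars that are set."""
    env = os.environ if env is None else env
    cmd = ["docker", "build", "-t", tag]
    for name in PROXY_BUILD_ARGS:
        value = env.get(name)
        if value:
            cmd += ["--build-arg", f"{name}={value}"]
    cmd.append(context)
    return cmd


if __name__ == "__main__":
    # Hypothetical tag; pass the list to subprocess.run(cmd, check=True) to build.
    print(shlex.join(docker_build_cmd(".", "my-vllm-image:dev")))
```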