[08:16:04] hm.. another point, and a possibly annoying one: while testing I see some variation depending on how you format the context. mainly wondering if we could do a quick round of tests with the evaluation set before committing to a particular shape
[10:43:47] lunch
[14:01:38] o/
[14:33:38] \o
[14:34:12] yea we can evaluate the context shape
[14:41:28] .o/
[14:47:25] hmm, i guess i'll try a few llama-server variants today. cdanis mentioned BLIS or BLAS in the ticket. Looking at it, it looks like we are using the built-in AVX512 kernels, no BLIS or BLAS (linear algebra libraries). might improve things
[14:50:26] ebernhardson: might, might not. literally all I know on the topic is from https://justine.lol/matmul/ and that post is 2 years old
[14:50:46] maybe also try out `llamafile` and see how it performs on CPU, it should be easy to do
[14:51:32] (but i would kind of expect that after 2 years that kind of improvement made it upstream to llama.cpp anyway, and i don't have any intuition for interpreting the 2yo hard benchmark numbers there)
[14:53:57] i don't know much either, we just ran the ghcr.io/ggml-org/llama.cpp:server docker container. They have gpu variants, but nothing fancy for cpu. Looks easy enough to compile up a custom container though
[14:54:02] shout out to the HP pro series! great for homelabs
[14:54:33] o/
[14:54:56] i suppose i can toy with numactl too, was intending to try pinning it to a socket but hadn't gotten around to it
[14:58:40] just saw some users reporting that openblas does not scale well with multiple threads (https://github.com/ggml-org/llama.cpp/issues/5534)
[15:01:27] ouch, that does look bad if true
[15:01:34] for blis I have no clue... I thought it was mainly useful for amd but I don't know much about these tbh
[15:01:56] iiuc blis and blas both implement the same API, one is just a newer implementation from someone else
[15:02:19] ah no, seems like amd has their own fork of blis?
(https://github.com/flame/blis/blob/master/docs/FAQ.md#what-is-the-difference-between-blis-and-the-amd-fork-of-blis-found-in-aocl)
[16:08:25] hm... was perhaps a bit ambitious to extract 600 query pairs with 7 variations of reranking settings, including some with k=30, without proper parallelism and using all relforge nodes to parallelize llama...
[16:08:37] :)
[16:09:28] I guess I'll limit to k <= 10 otherwise it won't finish today :)
[16:22:07] at the cost of a count() I think I'll add an argument to set the expected number of rows per task instead of asking for a plain number of partitions
[16:22:25] often makes sense, especially if the rows come from disk
[16:22:36] count() on a cached dataframe should have very minimal impact I guess?
[16:23:06] well, the first count() will take time as it resolves it, but the thing you do next with it should in theory be cached and fast
[16:23:16] sure
[16:24:16] just merged https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1238306 so `opensearch-semantic-search-test` should have access to liftwing now. Will add a patch for the `opensearch-semantic-search` ns shortly if that doesn't already exist
[16:24:33] inflatador: re settings in opensearch, I read that setting additionalConfig might propagate settings via env vars instead of writing them to opensearch.yaml https://forum.opensearch.org/t/cannot-set-additionalconfig-using-opensearch-k8s-operator/20231
[16:24:48] annoyingly the base docker container doesn't have llama-bench, i guess i gotta build a non-BLAS version to compare. With the BLAS version different options result in between 250 and 400 tok/sec. no clue if that's good or not :P
[16:24:56] thanks for the merge!
[16:25:25] is it on your own hardware or relforge?
[16:25:41] dcausse yeah, that's true and very confusing ;(.
But I think we should be able to see the config via curl regardless
[16:25:52] sure
[16:25:59] it's on relforge1010
[16:26:03] ok
[16:26:55] Re: config application, I can look again, but last time I checked it wasn't actually being set. It was the exact settings from the forum thread ;p
[16:26:57] best results so far are setting threads to 12 (instead of 24), or using numactl pinning with 10 threads. both ~same
[16:28:30] more performant on fewer threads... this might be similar to what I'm seeing in yarn?
[16:29:47] yea it could be, 12 threads probably means it's using all the "real" cores on a single socket and not the hyper-threads
[16:30:33] decided to be a little more methodical, now running a bash script loop over a variety of options
[16:31:09] note I'm running queries on relforge, might possibly add noise to your test :/
[16:31:28] hmm, i have top open and only llama-bench is taking cpu on 1010
[16:32:21] I guess most of the time spent on my end is now reranking (which is pinned to relforge1008)
[16:32:39] I'll stop the extraction nonetheless, I think I have enough "variations"
[16:33:46] perhaps most curious is that we can use a lot more cpu, probably (not 100% sure) more power, and get much worse results. So far the worst bench gets 21 t/s (threads=24, batch=32) vs 442 t/s (threads=12, batch=512)
[16:33:55] although batch might not be worth testing, our batch will probably always be 10
[16:47:33] hmm, actually batch might be the number of tokens per evaluation, rather than a simple batch. poking through the 1008 container logs i see batches of 1k or so
[16:48:08] the logs also suggest it's doing 3 evaluations, 4 docs + 4 docs + 2 docs, will see if setting it to 5+5 makes a difference
[17:05:02] yes... the batch settings are quite hard to understand... it seems like a combination of ubatch and batch
[17:05:44] for some reason I had to increase ubatch to 8196 to avoid failures...
[17:05:58] but did not touch batch, which is probably bad?
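The "bash script loop over a variety of options" mentioned above might look roughly like the sketch below. The model path is an assumption, and it only prints the commands it would run (drop the leading `echo` to actually execute); `-p` benchmarks prompt (input) tokens and `-n 0` skips generation.

```shell
# Dry-run sketch of a llama-bench sweep over thread counts and batch
# sizes, pinned to NUMA node 0 with numactl. Model path is hypothetical.
MODEL=/models/qwen3-reranker.gguf
for threads in 12 24; do
  for batch in 512 2048; do
    echo numactl --cpunodebind=0 --membind=0 \
      llama-bench -m "$MODEL" -t "$threads" -b "$batch" -p 512 -n 0
  done
done
```

Pinning both CPU and memory to the same node avoids cross-socket memory traffic, which seems to be what makes the single-socket runs faster.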
[17:07:00] hmm, interesting. I've noticed in my tests so far batch sizes do best at 512 (testing 32, 128, 512, 2048); 512 is perhaps 40% faster
[17:08:03] do the tests include long contexts?
[17:08:03] and using excessive threads is beyond terrible. lowest so far is 24 threads pinned to one cpu (so all threads + hyperthreads), which gets 34 tok/s at batch=512
[17:08:44] i'm not 100% sure, my understanding so far is it has a bench for input tokens and a bench for output tokens. Since we are all-input i've been doing just the input token bench. So 2048 should be 2048 input tokens iiuc
[17:09:00] output tokens are also way-way-way slower
[17:10:03] is it using the model you want or an artificial network?
[17:10:13] it's using the qwen3-reranker model.
[17:10:34] in one test, 408 tok/s on 512 input tokens, then 64 tok/s on 128 output tokens
[17:10:48] but if i understand the reranker, it's a single output token
[17:11:06] yes, the reranking task produces yes or no
[17:11:36] but does the llama bench have a rerank mode?
[17:11:49] i don't think so, it just separates the input and output token benches
[17:13:12] so it's probably not using the rerank prompt
[17:13:31] and thus not sure what it's doing
[17:15:21] i think it's just using random tokens, but since it's all about matrix multiplication i suspect it doesn't matter what the tokens are
[17:15:49] can you set ubatch to 8k and feed tokens up to 8k?
[17:16:53] yea, i'll adjust the script to drop the lower ones (32/128/512) and test some longer contexts
[17:17:22] looks like i can also skip most of the numa options, pinning to a single socket seems best (vs no pinning, or interleaving both sockets)
[17:30:32] lunch, back in ~1h
[17:31:38] ebernhardson: out of curiosity are you trying https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/BLIS.md or something else?
[17:32:18] dcausse: currently openblas (custom compiled container).
Will also try the upstream :full image (has llama-bench, :server doesn't)
[17:32:32] ok
[17:56:35] huh, turns out the plain cpu version is way faster than openblas. 12 threads / 512 input tokens comes in at 581 t/s
[17:56:47] :/
[17:58:06] slows dramatically as the batch size scales though: 256 batch gives 615, 2048 does 432, 8096 does 323
[18:13:12] meh... BroadcastExchangeExec: Could not execute broadcast in 300 secs, why is spark willing to broadcast??
[18:14:47] :S can't say i've seen that one before
[18:17:51] ahh, poking around, it suggests the reason BLAS isn't faster is because we are using Q8, but blas only does fp16 or fp32, so it's doing an extra step to de-quantize and then using more expensive math
[18:22:06] :/
[18:23:42] used Q8 because it seemed to be the best for cpus... can try to build an fp16 gguf model
[18:24:23] probably fine to keep as is, since the default build already has q8 specialized routines
[18:50:12] on the upside, it looks like the sockets are fully separate. pinning a bench to each socket gets a total throughput > 1k tok/sec
[18:58:45] nice
[19:13:56] OpenSearch plugin question: is `analysis-icu-client` the same as `analysis-icu`? It looks that way but just making sure
[19:14:07] ref https://github.com/opensearch-project/OpenSearch/tree/main/plugins/analysis-icu vs https://mvnrepository.com/artifact/org.opensearch.plugin/analysis-icu-client/3.3.2
[19:19:51] inflatador: hmm, looking but i doubt it
[19:21:51] it's possible to install `analysis-icu` from the CLI, but where that comes from is opaque and I'm not sure how to get checksums if we do it that way
[19:22:26] maybe there's a verbose flag I can try
[19:23:16] yea it's always opaque :S i'm not finding a definitive answer yet
[19:24:20] inflatador: hmm, well checking the install on relforge1010, the analysis-icu plugin has jars for analysis-icu-client, icu4j, and lucene-analysis-icu. So maybe?
[19:36:11] Ah OK.
Looks like I can also get the plugin from https://artifacts.opensearch.org/releases/plugins/analysis-icu/3.3.2/analysis-icu-3.3.2.zip (thanks ChatGPT)
[19:40:48] And it has checksums too, yay! https://artifacts.opensearch.org/releases/plugins/analysis-icu/3.3.2/analysis-icu-3.3.2.zip.sha512
[19:41:21] ebernhardson: ah, sorry for sending you on a wild goose chase
[19:41:52] cdanis: no worries, i did find by benchmarking that we can improve speed with --numa numactl, so it more intelligently uses both sockets
[19:42:05] nice, that makes sense
[19:42:22] about 550 on one socket, 800 on both sockets, or 1100 with 2 processes, one pinned to each socket
[19:42:47] is that input tok/s?
[19:43:02] yea, the way re-ranking works there is only 1 output token, so only inputs matter
[19:43:05] right
[19:43:07] cool :)
[19:43:33] still slower than we might want, but better :)
[20:18:04] playing with the demo connected, looks like the main difference is we now stay consistently under 3s total, but still in the 2.5-3s range typically
[20:20:07] dcausse: btw i moved the llama-server into the ~opensearch/docker-compose.yml, same config on all instances
[20:46:53] thanks!
[20:57:47] * dcausse realizes he should have customized this file instead of hacking the image...
[20:57:58] no worries :)
[20:58:40] hmm, i just realized when i made them all the same it lost the opensearch-dashboards on relforge1008, but i only used that at the very beginning
[20:59:00] well, not sure how to pass the wmf certs there, though it needs a script to run...
[20:59:27] me too, I used it to list models and such but stopped opening it
[21:00:47] I think most of the customizations were to escape the proxy... but seems like we no longer need the proxy there?
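Going back to the plugin zip and its published .sha512 above, the download-and-verify step can be sketched as below. The two curl lines are shown as comments and a local stand-in file is generated instead, so the verification itself runs anywhere; with the real download, `sha512sum -c` checks the zip against the published digest.

```shell
# Hedged sketch: verify the analysis-icu plugin zip against its .sha512
# (URLs from artifacts.opensearch.org above). Stand-in file simulates
# the download so the check is demonstrable offline.
cd "$(mktemp -d)"
# curl -fsSLO https://artifacts.opensearch.org/releases/plugins/analysis-icu/3.3.2/analysis-icu-3.3.2.zip
# curl -fsSLO https://artifacts.opensearch.org/releases/plugins/analysis-icu/3.3.2/analysis-icu-3.3.2.zip.sha512
printf 'stand-in plugin bytes' > analysis-icu-3.3.2.zip
sha512sum analysis-icu-3.3.2.zip > analysis-icu-3.3.2.zip.sha512
sha512sum -c analysis-icu-3.3.2.zip.sha512   # prints "analysis-icu-3.3.2.zip: OK"
```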
[21:02:10] If you need help with the compose hackery LMK, you can probably mount the certs from the host paths like `/etc/ssl/certs` to whatever the Red Hat version is, and/or use `ar` to unarchive the deb pkg and shove the certs into the RHEL path and run their version of `update-ca-certs`. I have prior art in my homelab somewhere ;)
[21:03:18] I added /etc/pki/ca-trust/source/anchors/Wikimedia_Internal_Root_CA.crt and ran update-ca-trust
[21:03:40] Was that enough to get it to work?
[21:04:14] yes, I needed to point java to /etc/pki/ca-trust/extracted/java/cacerts with -Djavax.net.ssl.trustStore=/etc/pki/ca-trust/extracted/java/cacerts
[21:04:50] ideally I'd like to mount something so that I don't need to run a command and can put all this in the docker-compose
[21:04:57] dcausse: what base image are you using?
[21:05:03] centos :(
[21:06:01] ahh okay 😔
[21:06:05] `"/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem:/etc/ssl/certs/ca-certificates.crt"` is what I have in my homelab, but that's mounting from RH to Debian
[21:06:21] if you flip the order it will (hopefully) work
[21:06:38] it's the default opensearch image, but we realized it's not super well packaged, java is not installed via standard packages :/
[21:06:49] java might complicate things though. let me check if I've done that with a java app
[21:08:13] well, I suppose debian must have a java truststore as well, I could possibly mount the host one there too, I don't think we store anything private in there except cert roots
[21:09:07] yeah, you have the correct path for the Red Hat java truststore... you can probably mount the host version there too, nothing private there like you said
[21:10:08] ack, I'll clean these up a bit tomorrow, having to hack the image is too painful
[21:11:04] I can help with a Debian-based image too if you prefer.
I'm almost done with the icu-plugin part
[21:11:13] just in case it'd be helpful to have a stable place to get .crt files from, there's the `wmf-certificates` deb on apt.wikimedia.org ... definitely doesn't help directly, but shrug
[21:13:24] hmm, i didn't change the proxy settings, and it looks like opensearch is starting with some proxy options. Probably those can go into the env: section of docker-compose
[21:15:48] the docker-compose file has some OPENSEARCH_JAVA_OPTS with the proxy set, I had to set things like -Dhttps.nonProxyHosts=localhost|127.0.0.1|*.wmnet|*.wikimedia.org & -Dhttp.nonProxyHosts=localhost|127.0.0.1|*.wmnet|*.wikimedia.org to escape it
[21:16:20] ahh! ok yea those probably need to be added
[21:16:25] also, stop working, it's late :P
[21:16:31] :)
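For reference, the OPENSEARCH_JAVA_OPTS discussed above might look like the sketch below, expressed as the equivalent shell export (in the compose file it would sit under the `environment:` section). The nonProxyHosts and trustStore flags are the ones quoted in the conversation; the actual proxyHost settings are site-specific and omitted here.

```shell
# Sketch: proxy-escape and truststore JVM flags for the opensearch
# container, as a shell export. Flag values are from the discussion;
# proxyHost entries themselves are intentionally left out.
export OPENSEARCH_JAVA_OPTS="\
-Dhttp.nonProxyHosts=localhost|127.0.0.1|*.wmnet|*.wikimedia.org \
-Dhttps.nonProxyHosts=localhost|127.0.0.1|*.wmnet|*.wikimedia.org \
-Djavax.net.ssl.trustStore=/etc/pki/ca-trust/extracted/java/cacerts"
echo "$OPENSEARCH_JAVA_OPTS"
```

Quoting the pipe-separated host list matters in both shell and compose, since `|` and `*` would otherwise be interpreted.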