[09:28:39] no permissions for [cluster:admin/opensearch/ml/predict] and User [name=opensearch, ...] [10:27:42] seems like perf degrades very quickly when using liftwing vs using pure knn on a random vector [10:27:58] not clear if it's a connector limitation or a liftwing issue [11:31:49] lunch [12:15:38] pre-fetching vectors from liftwing shows that liftwing is not to blame... [13:31:45] can do ~34qps on relforge and a concurrency of 10, dse-k8s can barely sustain 1.3qps [13:37:52] on relforge the limiting factor is liftwing, k=3 has p50 of ~40ms vs liftwing at 240ms [13:48:09] dcausse: thanks for running those tests. Do we have any metrics from the OpenSearch-test cluster that could reveal the limiting factor? [13:49:27] pfischer: there are a couple dashboards https://grafana-rw.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&from=now-1h&to=now&timezone=utc&var-datasource=P0AF0B00C3C579A2D&var-namespace=opensearch-semantic-search&var-pod=$__all&var-container=$__all [13:49:41] and https://grafana-rw.wikimedia.org/d/c0a89788-c6fe-4d06-aeb2-70b63049599e/opensearch-on-k8s?orgId=1&from=now-1h&to=now&timezone=utc&var-datasource=P0AF0B00C3C579A2D&var-interval=1m&var-cluster=opensearch-semantic-search&var-node=opensearch-semantic-search-masters-3&var-shard_type=$__all&var-pool_name=$__all [13:49:54] the 100% cpu usage recently is me optimizing the index [13:51:43] hesitating between cpu & IO, we give only 1cpu which is possibly making the thread pools not very useful [13:52:18] will perhaps quickly try with 4 cpus/node to see if this helps [13:52:26] if not it's likely IO? [13:56:26] \o [13:56:40] o/ [13:56:44] mucking with thread pools seems plausible, 1 cpu also sounds very small [13:56:45] Okay, thanks. With only one CPU, thread pools will most likely not buy us much. Does OpenSearch rely on them? If we have a chance to use async I/O like netty, that could improve performance on single/sub CPU runtimes.
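For context on the workload being benchmarked above: a minimal sketch of the kind of k-NN query the load test issues against OpenSearch, assuming a hypothetical vector field name (`embedding`) and dimension; the real mapping lives in the paste linked later in the day.

```python
import json
import random

DIM = 768  # assumed embedding dimension, for illustration only

def knn_query(vector, k=3):
    """Build an OpenSearch k-NN plugin query body (field name is a guess)."""
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": vector, "k": k}}},
    }

vec = [random.random() for _ in range(DIM)]
body = knn_query(vec, k=3)
print(json.dumps(body)[:80])
```

With k=3 this is the shape of request that hit ~40ms p50 on relforge.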
[13:58:15] the cpu load shows User is 25ms but kernel is 172ms [13:58:32] if we expect it to be mostly network bound could try and just set the threadpools higher, not sure if we can change that at runtime, but i know they were in elasticsearch.yml before [14:02:00] hm... does not seem we can change that dynamically and we can't change opensearch.yml [14:02:13] can try bumping to 4cpu at the pod level and see [14:03:40] huh, indeed not seeing the .yml in the chart templates, kinda surprising [14:04:33] there's this additionalConfig thing but looks like it's not working very well, it's passing these options via "-e" to the command line [14:05:31] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1245366 [14:06:40] lesson from yesterday is that I need to be gentle when deleting pods [14:06:47] indeed :) [14:09:32] hmm, the only notes i see about additionalConfig: is that it needs to be in flat dotted notation (a.b.c = foo) and not nested [14:09:48] but it does seem to claim that should work [14:14:16] IIRC Brian did try but apparently this did not work... [14:21:06] Beware of that "interval" dropdown on the OpenSearch k8s dashboard. I think it provides a rolling average, but none of our other dashboards have that so it can be pretty confusing. Try lengthening the interval if the panels are blank [14:38:31] 4cores does not help [14:38:37] :( [14:38:53] but relforge using the same embedding endpoint does much better...very curious [14:39:41] relforge is doing 40ms p50, dse-k8s is more around 1sec [14:40:20] cpu usage is all kernel [14:40:28] ceph is running in kernel space? [14:40:54] hmm, ceph is mounted so it seems plausible all the io is kernel level [14:41:32] it's clearly not cpu bound [14:42:01] can we double the memory and see if it changes? [14:42:26] sure [14:43:31] which grafana dashboards are you using?
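The 25ms user vs 172ms kernel observation above works out to the kernel doing the vast majority of the work, which is consistent with the I/O-bound (ceph, kernel-space) theory rather than a thread-pool/CPU problem. A quick sanity check on the arithmetic:

```python
# Numbers from the pod CPU panel quoted above.
user_ms, kernel_ms = 25, 172
kernel_share = kernel_ms / (user_ms + kernel_ms)
# ~87% of CPU time in the kernel -> consistent with "cpu usage is all kernel"
print(f"kernel share: {kernel_share:.0%}")
```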
[14:44:24] using both https://grafana-rw.wikimedia.org/d/c0a89788-c6fe-4d06-aeb2-70b63049599e/opensearch-on-k8s?orgId=1&from=now-1h&to=now&timezone=utc&var-datasource=P0AF0B00C3C579A2D&var-interval=5m&var-cluster=opensearch-semantic-search&var-node=opensearch-semantic-search-masters-0&var-shard_type=$__all&var-pool_name=$__all [14:44:31] and https://grafana-rw.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&from=now-1h&to=now&timezone=utc&var-datasource=P0AF0B00C3C579A2D&var-namespace=opensearch-semantic-search&var-pod=$__all&var-container=$__all [14:44:35] i suppose i was wondering, is this network traffic us: https://grafana.wikimedia.org/goto/Wx371xdvR?orgId=1 [14:45:29] hmm, suggests no based on that network graph [14:45:48] but the timing lines up well with elevated cpu usage on opensearch [14:46:03] very possible [14:46:36] but I don't see a strong correlation with the pod network usage [14:46:52] perhaps ceph network is not captured at the pod level? [14:47:22] maybe, will try and tease out more details in the grafana explorer [14:48:14] yes https://grafana.wikimedia.org/d/000000473/kubernetes-pods?orgId=1&from=now-6h&to=now&timezone=utc&var-cluster=P0AF0B00C3C579A2D&viewPanel=panel-36 seems definitely me testing and optimizing [14:48:34] oh yea, thats opensearch. Setting container=opensearch in the query keeps the same graph [14:48:54] i suppose i would naively say that the memory is too small, it needs more room for disk cache [14:49:11] yes [14:49:43] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1245373 should be the max we can ask within our namespace quota [14:54:52] if you need a quota increase LMK, I got the impression from yesterday that we would allow y'all to have the full amount of RAM you asked for next week? 
btullis LMK if that is not correct [14:55:31] sigh maximum memory usage per Container is 16Gi, but limit is 21Gi, maximum memory usage per Pod is 20Gi, but limit is 22548578304 [14:56:08] perhaps just because of how we ran things for the last decade, but those numbers all seem so small :P [14:56:24] I think Ben wanted us to do a smaller scale test with frwiki first [14:56:43] 16g seems to be the max :/ [14:57:03] Yeah, the idea was that we'd scale up as needed, and it's clear it's needed ;). I'll get a patch started for the quota increase [14:58:09] with 81G of primary indices, it's about 12G each. I guess 16 should have been most of the way there unless it's not using disk cache? hard to say [15:00:42] maybe also we should look at readahead, that was a big problem in the past for us..hmm [15:00:49] knn is going to be very random reads iiuc [15:04:24] hmm, we can't see the mount from inside the pod. `mount` reports /dev/rbd2, but ls /dev doesn't show it [15:04:44] Did they ever fix the readahead issue in newer versions of Elastic/OpenSearch? Or is it more like "this is standard kernel behavior and it works well for most other apps" [15:06:05] inflatador: i see some tickets that madvise(...) is now done by lucene, since early 2025. not sure if that's in our current version [15:06:56] https://opensearch.org/artifacts/by-version/#release-3-3-2 [15:07:13] Looks like it should be [15:07:30] depends if they updated lucene versions there, i'm not seeing which version of lucene started supporting it [15:08:34] gemini claims in lucene 10.0 it was enabled by default, and opensearch 3.3.2 is using 10.3.1. So maybe [15:10:44] That's cool. 
I was mainly wondering if it was even possible to set outside of a sysctl or the like [15:10:57] or y'know, a nice C program ;P [15:11:13] not really, i wrote that C program awhile ago that reaches in and does the madvise, lucene is now using FFI (foreign-function-interface) to basically hit the same syscall [15:11:38] but hopefully more intelligently, i just applied it across the board [15:12:02] https://grafana-rw.wikimedia.org/d/c0a89788-c6fe-4d06-aeb2-70b63049599e/opensearch-on-k8s?orgId=1&from=now-30m&to=now&timezone=utc&var-datasource=P0AF0B00C3C579A2D&var-interval=5m&var-cluster=opensearch-semantic-search&var-node=opensearch-semantic-search-masters-0&var-shard_type=$__all&var-pool_name=$__all&viewPanel=panel-15 [15:12:35] seems to help a bit, I only restarted -0 with 16G and latency is way lower than other nodes [15:12:35] terrible, but better [15:12:45] oh, then maybe not even terrible :) [15:13:15] 16G gives 12G spare and just under 12G of indexes. If we need the entire index worth of extra ram that's going to be a big lift [15:13:16] still bad ~194ms vs 40ms [15:13:29] yes :/ [15:13:44] it's still warming apparently [15:13:57] i suppose what i'd love to see is a local-disk cache. Like an SSD that sits between ceph and the instance and caches things [15:14:08] yes... [15:17:22] seems to stabilize about 81ms [15:17:36] s/about/around [15:17:37] that's livable for a test at least [15:17:54] restarting other nodes [15:17:55] local storage?
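The madvise trick discussed above (the old standalone C program, and what Lucene now does per-file via FFI) can be sketched in a few lines of Python: map a file and advise the kernel that access will be random, so aggressive readahead is pointless. This is a Linux-only illustration (Python >= 3.8), not the actual Lucene mechanism:

```python
import mmap
import os
import tempfile

def advise_random(path: str) -> bool:
    """madvise(MADV_RANDOM) a whole-file mapping, roughly what Lucene does via FFI."""
    fd = os.open(path, os.O_RDONLY)
    try:
        mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ)  # length 0 = map entire file
        try:
            mm.madvise(mmap.MADV_RANDOM)  # hint: expect random access, skip readahead
            return True
        finally:
            mm.close()
    finally:
        os.close(fd)

with tempfile.NamedTemporaryFile() as f:
    f.write(b"\0" * 4096)
    f.flush()
    ok = advise_random(f.name)
print("MADV_RANDOM applied:", ok)
```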
Nah, we just need https://en.wikipedia.org/wiki/NVMe_over_TCP ;P [15:18:21] oh and I disabled fetch the _source :/ [15:18:26] inflatador: if we can get 1TB networking to go with it, sure :P [15:18:31] need to re-enable this [15:20:02] ;) [15:21:41] shouldn't be surprised, system won't let me `mknod` the device to inspect it (since it's not in /dev) [15:21:53] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1245384 is ready for review if anyone wants to take a look [15:22:07] I fixed the problem with container limits David caught [15:22:18] +2 [15:22:30] cool, will deploy admin_ng shortly [15:23:45] /sys/block/rbd2/queue/read_ahead_kb claims read ahead is 8192 kb, which is kinda huge if it is hitting it [15:24:10] i suppose after the next restart we could change that to 512 or 128 and see what happens, but maybe it needs big readaheads to cover the massive (relatively) network latency [15:24:38] or maybe lucene is using MADV_RANDOM and skipping that entirely. In the old system i teased that info out using kernel traces, but i don't have that kind of access in k8s [15:25:02] (it also took like 2 or 3 full days of toying around to figure out the right places to trace :P) [15:25:11] :) [15:26:26] so the sad part is that our early estimates were very wrong and half the shard for filecache is definitely not enough [15:26:27] i suppose we could also try apply the madvise-random program, but that probably needs root in the container [15:26:55] if there are tricks to limit IOs that'd be great [15:29:23] FWiW, we ran an entire DBaaS product off 20 Gbps iSCSI, so it is possible to get decent performance with remote storage...just not sure what levers we have in our environment [15:31:16] i suppose my next steps would be trying to reduce the readahead and see if it matters. 
Don't have strong proof it's the cause, but it's been the cause before and the 8MB read-ahead reported in /sys seems suspicious [15:31:46] knn is a highly random-access thing, most of that 8MB is probably useless for answering the current query [15:33:07] admin_ng is deployed, feel free to bump up to 32 GB or whatever you need now [15:33:19] thanks! [15:34:32] i also see some notes that knn can be off-heap. I would guess that anything off-heap is not benefiting from lucene internals around MADV_RANDOM [15:35:38] so many tools missing...i also can't use the kernel page-types tool :P [15:36:03] If you need me to be your hands and run some `nsenter` commands LMK. [15:36:26] page-types is a tougher one, whoever compiled the kernel has to compile and include it, and we don't by default [15:36:49] but we might want to change the readahead in /sys after david reboots the cluster [15:37:50] We can also set sysctls on the k8s workers with puppet, assuming they aren't going to hurt other workloads too much [15:38:18] i suspect for many things a big readahead probably does work, particularly things like spark. But (probably?) spark wont be using ceph [15:41:24] ok, a lot better, we can do ~25 qps after warmup with 16g (12g for filecache and a shard size of 13.4g) [15:41:37] so, as long as we don't involve ceph :P [15:42:22] it's ~100ms for liftwing + 220ms for opensearch [15:42:39] oh curious, actually network traffic is still quite high [15:43:36] yes... saw this, hoping it's just an artefact of the average while it was warming up [15:44:42] yea possible [15:45:34] well, i guess that graph is fs_reads, maybe it's not all network [15:45:48] oh [15:45:50] but i guess i'm thinking the reads come from network, but maybe that also includes reading from page cache? i dunno [15:48:50] this might be actual network traffic? but curious it was low before: https://grafana.wikimedia.org/goto/SdaXBbOvR?orgId=1 [15:49:10] err, oh. 
It's all TX, no RX :S [15:49:40] https://grafana-rw.wikimedia.org/d/f1e5bb5b-7fff-40ec-9cb3-39ed7510e81f/ceph-pools?orgId=1&from=now-6h&to=now&timezone=utc&var-datasource=000000026&var-cluster=cephosd&var-site=eqiad&var-topk=15&refresh=5m [15:49:42] :/ [15:50:17] hmm, so very little according to ceph? [15:50:52] we might need cluster=cephosd there? [15:50:57] I still see 3.68Gbs [15:50:59] yes [15:51:12] yea still very high there [15:51:36] only master-1 on the graph you shared [15:52:15] dcausse: i think my graph might be off, it's only reporting TX. Maybe master 1 is the one you are querying? [15:52:26] possible? [15:52:36] but ceph would be receives, under RX. No clue why that dashboard only has transmits and not receives [15:53:32] oh...the dashboard is just mis labeled :P the receives bytes is still flagged TX, so they are there [15:53:53] but yea, it's only the one pod doing heavy receives. weird [15:54:41] latencies are roughly the same across all nodes, weird... [15:55:10] but that 3Gbps seen by ceph is concerning [15:55:43] i suppose on the upside, for the test they want to run this will "work", it's just wasteful [15:55:43] masters-1 must be the coordinating node [15:55:58] can we do enwiki? [15:56:19] well, hmm...maybe :( [15:56:39] not with 3 replicas I'm afrais [15:56:47] *d [15:56:57] inflatador: can we try changing the /sys/block/rbd2/queue/read_ahead_kb ? perhaps take it down from 8192 to 1024 or 512 [15:57:19] the exact device might vary, thats just masters-1 [15:57:52] i think that has to be done on the host, from the container it's read-only fs [15:58:37] Yeah, I can give it a shot. 
At the risk of being obvious, I'm not sure if changing it on the worker will necessarily change what ceph does [15:58:59] yea it's hard to say how it all works together [15:59:32] OK, looks like master-1 is on `dse-k8s-worker1001.eqiad.wmnet` [15:59:58] masters-1 is also the one showing very high network vs the others, so seems a reasonable place to test [16:03:30] i suppose the other oddity i can't explain, masters-1 shows high RX and TX, but it shouldn't really be pushing much data out [16:13:27] doh, i guess i should have been watching my emails better. apparently left puppet turned off on relforge for awhile. fixing [16:21:05] ebernhardson OK, ran `echo 512 > /sys/block/rbd2/queue/read_ahead_kb` after verifying that was the right one for `masters-1` [16:21:19] inflatador: thanks! will see if the graphs move [16:21:55] i suspect dcausse might need to turn the load test back on? or it's really helping [16:22:04] sure [16:22:08] yeah, if you know a fast way to map from k8s API -> rbd let me know, I just fished around with findmnt and the pod UUID from `k get po -o yaml` [16:23:13] ebernhardson: restarting it but if you need to run it after it's there: https://gitlab.wikimedia.org/dcausse/semanticsearch_demo/-/blob/main/locustfile.py?ref_type=heads [16:23:15] i have no clue :P i would probably poke around kubectl describe or some such [16:23:29] dcausse: oh cool, thanks [16:23:38] running from a stat host? [16:24:11] from stat1009 [16:25:04] it needs pip install locust and the opensearch client [16:25:22] this is the report I just captured https://people.wikimedia.org/~dcausse/semsearch_load_test/frwiki_7_nodes_4cores_16Gb.html [16:26:49] Looks like the pods (or at least masters-1) are still at 16 GB RAM? [16:28:08] they are, I adjusted them manually while you increased the quotas [16:28:29] happy to bump to 21G if we want to test?
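The "fish around with findmnt and the pod UUID" workflow above can be semi-automated by parsing `findmnt -J` (JSON output) and looking for the mount whose target contains the pod UID and whose source is an rbd device. A sketch, with the pod UID, paths, and device name entirely made up for illustration:

```python
import json

def rbd_for_pod(findmnt_json: str, pod_uid: str):
    """Find the /dev/rbd* source backing a mount whose target mentions pod_uid."""
    def walk(nodes):
        for n in nodes:
            if pod_uid in n.get("target", "") and n.get("source", "").startswith("/dev/rbd"):
                return n["source"]
            found = walk(n.get("children", []))
            if found:
                return found
        return None
    return walk(json.loads(findmnt_json).get("filesystems", []))

# On a worker you'd feed it real output, e.g.:
#   out = subprocess.run(["findmnt", "-J"], capture_output=True, text=True).stdout
sample = json.dumps({"filesystems": [{"target": "/", "source": "/dev/sda1", "children": [
    {"target": "/var/lib/kubelet/pods/1234-abcd/volumes/pvc", "source": "/dev/rbd2"}]}]})
print(rbd_for_pod(sample, "1234-abcd"))  # → /dev/rbd2
```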
[16:29:04] ceph is still showing 3.6GB/s :S [16:29:13] so at least that readahead setting doesn't seem to have a top-level effect [16:29:16] :/ [16:30:07] need to pack some stuff, back in a few, let me know if you want to try 21G [16:30:34] i can prep that, i should actually try restarting the clusters too to make sure i know how it's done [16:32:13] yeah, feel free to go up, y'all can get pods up to 32 GB now. Procedure is at https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/OpenSearch-on-K8s/Administration#Changing_Resources_on_a_Live_Cluster but feel free to ping me if you have issues [16:38:12] would you like me to leave the masters-1 readahead value at 512? [16:38:25] inflatador: nah lets change it back, doesn't seem to have had an effect [16:38:48] well, maybe a small effect, but not enough to solve the problem [16:39:18] OK, reverted [16:47:14] parent-teacher conference/lunch, back in ~2h [16:57:17] Got my eyes dilated at the eye doctor. Can't read at the moment, so I'm going to take a half day. [17:05:34] fun! [17:57:55] more memory had no difference on ceph usage :( [18:00:39] oh, no i'm a dummy. values.yaml had 21G but not applied yet. Curiously the graphs did briefly show the raised limit [18:17:07] perhaps we could still have replication but force searching on primary shards with preference=primary, so we should always hit hot shards, rather than having to fit all replicas in mem [18:22:56] curiously, the increase to 21G may have actually eliminated most of the ceph io, it's running now but ceph is reporting minimal reads [18:23:07] well, maybe not so curious, it's just finally "enough". But it's a lot [18:23:44] i suppose next step would be to turn up replication to 1 primary 1 replica, and see if it tanks?
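The preference=primary idea above amounts to appending a `preference` query parameter to the search URL so only primary shards (the "hot" copies) serve the request. A sketch of building such a request against the endpoint and index named elsewhere in the log; the query body itself would be the k-NN query:

```python
from urllib.parse import urlencode

# Host and index taken from the bulk-load command shared later in the day;
# whether `_primary` behaves as hoped is exactly what gets tested below.
BASE = "https://opensearch-semantic-search.svc.eqiad.wmnet:30443"
index = "frwiki_content_20260215"

params = urlencode({"preference": "_primary"})
url = f"{BASE}/{index}/_search?{params}"
print(url)
```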
[18:24:06] yes shards latencies are <10ms [18:24:33] you'd need to increase ceph volumes for this [18:24:44] although maybe i didn't run it the same, i used `locust -f locustfile.py --headless` [18:25:34] oh yes I have a web ui where I can reset/stop/start [18:25:36] stopping mine [18:25:59] I looked at https://grafana-rw.wikimedia.org/d/c0a89788-c6fe-4d06-aeb2-70b63049599e/opensearch-on-k8s?orgId=1&from=now-30m&to=now&timezone=utc&var-datasource=P0AF0B00C3C579A2D&var-interval=5m&var-cluster=opensearch-semantic-search&var-node=opensearch-semantic-search-masters-0&var-shard_type=$__all&var-pool_name=$__all&editPanel=15 [18:26:09] oh, indeed now it's running way faster :) i for some reason thought yours stopped, but that was just the improved memory i guess making the graphs drop [18:26:11] which clearly shows when you bumped to 21g [18:27:03] tbh if we send preference=primary the latencies are low enough that we might not need replication for throughput but only for recovery [18:27:41] we might fit all primary shards in mem? [18:27:53] hmm, maybe. I'm going to add the replica and see, then can try adding preference=primary [18:28:04] sure [18:28:15] beware that you might fill up the disks :/ [18:28:52] oh, i totally didn't think to check [18:29:16] when I optimized earlier today I almost filled the disk [18:29:36] yea 15.7G free and 13.6G used, it's going to hit all the watermarks if i turn on a replica [18:30:32] can i just change the disksize in values.yaml? not sure what that will do [18:32:46] seems like it? https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/OpenSearch-on-K8s/Administration#How_to_Expand [18:32:47] dcausse: if i break everything and need to reload the index, how does that work?
[18:33:07] sure, I was about to share the notebook [18:33:09] with no replicas it seems a plausible outcome :P [18:33:14] :) [18:34:56] hmm, the doc says it should be visible immediately, it does seem to be working its way across [18:35:04] (i just used --set w/ helmfile apply) [18:35:43] perhaps it needs yet another rolling restart? [18:35:44] i'm mildly surprised it can do that while running, i suppose i would have expected some difficulty with the fs-layer, kinda cool [18:36:03] it just seemed to take maybe 20s per instance, it slowly worked across. /_cat/allocation now reports 49G everywhere [18:36:36] just applied the extra replica, will see what it does [18:37:46] ebernhardson: stat1009:~dcausse/notebooks/semsearch_prep_index.ipynb [18:38:02] thanks! [18:38:33] there's a cell that sources the embedding partition [18:39:02] another cell is just writing a ndjson somewhere [18:40:13] then I use hdfs dfs -text /path | split -l 100 --filter 'curl -u opensearch:changeme --data-binary @- -HContent-Type:application/x-ndjson https://opensearch-semantic-search.svc.eqiad.wmnet:30443/frwiki_content_20260215/_bulk' [18:41:44] the index settings/mappings are there https://phabricator.wikimedia.org/P89064 [18:42:20] Namespace opensearch-semantic-search has reached 85% of its limits.memory resources :) [18:45:52] heh, ouch. Recoveries finished, nodes added, everything tanked :P [18:45:58] s/nodes/shards/ [18:46:10] will stop it and test the primaries bit [18:46:54] an annoying detail I forgot to comment: discovery.wiki_content_embeddings has snapshots set to data_interval_end while cirrus is data_interval_start so it's normal that it's trying to join cirrus@20260208 and the embeddings partition from 20260215 [18:47:45] fun :) [18:48:03] requests are slowly speeding back up with preference=_primary, it's now back to "full" speed. so that is a plausible solution [18:50:25] i guess i'm a bit curious...going to try and move the memory down to see what it actually needs.
Although not sure how much that tells us [18:52:14] also a bit curious if it will naturally re-balance the primaries after a promotion, or if it gets into an awkward state for preference=_primary [18:54:24] there's _primary_first which seems to suggest a primary shard may not be available, not sure in what condition this happens [18:55:22] the primary got promoted, but now masters-3 has 2 primaries, and -0 has 2 replicas [18:55:43] so the pref only works if the primaries are actually spread out [18:55:43] sigh of course :( [18:56:32] but the requests at least keep running while yellow, so it is probably immediately promoting the replica [19:01:24] perhaps index.routing.allocation.total_primary_shards_per_node=1 ? [19:01:33] hmm, yea maybe [19:01:47] will finish the restart and then see if that helps [19:01:55] reading these settings I see index.number_of_replicas & index.number_of_search_replicas but failing to understand the difference [19:03:34] ah seems opensearch specific https://docs.opensearch.org/latest/tuning-your-cluster/separate-index-and-search-workloads/ [19:03:39] docs for total_primary_shards_per_node: This setting is applicable only for remote-backed clusters. [19:04:01] which is the bit where shards are stored in S3 or whatever, and pulled to the local disk to run [19:04:33] still can't hurt to try [19:04:40] last instance restarting now [19:06:43] yea, can't set that: [{"type":"illegal_argument_exception","reason":"Setting [cluster.routing.allocation.total_primary_shards_per_node] can only be used with remote store enabled clusters"}], [19:08:42] i'm sure i could write something that pings the cluster every minute and tries to move things when the cluster is green...but seems a terrible idea [19:09:28] although...actually i'm not seeing an option to promote a replica [19:13:29] back to 0 replicas to at least re-balance things... [19:15:24] well, at least it tells us 18G is enough, 16G wasn't.
huge difference in just 2gb [19:17:00] looks like you're in the vicinity, but Ops Week item subject:"ResourceQuotaMemoryRequestsCritical data-platform (opensearch-semantic-search k8s-dse critical eqiad prometheus)" [19:17:02] https://grafana.wikimedia.org/d/ca9c0221-4a0d-4833-865b-f14a3e813c97/kubernetes-resource-quotas?var-ds=thanos&var-namespace=opensearch-semantic-search&var-prometheus=k8s-dse&var-site=eqiad&orgId=1&from=now-6h&to=now&timezone=utc [19:17:12] Seems to be pegging out the memory [19:17:12] err, no helmfile is showing it was still set to 21G :S I'm sure i applied that (and it's in my bash history) though....maybe more memory is failing [19:17:41] dr0ptp4kt: yea, the system is trivially burning through all the memory, it doesn't like ceph much [19:18:12] ebernhardson :sighs and fixes stuff: :) [19:18:28] * dr0ptp4kt thankful ebernhardson on it [19:18:52] dr0ptp4kt: it will probably complain all day, mostly load testing and trying to figure out if there is a config that works [19:19:15] it does <100ms when happy, ~600ms when unhappy, and >=1s when really unhappy [19:19:33] but happy means index size == memory overhead, which is a lot [19:23:20] yeah. i was thinking maybe the aws folks had an approach here that's satisfactory for them and customers. i see https://docs.aws.amazon.com/opensearch-service/latest/developerguide/multi-tier-storage.html , although exact translation to our infra isn't necessarily the same and i'm unsure how they back the managed os and its disk precisely [19:23:51] i suspect our more ideal solution is an ssd cache layer between ceph and opensearch on the k8s hosts, but not sure thats possible in our setup [19:24:31] right, a large enough persistent volume with that could maybe work - how big is the disk page on the bare metal?
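The "index size == memory overhead" constraint above can be sanity-checked against numbers quoted elsewhere in the log: 81G of primary indices spread over the 7 data nodes, plus the 4G heap mentioned later. Rough arithmetic only, assuming even shard distribution:

```python
# Figures quoted in the chat; the 1:1 cache:index ratio is the "happy" case.
primaries_gb = 81
nodes = 7
heap_gb = 4

per_node_index = primaries_gb / nodes
needed = per_node_index + heap_gb
print(f"~{per_node_index:.1f}G index/node -> ~{needed:.0f}G memory/node at a 1:1 cache ratio")
```

That lands right around the 16-18G band where the cluster flipped from unhappy to happy.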
[19:24:31] otherwise we get to ~4GB/sec of ceph traffic which looks like a limit somewhere [19:24:40] not sure, i can't see that [19:26:20] i also wonder about readaheads, i suspect ceph is pulling way more data than we need (knn is mostly random access), but also not finding where i can set that. We tried updating readahead in /sys but didn't help so maybe not [19:26:41] but entirely plausible that settings in /sys is only one piece of the readahead puzzle [19:34:10] it seems ~17Gb is the breakeven on memory, any less and it slows down. But not great because with 4G heap, that means we saved 13G for the disk cache, and the shards are 13.5G. We can't just 1:1 memory and index size [19:37:14] back [19:42:05] Do y'all need a quota update for the semantic search ns? [19:42:32] inflatador: the quota is probably fine, the problem is that it only seems to work well if we have a gb of free memory for every gb of index size [19:42:44] so for example, changing from 0 to 2 replicas means 3x the memory [19:43:36] probably wrong, but i'm still poking at readahead options :P [19:52:07] We could probably try a deploy using emptyDir (local storage) just for contrast, but I don't think we have the disk space to do that at the 3-wiki scale [19:53:16] inflatador: is there any option to stick a local-disk cache between ceph and the pod? [19:53:27] then at least we can know the size ahead of time, instead of trying to fit the full thing [19:54:14] emptyDir can't hurt, i guess it would let us turn down the memory and see how it performs when ceph is skipped [19:55:53] No option for a local disk cache AFAIK. Sounds like Ceph maxes out at about 4 Gbps? [19:57:01] yea, but not sure which end that cap is on [19:57:24] i suppose i'll turn the memory back down a bit, right now it's at 17GB and able to fit it all in memory [20:06:54] inflatador: oh a random idea that occurred to me...back in the day we noticed the mmap's copy the readahead from the block device into their own memory.
Maybe we could try lowering the readahead in /sys and restart the instances? But i don't know if the readahead would keep [20:07:01] or maybe i can just close the index and reopen, not sure [20:08:05] based on T416881 it looks like we have about 100 GB of free space on each K8s worker, although I would be very reluctant to use more than 1/2 of that for any period of time. Would 50 Gb/pod be enough for the frwiki tests? [20:08:06] T416881: OpenSearch on K8s: Discuss storage-saving options - https://phabricator.wikimedia.org/T416881 [20:08:35] inflatador: the indices are ~14g/pod (with no replicas), so that would work [20:09:12] it looks like `blktrace` might be able to tell us if readahead is happening, but somehow i doubt that's installed, no mention in puppet repo [20:10:07] OK, will have to clear with the rest of the team next week, but that's a lever we could pull. That could make rolling restarts a bit more difficult though [20:11:05] inflatador: it wouldn't work for long term, it would only really tell us how much the extra network layer is costing i suppose [20:11:23] the expected size with replicas and enwiki and such will be too much [20:11:28] Yeah, although I think that would be good info [20:11:37] yea [20:11:58] Re: blktrace, looks like it's in Debian repos. I can one-off install it on a k8s worker if you like [20:12:37] inflatador: i don't really know much about it, but various docs claim it can read live readahead info: https://www.ibm.com/docs/en/linux-on-systems?topic=blktrace-data-io-requests [20:12:54] otherwise bpf can also probably do it, but i've never played with bpf :P [20:13:29] randomly searching I see that ceph has its own readahead settings https://docs.ceph.com/en/reef/rbd/rbd-config-ref/#read-ahead-settings [20:13:59] ok pods are now all 15G memory, they were "happy" at 17G, so in theory if something reduces the io usage it should be more obvious.
maybe [20:14:27] latency on 15G is about 2.5x latency at 17G [20:14:58] I know there is QoS to throttle ceph traffic, but it supposedly only kicks in when everything else is under duress [20:17:15] ceph io is currently ~1.75G, but not sure how much smoothing is in that. It climbs slowly [20:18:14] we were at 4 earlier, so it does seem to depend on available memory at least a little bit [20:33:02] inflatador: if you have some time, lets try and reduce the readahead again and close/reopen the indices. [20:35:06] i suppose ideally we would want to change that on all pods, pondering how to notice if one pod is happier. Probably by using preference=_shards:0 to limit the query to a single shard [20:36:59] ya that works, we can just change 1 pod [20:40:29] ebernhardson just got back, doing it now [20:41:04] inflatador: thanks! just lemme know which one [20:41:30] it looks like right now we see ~340MB/s with one host querying [20:43:20] ebernhardson ACK, just set it on `masters-1` [20:45:03] hmm, latency is a bit better with the close/reopen. Not as good as before, but now maybe 1.2-1.5x slower than with more than enough memory [20:45:17] probably have to wait a few minutes for the grafana graph to catch up [20:46:21] yea, we are at ~125MB/s instead of 330MB/s before the change [20:46:31] inflatador: try lowering it again? [20:47:36] actually it kept going down, now ~60MB/s [20:47:52] and latencies are right around 100ms, which is where we were with excess memory [20:48:46] That means you're getting the same performance but less network traffic? [20:49:33] inflatador: without readahead lowered we were ~250-300ms, now we are at ~100ms.
100ms is also where we were at when we had 17G memory (currently at 15G) [20:49:39] i should probably make a table of this info :P [20:50:07] can pull down memory to see how far it can go, but was thinking first to test how low readahead can go before the overhead of multiple-requests is worse [20:50:27] io keeps declining, grafana now reporting ~30MB/s for the one host [20:50:33] (it's only querying a single shard right now) [20:51:57] Interesting. I wonder if there is an option in the Ceph CSI provider for this? 99% sure that's not gonna be exposed in the current helm chart [20:52:17] yea we will need to figure out how to make it more permanent [20:52:23] but first step, just finding what works :) [20:53:22] gemini did suggest we could define a separate StorageClass in helm, and we could set ra there, but i'm not sure how much it actually knows about that [20:54:07] Yeah, ChatGPT claims the Ceph CSI provider doesn't expose that [20:55:14] I'm optimistic someone in dpe-sre will have good ideas, just needs a few eyes [20:55:23] My guess is that it would negatively affect most other workloads or the kernel would have it set down automatically, but I really have no ideas. [20:56:00] i suppose my hope (not really sure) was that there is a 1:1 mapping between block devices and pods, so as long as it's set on "our" block device and not all of them, it should be fine [20:56:39] but at least in classic linux fashion, one block device maps to one filesystem, multiple filesystems in multiple pods would have separate block devices registered [20:57:22] that's correct. If this was a 100% OpenSearch cluster we could just set it everywhere [21:00:56] deleted pod 1 and bringing it up with 13G now (-2gb). Wonder if the block device settings will keep [21:01:24] nope, it's reset :( [21:03:38] fixed [21:04:24] this whole workflow reminds me of tracking down swappers in the rackspace days. Mapping a nova UUID to a XenServer UUID to a block device path.
Not sure if that makes me happy or sad
[21:05:04] currently at ~200ms with 13G, but i suspect it needs more warmup time
[21:06:01] yea, it doesn't seem like a great workflow
[21:07:36] down to ~130ms
[21:08:10] Oh yeah, no complaints. I guess it's "the more things change the more they stay the same" kinda vibes
[21:12:31] i/o doesn't seem to be declining anymore. So 15G needed ~30MB/s, 13G needs ~75MB/s. Latency increases about 20% between the two
[21:12:37] but even 13G is probably a bit big
[21:13:08] 9gb disk cache vs 13.5gb of indices, about 66%
[21:13:20] i guess let's try and pull it down to 50%
[21:15:00] restarting masters-1 again
[21:16:14] We couldn't make this problem go away with enough RAM, could we? Like if we gave each pod 64 GB RAM?
[21:16:49] inflatador: sure we can, that's the test that worked earlier today. But that's a lot of ram. enwiki would need 2TB
[21:17:43] damn
[21:17:44] ohh, maybe not, i think i misread the spreadsheet. but i think the three test indices (enwiki/frwiki/ptwiki) would need just over 1.1TB
[21:18:45] inflatador: can we set the readahead, yet again?
[21:18:57] ebernhardson done
[21:19:31] I got lazy and didn't look up the rbd since it seems to be staying the same, if you don't get expected results LMK
[21:19:58] it seems to be better, but needs a few minutes to stabilize io
[21:20:47] it is the same rbd after all, I'll make some notes
[21:21:45] ebernhardson sounds like we'd just need enough memory per pod to fit 1x shard of each wiki? (~34 GB)?
[21:22:13] looks like roughly, at index:memory ratio of 100% ~100ms, at 66% ~130ms, at 50% ~190ms
[21:22:41] with io of 30MB/s, 70MB/s, and 120MB/s respectively
[21:22:50] inflatador: try setting 128?
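The approximate single-shard results quoted above, tabulated (these are the rough figures from this test run, nothing more):

```shell
# Measured (approximate) results for masters-1, single-shard queries
printf "%-8s %-10s %s\n" \
  "ratio"  "latency"  "read IO" \
  "100%"   "~100ms"   "~30MB/s" \
  "66%"    "~130ms"   "~70MB/s" \
  "50%"    "~190ms"   "~120MB/s"
```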
[21:23:05] (i think we've been testing 512)
[21:24:35] ACK, just set to 128
[21:26:19] does seem to have maybe brought down latency again, from 190ms to 160ms at the 50% index:memory
[21:26:29] have to wait to see what IO does in grafana
[21:27:57] io potentially declined from 120MB/s to 60MB/s for same workload
[21:30:00] inflatador: i guess try again at 64? I guess i was looking for when the latency/io turns back around and gets worse. At 128 it seems to have stabilized at ~50MB/s for the single node workload.
[21:31:11] after warming up w/128kb readahead the 50% index:memory is down to ~150ms
[21:31:26] ebernhardson FWIW I've never done 64, maybe I misunderstood you. 128, 512 and default is all I've done
[21:31:40] inflatador: right, i was asking to bring it down again, this time to 64
[21:31:53] or maybe 32? basically i'm trying to find where it starts getting worse from lowering the readahead
[21:31:55] ebernhardson ACK. Just set it to 64
[21:31:57] so far every change is just better
[21:32:58] It also occurs to me that we would be able to set readahead in a VM as opposed to a container
[21:33:08] yea usually
[21:33:34] latency w/ 64 is similar or maybe even slightly worse. So at least we found it, maybe :) Going to let it run a few minutes
[21:34:29] I/O might have declined again, to ~40MB/s, actually latency is now coming in fairly similar, right around the 150ms mark
[21:35:06] but it needs like 5min to stabilize and be more sure
[21:38:07] yea latency is slowly climbing down, avg at 135ms now. I guess i should have been tracking more though... noticing the median is 500ms, probably should be looking at both avg and median
[21:39:39] should also probably test different parallel request rates. I've been mostly testing with 1 parallel request. But at least we know what we need to look into
[21:42:45] Yeah, I guess I'm still thinking in terms of throwing HW at the problem. Do you think it's worth it to try with a 32 GB pod or something?
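The search above (512 → 128 → 64, maybe 32) is a manual descent looking for the turnaround point. As an untested sketch only, it could be scripted roughly like this; `/dev/rbd0` and `run_bench.sh` are placeholders for the actual device and load-test command:

```
# Hypothetical sweep: lower readahead stepwise and record when latency
# stops improving. Device path and benchmark script are placeholders.
for ra_kb in 512 256 128 64 32; do
  echo "$ra_kb" | sudo tee /sys/block/rbd0/queue/read_ahead_kb
  sleep 300                     # ~5 min for IO/latency to stabilize, per above
  ./run_bench.sh --parallel 1   # record avg AND median latency, plus MB/s
done
```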
[21:43:34] yea 10 parallels is much different, io is up from ~40MB/s for 1 parallel to 150MB/s for 10 parallel
[21:44:20] inflatador: i'm not opposed to throwing excessive memory at it, i just worry we don't have enough for the full test. We would need just over a TB of memory for the best results we've seen
[21:46:54] ebernhardson ACK, that would be 1 TB **just** for the frwiki test, right? Not for the 3-wiki test?
[21:48:00] inflatador: would be the three wiki test, basically david's spreadsheet has index size (with replicas, iiuc) at 1073GB, and the best results so far are 100% index:memory ratio. With lowered readaheads we can probably get by somewhere in the 50-80% range
[21:48:26] at least, 50, 66 and 80% are what i've tested and they seem plausible
[21:48:33] shouldn't be too hard to set read ahead from the image bootstrap script?
[21:48:44] dcausse: it's read-only from inside the container
[21:48:55] ah sigh...
[21:50:48] y'all are approved for 24x 32 GB pods for the 3-wiki test. I'm making some notes on how to map from `kubectl get pod` to the specific rbd as well
[21:51:09] that's 768, so ~ the 80% test. Probably works fine
[21:51:21] (with lowered readaheads)
[21:52:13] oops, I screwed that up
[21:54:02] I wonder if we could hack the operator to do something like that?
[21:54:52] i'm not really sure, but i feel like readahead is way too generic to not be supported, there is probably something, just need to dig into how those bits work. I can probably start doing that monday if none of our EU colleagues just happen to know how it's done
[21:56:23] i'm going to head out for the weekend in a moment here, but at least we have a good lead on what's needed and a suggestion it will probably fit in the 768
[21:57:01] although we've only been testing single shard queries, it probably also applies to the full cluster once we have a handle on how to apply those readaheads across the board
[21:59:08] I can ask for the ~1 TB of memory as well.
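A quick arithmetic check of the figures above (1073 GB of index with replicas, vs. the approved 24x 32 GB pods); only the numbers already quoted in the discussion are used:

```shell
# RAM implied by each tested index:memory ratio, against 1073 GB of index
total_index_gb=1073
approved_gb=$((24 * 32))   # the approved 24 pods at 32 GB each
echo "approved: ${approved_gb} GB"
for ratio in 100 80 66 50; do
  awk -v r="$ratio" -v idx="$total_index_gb" \
    'BEGIN { printf "%d%% index:memory -> ~%d GB RAM\n", r, idx * r / 100 }'
done
```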
I'll make some notes on how to map the RBDs as well. Ideally that would go into the operator and/or the CSI plugin
[21:59:42] Is testing with local storage still useful?
[22:00:01] https://wikitech.wikimedia.org/wiki/User:BKing_(WMF)/Notes/Opensearch-on-K8s-rbd-mapping WIP
[22:00:36] inflatador: hmm, hard to say. I'm much happier with the numbers we are seeing now than we were with the full 8MB readahead, those numbers were painful. It probably wouldn't hurt as a data point, but probably less necessary as it now looks like ceph will at least plausibly work
[22:02:21] and i feel like we've pinned down that the major difference was the size of those readaheads, a typical local disk is probably (from my random experience) in the 128-512 area
[22:02:44] ok gotta run, see y'all next week
[22:02:59] ryankemper got anything for pairing? I'm just writing up ^^