[08:51:08] going to force re-index archive indices to make the list in T423993 a bit more manageable
[08:51:09] T423993: Upgrade old indices in the CirrusSearch opensearch clusters - https://phabricator.wikimedia.org/T423993
[10:29:10] lunch
[12:39:09] m
[13:03:34] \o
[13:32:35] was trying to remember, what was the model choice going forward for semantic? Thought it was jena but not finding in my history
[13:38:00] o/
[13:38:25] ebernhardson: I think jina-v5-nano performed well
[13:41:09] ahh, thanks. I couldn't remember how it was spelled and search wasn't helping
[15:03:04] meh still 1.7k old archive indices after a re-index... must be stale indices or I don't get what's happening
[15:03:48] ah no, was looking at an old log file... down to 385 old indices across all prod clusters
[15:04:23] nice!
[15:06:06] sigh, memory helps the dps service, but it still fails occasionally :S Error: getaddrinfo ENOTFOUND cirrustestwiki.mediawiki.local.wmftest.net
[15:06:22] i guess just give it more memory
[15:06:32] :/
[15:20:39] upped from 200m to 300m, it feels small for a jvm but a dns server is also not that complex
[15:21:56] feels odd to me to run a jvm with less than 1G :)
[15:42:28] sigh... seeing top_queries-2026.05.20-84955 indices on cloudelastic
[15:43:36] hmm, _cat/plugins doesn't suggest anything interesting, i thought that was part of their plugins and not in core :(
[15:44:34] same...
[15:44:56] they only have a single index, maybe one thing creates it, and the plugin populates it?
[15:45:29] or the cluster was live a short period with those plugins? but 2026.05.20 is fairly recent
[15:46:02] going to delete them
[15:46:13] o/
[15:46:22] Yes, it was live for a short time with the plugins
[15:46:31] also check-indices seems to miss a bunch of cirrus indices...
[15:46:56] dcausse: which ones?
[15:47:00] Surprised it would still be active as recently as yesterday though, especially since I rebooted everything
[15:47:43] ebernhardson: for instance: codfw,port:9443,index:zhwikinews_archive_1717123355,version:7100299
[15:47:52] true, it's missing an index for may 21st, maybe there were more plugins before the reboot somehow?
[15:48:24] I don't see zhwikinews_archive_1717123355 reported in check indices output
[15:49:11] dcausse: hmm, maybe i completely missed archive indices in there? I'll have to check, it should be whatever comes from Connection::getAllIndexSuffixes
[15:49:40] and that probably only returns content/general/file
[15:49:55] ebernhardson: I think archive are handled because it reports some of them like ruwiki_archive_1779376775
[15:49:55] oh wait, no it should return archives :S I'll have to dig in
[15:51:03] i suppose there is a separate question, maybe their query insights dashboard is useful? It looks like the structure is it should emit 1 doc per minute
[15:51:28] but i suspect it would be a mess in our prod clusters, unsure
[15:51:37] (mess like too much variance)
[15:52:30] creating a ticket to check the archive bits so we don't forget
[15:53:01] yes... a bit afraid of the overhead for this plugin to track actual live traffic, not really the volume of these indices
[15:54:15] hmm, yea that's potentially a problem too. I would hope they do very minimal work on a per-query basis but would have to look deeper to know
[15:54:40] I did briefly disable the systemd plugin filtering when we were having all the I/O issues, thinking that might have been part of the problem. I guess I must have forgotten to restart one of the nodes after removing the plugins or something
[16:08:25] hm my bad zhwikinews_archive_1717123355 is actually live, might have been a failed reindex
[16:09:55] ah mwscript-k8s has lost its link to k8s, checking if it's running
[16:10:42] they're still running indeed
[16:16:59] hmm, annoying edge cases around networks, who would have guessed! :)
[16:46:41] sigh annoying that mw-script --dblist=all stops on the first failure... it was so close to the end
[16:47:07] ah, it has --local_dblist which will be handy
[16:51:48] poking around, it looks like the default enabled options for the top_queries / query-insights bits should be pretty cheap. There are options that can be turned on to make it much worse, but at least the default data collection looks efficient
[16:52:36] there are probably a few default things we could turn off since i suspect "top-n slowest queries" is just going to point out things we already know
[16:55:40] one potentially useful bit (that we've seen, but very rarely) is diagnosing "why is the cluster on fire" via _insights/live_queries, although we kinda-sorta have that data through the tasks api
[16:57:21] sure, no objections to try it out but we may end up wanting opensearch dashboards :)
[16:58:58] true, although maybe can do something silly like run the dashboard locally and port-forward it into the cluster
[16:59:12] some docker container
[17:21:39] going to close the metastore indices before deleting them just in case we forgot something
[17:21:50] +1
[17:27:39] meh, apparently my signing key for maven releases has expired...now to remember how to update it
[17:28:02] [expired: 2026-05-05]
[17:42:41] annoying...
[17:44:26] turns out way easier than i expected. gpg --edit-key and run expire followed by save, then push to keyservers
[17:44:39] well, will see if central accepts it
[17:48:31] ok closed all metastores, nothing exploded, will delete them tomorrow
[17:48:36] dinner
[17:51:36] looks to have been accepted, or at least release:perform didn't fail. Will have to wait a bit for it to work through the nexus pipeline
[18:19:29] nice, yup it was that easy. fully published
[18:28:34] huh, did not expect that from gitlab. Fetching `https://gitlab.wikimedia.org/repos/search-platform/opensearch-plugins-deb/-/archive/2.19.5+4-trixie/random-does-not-exist.zip` gives a 200 OK and sends you the file opensearch-plugins-deb-2.19.5+4-trixie-24eb01ed9ec80f7b10621f9a8d94615ed2ce1660.zip
[18:28:52] i guess our check in the plugins.deb preparation can't use that anymore...
[18:31:03] DWIM for repos?
[19:37:47] lol
[19:37:58] looked into it, apparently everything after the version number is ignored
[20:00:59] Trey314159: Does this seem like a reasonable title/desc for a DPE deep dive? https://phabricator.wikimedia.org/P92811
[20:02:08] i guess the problem is, i've described the patch as opposed to what they would get out of listening. But i'm not really sure
[20:03:46] oh, duh, trey is off till tuesday :P
[20:30:24] inflatador: maybe you have some thoughts? I'm thinking the second version here might be reasonable? https://phabricator.wikimedia.org/P92811
[20:31:42] I prefer the 2nd version as well
[20:32:19] I worry though, an LLM didn't write it but i still did the A, B, C list :P
[20:32:47] it's annoying that things we've done for a long time now look LLM-y
[20:33:06] thanks though, i'll focus on getting the second one better
[20:33:41] It's very likely your code and prose have trained a ton of LLMs
[20:33:52] ;)
[20:35:50] ebernhardson: I peeked my head in for a moment, and the description looks good and the title is excellent!
[20:37:13] Trey314159: thanks
[20:42:20] my turn to ask something ;). I'm in mid-refactor of my tmux dashboards for opensearch maintenances. Do these seem like the top 4 API calls you'd want? I'm thinking of replacing `_cat/alloc` with `_cluster/allocation/explain` https://gitlab.wikimedia.org/repos/search-platform/es-maint-viewer/-/blob/main/cirrus_codfw.yml?ref_type=heads#L10
[20:42:34] looking
[20:43:38] inflatador: i agree allocation explain would probably be better than _cat/alloc, with the goal of not just knowing something is wrong, but what is wrong
[20:44:03] i've found it verbose and hard to read though, but it's got the more valuable debug info
[20:45:13] yeah, I feel like I just glaze over _cat/alloc and I have to go to the more verbose call to really get what I need
[20:45:41] If there are more useful calls besides those LMK, I'm trying to get something anyone can run on the deployment server
[20:46:13] i should really put my infinite history patch in my puppet profile, then i could just history | sort | uniq some variant :)
[20:50:25] It'd be fine, how much space could a few million commands take up? ;P
[20:51:07] in my history it's mostly _cat/health, _cat/recovery?active_only=true, /_cluster/reroute?explain=true
[20:51:51] there are a few that use `_cat/shards | grep -v green`
[20:52:49] oh yeah, maybe `_cat/shards | grep -v green` would be more useful than `_cat/health`. The cook-book already runs `_cat/health` constantly anyway
[20:56:10] deep dive scheduled for Jul 28, now i gotta remember to actually put something together (not too slideshow-y, but probably some graphviz and an idea of the order of talking)
[20:57:35] nice
[20:58:09] I think I'm gonna turn up the number of active recoveries when I reboot eqiad starting tomorrow. codfw has taken more than a day and still has 18 hosts left
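Editor's sketch, not from the channel: the read-only calls suggested above for the maintenance dashboard (`_cat/health`, `_cat/recovery?active_only=true`, `_cat/shards`, `_cluster/allocation/explain`) could be collected roughly as below. The base URL and the helper names `build_urls` and `drop_green` are hypothetical, and `drop_green` only mimics the `grep -v green` filter on text that has already been fetched; nothing here talks to a cluster.

```python
# Read-only endpoints named in the discussion. The POST-only
# /_cluster/reroute?explain=true call is deliberately left out.
ENDPOINTS = [
    "_cat/health",
    "_cat/recovery?active_only=true",
    "_cat/shards",
    "_cluster/allocation/explain",
]


def build_urls(base):
    """Return one URL per endpoint for a cluster base URL (placeholder value)."""
    root = base.rstrip("/")
    return [root + "/" + endpoint for endpoint in ENDPOINTS]


def drop_green(text):
    """Mimic `grep -v green`: keep only lines that do not contain 'green'."""
    return [line for line in text.splitlines() if "green" not in line]


if __name__ == "__main__":
    # Hypothetical host and port, for illustration only.
    for url in build_urls("https://search.example:9243/"):
        print(url)
```

Keeping the endpoint list as plain data means a tmux pane per entry can be generated from it, and the filter can be applied to whichever call turns out to be too noisy.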