[10:02:11] pfischer: if you have a sec https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1207806 I'd like to backport this asap
[11:04:24] lunch
[14:07:26] o/
[14:57:52] \o
[14:58:34] was pondering how relforge would work..one problem with a plain post-install role is i suspect it will leave the instance mostly firewalled off...needs some testing
[15:08:52] ebernhardson: thanks for looking into this!
[15:09:24] certainly, i suspect we can get something working but it might be a little awkward, will find out
[15:51:52] I can't make it on time for our retro, will be 30' late.
[15:52:18] I have a conflicting mtg, so won't be at retro
[16:33:52] Sorry, I am stuck with Luise, my wife is stuck in traffic. I can't make it to the retro.
[16:34:13] no worries
[17:19:35] err, doh...looking at puppet some more realizing if we put a generic post-install role on relforge we wouldn't be able to ssh in
[17:33:31] hmm, not finding anything. Do we have some sort of ticket related to opensearch 3 on relforge, or just opensearch 3 for testing vectors in general?
[17:38:52] ebernhardson no, but I'm working on https://phabricator.wikimedia.org/T407123 , hopefully when that's done we should at least have a way to install it. Would it be useful/possible to install OS3 in beta cluster?
[17:39:42] inflatador: oh sorry, i realized we talked about this without SRE around. We were trying to move a bit faster, so the idea was to set relforge to role::common::test and manually install it so research has something to experiment with
[17:39:44] on relforge
[17:40:16] i'm not sure how many things that mucks with though, we mention relforge in a variety of related areas (like defining spicerack clusters), and not sure if that all has to go away
[17:41:27] would a throwaway opensearch on k8s environment be useful?
We've talked about it in https://phabricator.wikimedia.org/T405246
[17:41:55] inflatador: i think the thing they need is enough cpu and ram, because they are going to be loading significant amounts of vectors
[17:42:01] All of it is kinda blocked on the opensearch repo stuff
[17:42:18] it also has to be opensearch 3
[17:42:21] ebernhardson ACK, I'm guessing 16GB/pod is probably not enough RAM?
[17:42:33] hard to say, but probably not.
[17:42:56] We don't have much experience with vectors so i'm not really sure, but ideally you want the entire dataset to fit in linux disk cache, but i have no idea how big that will be
[17:43:04] np, I need to fix the repo stuff anyway. I'm working on it now
[17:43:39] inflatador: would the idea be to install puppetized opensearch 3 on relforge?
[17:44:13] i guess my intent was set it to test role, manually install .debs, edit config, etc. Not reproducible in any way, but enough to get the testing going with the intent we re-image once done
[17:45:41] ebernhardson no, I'd be doing the repo stuff mainly to unblock OpenSearch 3 docker image builds. Puppetized OpenSearch 3 is not on the radar ATM, but you're welcome to create a role if you like. I also don't mind if you wanna do gross one-off stuff to relforge, running OS from docker images or whatever
[17:46:11] so long as we created said docker image, that is ;)
[17:47:12] hmm, yea that might work too. I'll continue going over puppet then to figure out what needs to be configured/unconfigured for the switched role
[18:21:47] * ebernhardson wonders if TLS is going to be an issue
[18:25:09] You can disable TLS in the deb installs, unlike the helm chart
[18:25:26] oh nice! ya might just disable security i suppose
[18:31:50] q: does anyone have a quick pointer or explainer for how the section pointers are identified and when they show up in Special:Search results?
Example: https://en.wikipedia.org/wiki/Special:Search?prefix=Wikipedia%3ATeahouse%2FQuestions%2FArchive&search=how+do+I+add+an+infobox%3F
[18:31:58] or same example via API: https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=how+do+I+add+an+infobox%3F+prefix:%22Wikipedia:Teahouse/Questions/Archive%22&format=json&srprop=sectiontitle
[18:32:16] isaacj: those are provided by the ParserOutput as the list of headers iirc
[18:32:25] or we might extract from html, double checking
[18:34:02] https://www.irccloud.com/pastebin/ygcPR1sV
[18:34:14] isaacj: yea it comes from the parser output, see https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/includes/Content/WikiTextStructure.php#82
[18:35:42] Hello! I’m Nik Gkountas, senior software engineer for the Language and Product Localization team. I have a question regarding wikimedia search, that you may be able to help me with.
[18:35:42] I have noticed some unexpected behaviour regarding the "articletopic" search keyword and "query" Action API endpoint. More specifically, for the "Liquor" page in en wiki (pageid 1318497) I can see that it has the "classification.prediction.articletopic/Culture.Food_and_drink|977" weighted tag (https://en.wikipedia.org/w/api.php?action=query&format=json&prop=cirrusdoc&titles=Liquor&formatversion=2).
[18:35:42] However, a request to the query endpoint with articletopic and pageid filters set to food-and-drink and 1318497 respectively, returns no results (request URL:
[18:35:42] https://en.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=langlinks%7Clanglinkscount%7Cpageprops&lllimit=max&lllang=el&generator=search&gsrprop=size&gsrnamespace=0&gsrlimit=max&ppprop=wikibase_item%7Cdisambiguation&gsrqiprofile=classic_noboostlinks&gsrsearch=articletopic:food-and-drink%20pageid:1318497).
[18:35:43] Could you please help me understand why the "Liquor" result is not returned in this case?
I have successfully combined "articletopic" and "pageid" search keywords in other cases, but this one seems weird. Thank you!
[18:36:05] nikG__: sure i can look, sec
[18:39:40] nikG__: so the curious thing is that usually these have spaces, but somehow the indexed food-and-drink tag on that page has underscores. Could you file a ticket?
[18:39:41] ebernhardson: thanks! do you have a sense of what triggers their inclusion? just if there's explicit keyword overlap with a section title or is there some section-level ranking happening on these pages when the page is sufficiently large? because I don't see it in every case
[18:40:25] isaacj: comes from the search highlighter, it's essentially a partial lexical match after some normalization
[18:40:51] isaacj: no section level ranking, annoyingly the section link and the highlighted text have no relationship
[18:42:27] yeah, that's fair. i figured it'd be pretty expensive to do the actual ranking on the fly. for the record it's a very useful feature in these archive pages which have tons of non-related sections :)
[18:44:00] we've been pondering if we should deal with that, some upcoming work on the shape of the text content in the search index makes it possible for us to tie the section link/highlighted text together, but will be months to update the search indexes before it can be applied.
[18:47:28] yeah, i've been playing with some natural-language search for these help/policy pages because a lot of them are a beast to wade through/discover because of how diverse+large+technical they can be. and while I think the natural-language search is quite helpful, I assume most of the improvements I see anecdotally are largely just a function of indexing by section as opposed to by the full page. I have a simple tool I built for some
[18:47:28] side-by-side comparisons but I need to add the section links to the keyword-Search side to more fairly represent current functionality.
example: https://wiki-topic.toolforge.org/search-help?query=can+I+add+a+link+to+twitter%3F
[18:57:36] (just added section links)
[19:16:49] indexing by section is a hard ask though, at least within opensearch it's a complete up-ending of both the UI and the index structure.
[19:42:39] ebernhardson CR for fixing the GPG key if you wanna take a look https://gerrit.wikimedia.org/r/c/operations/puppet/+/1207195
[19:45:16] inflatador: sure, looks reasonable
[19:45:35] * ebernhardson just learned you can shift click and drag down to zoom in the grafana y-axis
[19:53:56] damn, it's still complaining: `reprepro --component thirdparty/opensearch3 checkupdate bookworm-wikimedia
[19:53:56] Error: unknown key '39D319879310D3FC'!`
[19:54:09] inflatador: i think this will do what we need with relforge, if we could reimage that would probably be best but could probably live with just applying the role: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1207930
[19:54:09] I guess I'll double-check the opensearch2 repo key too
[20:00:02] 👀
[20:01:44] crap, I have a medical appointment. Back in 90m...my main ? is, 1) will this take away your SSH access and 2) are you OK w/that?
[20:11:21] inflatador: it should maintain ssh access, at least that's my intent :) The bit of hieradata that says `profile::admin::groups: elasticsearch-roots` should take care of it
[20:44:20] something's up with articletopic tags. They should always have spaces, but https://en.wikipedia.org/wiki/Liquor?action=cirrusdump has classification.prediction.articletopic/Culture.Food_and_drink|977 with underscores, which doesn't end up matching
[21:04:56] apparently not new. the 20250921 snapshot has ~36M pages with tags that have underscores.
But none of the items in the ArticleTopicQuery mapping include underscores
[21:11:38] apparently this has been going on for a very, very long time without being noticed :S Apparently we aren't pruning the old cirrus_index_without_content dumps, pulling 20241103 there were still 27M pages with tags containing underscores
[21:11:50] or i'm mixing something up :P
[21:14:38] oldest dump we have is 20230521. And it has 6M with underscores :S
[21:30:50] hmm, so 100% the data coming over events has underscores, so it's not like we accidentally mutated it somewhere. What's the fix? We issue term queries so it's not fixable via analysis chains. I guess we could query both, and add a normalization in the updater so eventually it doesn't need both
[21:36:18] back
[21:37:00] ryankemper ^^ just a heads-up that we'll be reimaging relforge to a test role shortly
[21:37:25] Ack!
[21:45:33] ebernhardson ryankemper I merged the puppet patch and ran it on relforge1008 w/out errors. Looks like we won't have to reimage after all ;)
[21:45:56] inflatador: if it's not a lot of trouble, it would probably be easier to start from a clean slate. But i can work with this
[21:47:03] ebernhardson oh yeah, I'm fine w/reimaging if you prefer. I thought you didn't want to wait. But if you're cool w/a reimage I'll kick 'em off
[21:47:22] I think it would be good to get on trixie or at least bookworm anyway
[21:47:28] inflatador: i mean we are trying to be a bit faster than january, but an extra day or two is no problem :)
[21:48:01] sure
[21:54:25] inflatador: out w dog now while there’s a slowdown in the rain, should be back 2:30
[21:56:14] ryankemper ACK, I should be around
[22:28:31] inflatador: 1'
[22:44:15] OK, relforge reimages are in-flight. Should be ready in ~45m assuming everything goes right
[22:44:18] thanks!
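Aside on the articletopic mismatch: the fix floated at 21:30:50 (query both forms for now, normalize in the updater so the second form eventually becomes unnecessary) could look roughly like the sketch below. This is not CirrusSearch code. The function names are hypothetical, the `weighted_tags` field name is an assumption, and the tag shape `<prefix>/<name>|<weight>` is taken only from the Liquor example in the log.

```python
from __future__ import annotations


def normalize_weighted_tag(tag: str) -> str:
    """Return the tag with underscores in its name replaced by spaces.

    Assumed shape: "<prefix>/<name>|<weight>", where "|<weight>" is optional.
    Only the name part is touched; prefix and weight pass through unchanged.
    """
    prefix, slash, rest = tag.partition("/")
    if not slash:
        # No "/" present: not a prefixed tag, so leave it alone.
        return tag
    name, pipe, weight = rest.partition("|")
    return f"{prefix}/{name.replace('_', ' ')}{pipe}{weight}"


def tag_variants(name: str) -> list[str]:
    """Return the spaced form first, then the underscored form, de-duplicated."""
    spaced = name.replace("_", " ")
    underscored = name.replace(" ", "_")
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys([spaced, underscored]))


def both_forms_query(prefix: str, name: str, field: str = "weighted_tags") -> dict:
    """Build a terms-query body matching either spelling of the tag.

    The default field name is an assumption made for illustration.
    """
    return {"terms": {field: [f"{prefix}/{v}" for v in tag_variants(name)]}}
```

Running `normalize_weighted_tag` on the tag from the Liquor page rewrites `Culture.Food_and_drink` to `Culture.Food and drink` and keeps the `|977` weight. The query-side helper is the stopgap half: once every document has passed through the normalizing updater, the underscored variant could be dropped.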
[22:53:57] opensearch 3x and 2x packages are now published in our repos
[23:26:06] 8 and 9 failed reimage :(
[23:26:28] or maybe it just takes a few tries
[23:38:20] yeah, I deliberately stopped the 1st three when I realized we need bookworm, not trixie. The other ones...I dunno
[23:41:18] OK, 1010 is up but I'm not sure the other ones will be done today. ryankemper you might wanna try again right before you leave but I'm not optimistic. We may need some I/F help ;)
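Aside on the section-title question from 18:31:58: the API request shared there can be rebuilt programmatically, which may help when comparing many queries side by side. A minimal sketch follows. The parameter names and values are the ones in the URL from the log; the function name is hypothetical, and nothing here sends a request.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def section_title_search_url(query: str, title_prefix: str) -> str:
    """Build a list=search URL that asks for the matching section title.

    The prefix: keyword is appended to the search string, quoted,
    in the same way as the example URL from the log.
    """
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f'{query} prefix:"{title_prefix}"',
        "format": "json",
        "srprop": "sectiontitle",
    }
    # urlencode percent-encodes reserved characters and turns spaces into "+".
    return f"{API_ENDPOINT}?{urlencode(params)}"


url = section_title_search_url(
    "how do I add an infobox?",
    "Wikipedia:Teahouse/Questions/Archive",
)
print(url)
```

The encoded URL escapes `:` and `/` inside `srsearch`, so it will not be byte-identical to the hand-written one, but it should decode to the same parameters.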