[08:00:22] pfischer: since you're out today, I'll send the weekly status update
[08:09:42] I'm also closing a bunch of tasks in our previous milestone that were put in the Done and then Reported column, but the status was still "open"
[10:20:27] lunch
[13:56:40] We should keep an eye on T406276
[13:56:41] T406276: Conduct evaluation of Semantic Search and Q&A prototype output - https://phabricator.wikimedia.org/T406276
[14:21:55] \o
[14:28:48] o/
[14:58:56] suppose i haven't tried in a while, but looks like we can now ask for yarn instances with >40g memory (iirc it used to limit at 32? but maybe my memory is bad)
[15:27:48] meh, need to adjust the naming... model names are currently {wiki}-{date}-{source_feature_set}, and i didn't change any of those, so upload fails as duplicate
[15:35:08] still terrible :( https://en.wikipedia.org/w/index.php?search=what+event+triggered+WW2%3F&title=Special%3ASearch&profile=advanced&fulltext=1&ns0=1&cirrusFTQBProfile=perfield_builder_relaxed&cirrusMLRModel=enwiki-20250918-hard_negatives
[15:35:26] sigh
[15:36:26] the retrieval query is particularly bad too: https://en.wikipedia.org/w/index.php?search=what+event+triggered+WW2%3F&title=Special%3ASearch&profile=advanced&fulltext=1&ns0=1&cirrusFTQBProfile=perfield_builder_relaxed&cirrusRescoreProfile=empty
[15:36:41] wondering if WW2 related pages are even in the mlr rescore window
[15:37:53] hmm, yea not sure
[15:39:12] WWII -> "world war 2" makes a huge difference
[15:39:29] mostly title matches for "world war" pushing it up i imagine
[15:39:45] yes...
[15:39:48] oh, wait, silly me... changing things changed the profile params too :P
[15:41:00] but yea... it's focusing too much on "triggered" and not on "ww2", i suppose "what event triggered ww2?" is also 4 tokens, so has to match 3 of them
[15:41:12] "world war 2" expands the token count and gets a 50% match
[15:42:16] you have to go to offset 5000 to see https://en.wikipedia.org/wiki/World_War_II
[15:42:24] !!
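(The "4 tokens, so has to match 3 of them" arithmetic above is a percentage-style minimum_should_match at work. A minimal sketch of that rounding, assuming a 75% setting purely for illustration; the actual profile's value isn't stated in the chat, and the 50% match being accepted for the expanded query suggests the real query structure differs:)

```python
import math

def min_required_matches(n_tokens, pct=0.75):
    """How many query tokens must match under a percentage-style
    minimum_should_match. Elasticsearch rounds a positive percentage
    down to the nearest integer. The 0.75 default here is an assumed
    value to reproduce the '4 tokens -> match 3' arithmetic, not the
    actual profile setting."""
    return math.floor(n_tokens * pct)

# 4 tokens ("what event triggered ww2") -> 3 required
# 6 tokens ("what event triggered world war 2") -> 4 required
```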
[15:42:39] we rescore like 450 per shard
[15:42:45] so that's out of the window
[15:42:53] i guess we do the popularity rescore first though
[15:42:55] yes that one is not even seen
[15:43:07] and haven't seen the "Causes of WW2" yet :(
[15:43:57] looks like we rescore the top ~57k with popularity/incoming links, and then 3.1k make the MLR rescore
[15:45:10] the retrieval query puts way too much weight on the title I think
[15:45:44] will spend some time on this next week
[15:45:50] but we don't have great tools for this :(
[15:46:15] thanks! yea relforge doesn't do quite what we need for that, we almost need to be able to sample a bunch of queries and hits, and find the position of the hits before each rescore
[15:46:21] s/hits/clicks/
[15:46:46] time to start the weekend. Have fun!
[15:46:50] see where our docs even end up, and how they move as the profile changes
[15:46:52] have fun!
[15:47:16] yes...
[15:48:01] i also wonder sometimes if the 8k rescore window on popularity is too large, but that's kinda tangential. But it seems plausible the 56,000th result can't possibly get enough boost to make the top 3k
[15:48:01] perhaps we could get a list of query clicks where tokens(query) > 5 or so
[15:48:11] might help a bit to tune
[15:48:47] yea that should be possible, for counting tokens recently i wrote a small python udf that just hits the _analyze api and counts the tokens returned
[15:48:58] I mean tune in a way to get these clicked pages in a reasonable position in the first rescore stage
[15:48:59] can give it an index name and field name
[15:49:45] yea makes sense
[15:49:56] now that we can query elastic directly perhaps something doable
[15:50:25] but well.. we'd have to have a query template adjust the weights in a brute force manner :(
[15:50:54] I'm not even sure that the shape of the query is right at the moment :/
[15:51:07] well, that means plenty of opportunity :)
[15:51:44] yes true, but super hard to automate & discover better structure weights...
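(The "small python udf that just hits the _analyze api" above could look something like this; the function names and shape here are a sketch, not the real udf. It counts tokens the way the index's own analyzer does, given an index and field name:)

```python
import json
from urllib import request

def analyze_token_count(es_url, index, field, text):
    """Count tokens for `text` as the analyzer configured on
    index/field would produce them, via the _analyze API.
    `es_url` like "http://localhost:9200" (hypothetical endpoint)."""
    body = json.dumps({"field": field, "text": text}).encode("utf8")
    req = request.Request(
        "%s/%s/_analyze" % (es_url, index),
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return count_tokens(json.load(resp))

def count_tokens(analyze_response):
    # _analyze responds with {"tokens": [{"token": ..., "position": ...}, ...]}
    return len(analyze_response.get("tokens", []))
```

Splitting out `count_tokens` keeps the response parsing testable without a live cluster; wrapped as a UDF, `analyze_token_count` is what would run per query string to filter for tokens(query) > 5.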
might need a bit of manual testing
[15:55:36] how many queries/sec were you able to send in eqiad? just to get an idea if a brute force grid search could be possible
[15:55:53] dcausse: well, eqiad was idle so i was using 600-800 threads
[15:56:00] oh right
[15:56:12] the defaults in mjolnir msearch use about 40 threads per cluster
[15:56:39] we need an idle cluster all the time :)
[15:56:42] (i restarted the eqiad daemons with higher limits, then stopped the codfw daemons)
[15:57:06] and the limit seemed to be more on the search-loader side, it had all 4 cores at 80%+ usage
[15:57:17] ah, I thought you had shipped queries directly to the cluster?
[15:57:30] i did for part, but things like feature collection all run through the msearch
[15:57:32] daemon
[15:58:07] tbh, we can make a cluster mostly idle via etcd
[15:58:28] if it's not a big effect on end user latencies, we can move traffic for a few hours and hammer it with queries from hadoop
[15:58:52] sadly i think it still requires sre powers to do that though
[16:00:08] here I guess I'm not too interested in features but rather a kind of very simple f1 score, so I might ship queries directly, but yes, need to be careful if shipping directly from hadoop
[16:00:17] might test in relforge first :)
[16:00:30] probably reasonable, just a bit slower
[16:08:00] i suppose i'll try and get a better sample of how this is doing today, ought to be able to take some number (50?)
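(The brute-force grid search over weights with a "very simple" per-query score discussed above could be sketched like this; field names, ranges, and the choice of a recall-at-k style score are illustrative assumptions, not an agreed design:)

```python
import itertools

def weight_grid(weight_ranges):
    """Enumerate every combination of candidate weights for a brute-force
    grid search over query-profile weights. `weight_ranges` maps a weight
    name to the values to try, e.g. {"title": [0.2, 0.3], "text": [0.5]}."""
    names = sorted(weight_ranges)
    for values in itertools.product(*(weight_ranges[n] for n in names)):
        yield dict(zip(names, values))

def clicked_in_top_k(clicked_page, result_pages, k=20):
    """A deliberately simple per-query score: did the clicked page land
    in the top k results for this weight combination? Averaged over the
    sampled queries it gives a crude number to compare combinations."""
    return 1.0 if clicked_page in result_pages[:k] else 0.0
```

Each combination would then be rendered into the query template, shipped through msearch (or directly), and scored over the sampled click queries; with 600-800 threads against an idle cluster even a few hundred combinations times a few thousand queries seems feasible.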
of queries from the natural language queries bit and run them both ways, start figuring out how to get relforge going again
[16:08:10] we were talking about getting relforge working again anyways
[16:11:42] indeed
[16:12:00] if somehow you could also get some that get a click that might help I think
[16:12:45] hmm, i didn't track that in the original dataset but maybe i can join it against query_clicks and see what we have
[16:29:41] * ebernhardson is not certain we should continue with the bits where relforge ssh's around to configured places
[16:30:15] but i guess the problem there is having an appropriate cirrus instance, so maybe we have to
[16:31:01] yes... of if we assume we don't test crazy syntax perhaps some way to extract a template from cirrus might be nice?
[16:31:06] s/of/or
[16:31:46] yea maybe we simply expect everything to be a bag_of_words query, and coax a template out of cirrus for it
[16:32:34] yes, in an ideal world we'd ask cirrus to dump a kind of template with all the variables templated (weights included)
[16:33:46] indeed that would be nice
[16:34:12] i suppose in a way that's what the wbsearchentities explain-unifier does. But never made that work on cirrus queries
[16:34:35] might be tedious to write tho... when building the query in cirrus we don't much care about what could be variable and what's not
[16:34:56] yea and it's spread all over
[16:35:41] i'm thinking some way to export a template from cirrus to a file, so we can manually edit those templates, and a way to point at the template from the relforge runner
[16:35:49] yes
[16:36:11] sad we never got to make use of elastic search templates :(
[16:37:08] yea, our query building has always been quite complex, i wonder if the ast->query transformation could simply apply nodes to templates for most things, but for another day
[16:37:09] Is there a similar feature in OpenSearch?
[16:37:23] yes should be, search templates are an old feature
[16:39:42] heading out, have a nice weekend!
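(For context on the search templates mentioned above: both Elasticsearch and OpenSearch store them as mustache scripts and render them server-side with `_search/template`. A minimal sketch of what a cirrus-extracted bag_of_words template with weights as parameters might look like; the template id, fields, and weights are all hypothetical:)

```python
import json

# A hypothetical template a cirrus export might produce: the query string
# and per-field weights become mustache parameters. Registered server-side
# via PUT /_scripts/relforge_bag_of_words with this as the body.
SEARCH_TEMPLATE = {
    "script": {
        "lang": "mustache",
        "source": json.dumps({
            "query": {
                "multi_match": {
                    "query": "{{query_string}}",
                    "fields": ["title^{{title_weight}}", "text^{{text_weight}}"],
                }
            }
        }),
    }
}

def preview(params):
    """Naive local substitution of {{param}} placeholders to preview the
    rendered query. Illustration only; real rendering is done by the
    server's mustache engine, which handles escaping and conditionals."""
    source = SEARCH_TEMPLATE["script"]["source"]
    for key, value in params.items():
        source = source.replace("{{%s}}" % key, str(value))
    return json.loads(source)
```

A search would then POST `{"id": "relforge_bag_of_words", "params": {...}}` to `/{index}/_search/template`, which is what would let a grid search vary weights without touching cirrus code.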
[16:39:47] enjoy!
[16:53:34] hmm, random thought while writing up a basic plan for relforge, is that mjolnir already has a bit that does distributed msearch in hadoop. It takes a python function to template the queries. Should perhaps try to reuse bits
[17:05:03] maybe could also massively simplify the runner by storing the full http response, instead of trying to parse out specific details. That still happens later, but would mean we don't have to re-run queries when it does something wrong.
[17:05:32] like just make a parquet file with (run_id, query_string, total_hits, json_string) or something
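(A sketch of the row shape floated above: keep the whole response as a json string so a buggy parser can be fixed and re-run later without re-querying. Column names are just the ones from chat, not an agreed schema:)

```python
import json

def build_row(run_id, query_string, raw_response):
    """Flatten one search response into a row for the results file.
    Only total_hits is extracted up front; everything else stays in
    json_string for later parsing passes."""
    response = json.loads(raw_response)
    # ES/OpenSearch 7+ report total hits as {"value": N, "relation": ...};
    # older versions report a bare int. Handle both.
    total = response.get("hits", {}).get("total", 0)
    if isinstance(total, dict):
        total = total.get("value", 0)
    return {
        "run_id": run_id,
        "query_string": query_string,
        "total_hits": total,
        "json_string": raw_response,
    }
```

Rows like these could then be written out with e.g. `spark.createDataFrame(rows).write.parquet(...)` from the same hadoop job doing the distributed msearch.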