[13:55:03] \o
[13:55:16] :S all the reindexers are stuck "timed out waiting for the container to start"
[14:14:09] it seems we wait for up to 5 minutes for the pod to start, if it hasn't started then we get stuck :P
[16:02:48] I have a community-wishlister asking about ordering results by levenshtein distance from the beginning search string to the article title. I did some poking around and it looks like this would need to be a script_score function or a java plugin
[16:03:17] ... and those look very computationally complex
[16:04:16] cormacparle: hmm, yea my first guess is it could be pretty expensive.
[16:04:42] the difficulty is when the engine says `results 1-20 of 1,234,567` it had to run the score for the 1.2M results
[16:05:00] hmmm yeah
[16:05:13] https://www.baeldung.com/java-levenshtein-distance ... looks like the best solution is O(m*n) where m and n are the lengths of the strings being compared
[16:05:38] cormacparle: for titles, worst case is 255, average is probably much shorter
[16:06:03] although on the upside, as the prefix gets longer the number of matches gets smaller
[16:06:36] i dunno...if there wasn't too complicated of a way we could test by essentially running bad queries in prod and see what happens. That's how we evaluated things like random sorts
[16:08:03] I guess the question is ... is it worth the engineer time spent?
[16:08:56] I might just put the user off for now - fixing the fuzziness might be adequate for their needs
[16:09:54] cormacparle: i should also note those two things would be separate, fuzziness in the completion suggester works with pre-calculated scores. It doesn't do any live scoring
[16:10:13] yeah understood
[16:10:51] i suspect we are better off without a levenshtein sort, tbh
[16:11:19] completion kinda/sorta does that by issueing a fuzzy and not-fuzzy query, and discounting the results from the fuzzy side, but it's not exactly the same
[16:11:31] yeah - seems like a lot of work for very little reward
[16:11:40] I'll try and talk the wisher out of it
[16:48:44] Trey314159: i'm not sure what exactly yet...but something is wrong with the new analysis chains on wikidata.
[16:49:05] ebernhardson: what are you seeing?
[16:49:15] Trey314159: The reindex on wikidata fails, it blows up inside lucene while iterating a token stream, fails with:
[16:49:17] startOffset must
[16:49:23] be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=18,endOffset=19,lastStartOffset=19 for field 'labels.aj
[16:49:30] Trey314159: so i don't have a repro for you, but something in ja is busted :(
[16:49:42] it was doc id 20549268 in wikidata
[16:50:51] Ugh. I will look. I assume it is Sudachi.
[16:51:52] Trey314159: i think that would be the ja section here: https://www.wikidata.org/wiki/Q18991708?action=cirrusdump, either `ウィキメディアのテンプレート` or `Template:LSM(R)-501級ロケット中型揚陸艦`
[16:52:06] i guess the failure was in labels.ja, so the second one
[16:52:15] thanks
[17:03:16] of course..my attempts to recreate with the _analyze api are not getting failures :S
[17:05:23] same
[17:06:50] Trey314159: well, if i use the real index i get errors, but not using analysis components from the fixtures
[17:07:34] err, no
[17:17:06] Trey314159: curl https://cloudelastic.wikimedia.org:9243/wikidatawiki_content/_doc/20549268 | jq ._source | curl -XPUT -H 'Content-Type: application/json' https://cloudelastic.wikimedia.org:9243/wikidatawiki_content_1751294708/_doc/20549268 -d @-
[17:30:02] Trey314159: Q9429269
[17:30:15] Trey314159: Category:User P検準1級
[17:32:27] Trey314159: Category:LST-1級戦車揚陸艦
[19:52:06] have to take care of some things, heading out early today
[23:16:35] There's a bug-report about Search-Ahead not bringing up the expected result (with some but not all people able to reproduce it) at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#%22United_States%22_in_search_box
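
Note on the 16:05 cost estimate: the O(m*n) figure comes from the standard dynamic-programming table for Levenshtein distance, which fills one cell per pair of characters. The sketch below is only an illustration of that cost, not anything that runs in the search cluster; `levenshtein` and `rank_by_distance` are hypothetical names, and a real sort would have to live in a script_score function or plugin as discussed above. For two 255-character titles the inner loop runs 255 * 255 = 65,025 times, and that work would repeat for every matching document.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to limit memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def rank_by_distance(query: str, titles: list[str]) -> list[str]:
    """Hypothetical helper: order titles by distance to query, ties broken by title."""
    return sorted(titles, key=lambda title: (levenshtein(query, title), title))
```

Calling `rank_by_distance("cat", ["dog", "cart", "cat"])` puts the exact match first, then the one-edit title, then the unrelated one. Every candidate title costs one full table fill, which is why scoring 1.2M matches this way looked expensive.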