[10:13:27] lunch [13:11:33] \o [13:16:47] .o/ [13:18:28] o/ [13:32:28] dcausse: i was working up the bit about loosening types to Collection, but it makes me wonder. Should the RegExpInfo still require an ImmutableList? It can be a Collection, and we could then return LinkedHashSet or something where appropriate, but i wonder about communicating intent [13:33:55] i suppose the thought is Collection might be more efficient, but ImmutableList would make it clear we expect them to not be mutated [13:33:57] ebernhardson: it's where I did not get the need for LinkedHashSet, in what case do you actually need to keep the order? [13:34:38] the type does not have to be "ImmutableList" imo, it's just that it's constructed to be immutable, if it gets mutated this should be covered by the tests [13:37:12] hmm, i suppose it could probably be a plain HashSet instead of linked, the purpose is deduplicating. [13:38:33] ok I think this is why I was to use Collection so that you can feed List<> or Set<> [13:38:43] s/was/was suggesting/ [13:39:00] i'll have to double check how the sorting bits work, but on quick review it looks like in the source algo sorting was primarily about them not using a real set implementation and needing a dedup [13:42:50] if the purpose is extracting trigrams that will end up in a boolean expression I don't think the order can matter? but I could be completely wrong... [13:43:05] no you're right, for the output the order doesn't really atter [14:12:32] some activity on https://phabricator.wikimedia.org/T239950 i suspect it isn't too hard to do, but we have to pin down what exactly is to be done [14:12:42] (ticket from '19) [14:21:14] dcausse: thanks for getting the dwim click data so quickly! The much bigger skew is very interesting... maybe dwim will make hiwiki a little more usable. [14:21:47] Trey314159: indeed! [14:22:20] ebernhardson: yes just discovered recently we index all statements even deprecated ones and was a bit surprised [14:23:26] i'll move it back to the inbox [14:24:45] sure [14:36:52] pulled the sorting out and ran 50k rounds of the soundness test, it at least looks like should work [16:01:00] dcausse: ebernhardson: gonna make another small fix in the translate plugin, because ttmserver-export.php is still not working, https://phabricator.wikimedia.org/P92612 [16:01:07] hiii [16:02:05] atsukoito: hi! [16:04:35] mw-config has replicas => 1, that's odd [16:05:06] yeah, it has this number for all the clusters as well [16:05:25] has been like that since 2014 [16:05:34] 1 is a bit dangerous [16:05:42] should be 2 I think [16:06:43] has to be '2' i think? [16:06:49] yes [16:06:51] as a string, although i haven't fully traced it [16:07:15] prod has 2 replicas so we must have changed that manually on the live index [16:07:34] tl;dr getReplicaCount does config['replicas'] ?? '0-2' [16:08:08] so it's fine to drop 'replicas' entirely or set it to to '2' [16:08:47] dcausse: and the field seems to be used only here, so the config change would be enough [16:08:53] also wondering if bumping the number of shards might not help with translate recurring issues... but that's completely unrelated [16:09:04] atsukoito: yes [16:10:25] atsukoito: unrelated to the replica count but ultimately you should not use the discovery endpoint for writing, the "writable" translate services must point to the actual endpoints [16:13:57] hmm, i broke the test `(aa|ab|ac|ad|ae|af|ag|ah|ai|aj|ak|al|am|an|ao|ap|aq|ar|as|at|au)bc` -> abc...but mostly there :) [16:13:58] very similar to what's done now, the default non-writable hits the dnsdisc endpoints and the writable ones 'eqiad' and 'codfw' hit their direct service endpoint [16:14:12] :) [16:44:06] dcausse: thanks, I'll add dnsdisc and direct endpoints [17:08:32] atsukoito I believe the operator-created clusters default to 1x replica per shard, we want 2. But there's no way to set it in the charts, that is a design choice by OpenSearch. It's something we need to be able to solve at scale, ref T424860 among other tasks [17:08:33] T424860: Consider managing OpenSearch cluster dynamic settings with terraform - https://phabricator.wikimedia.org/T424860 [17:29:13] I'm undecided with how haswbstatement should handle *, it's currently only valid at then end of the query to enable a prefix query, should I simply run wildcard query when * is found somewhere or continue with some special cases and allow haswbstatement:P735=Q12344159[P1545=*] to run a prefix query... [17:29:28] hmm [17:29:51] both are going to have a problem with term expansion? Although hopefullly this use case wont run it it [17:31:10] yes... it's definitely far from ideal but not sure we want to have a real prefix index on this field [17:32:32] i suppose i don't have a strong feeling. My suspicion is wildcard is probably enough. A quick google suggests the limit is the BooleanQuery max clauses but not 100% sure [17:33:25] i suppose one worry about wildcard is allowing "too much", we wouldn't want users issuing haswbstatement:P7* [17:33:30] yes both prefix and wilcard will have the same limit, just wondering if the term expansion is much more optimized on the prefix query than the wildcard [17:34:16] they'll hit the 1024 term limit silently [17:34:41] it's also not working at the moment so I suspect no one is relying on this [17:34:52] you have to lowercase manually your query to make it work [17:34:59] hmm [17:35:00] https://www.wikidata.org/w/index.php?search=haswbstatement%3Ap735%3Dq123441*&title=Special%3ASearch&profile=advanced&fulltext=1&ns0=1 [17:36:30] oh i hadn't thought about how prefix skips analysis, certainly a bit awkward although maybe in this usecase it can work [17:37:24] I have a patch to switch to a normalizer that should make it work as expected [17:37:34] but the term expansion problem remains [17:37:36] i suppose i don't know how to estimate the cost of prefix vs wildcard. Naievly i would probably try and generate some of the more-expensive variants and run them on prod clusters to get an idea [17:37:51] sure [17:40:51] dcausse: it looks like if we swap rewrite to constant_score_boolean instead of top_terms_1024 we get a fail instead of a response. not sure if that's really better though [17:41:46] yes, was tempted to add a warning everytime a prefix/wildcard is used but that seems odd to have a feature that always emits a warning :/ [17:42:36] can we use the "degraded query" functionality to re-run the query after a fail and include a warning? [17:42:49] i don't remember what that really does, we haven't touched it in a long long time [17:43:10] it should fail pretty fast since it's in the query building faze, although it might be per-shard [17:44:52] yes... I'll take a look, this degraded query mode is not something I've looked in a while :) [17:57:37] there's constant_score it has no limit, does not seem to collect all terms in one go (does it on a per segment basis) [17:58:14] probably very slow on pathological wildcards or highly ambiguous prefixes [17:58:46] will test a bit [17:59:21] dinner [18:53:19] surprised to say, claude is also reasonable at finding minor perf issues. Found a few unnecessary allocations, things that can be short-circuited (if (a.isEmpty) { return b } style things), etc. [20:42:55] * ebernhardson notes while checking for x-forwarded-for logs after the deployment, that someone is pasting nginx configs into the search box :P [20:43:18] oddly, not seeing the x-forwarded-for added to testwiki logs :S I would have expected them for most things [21:00:02] will double check in a bit, see what i find in the hourly logs in hadoop [21:55:53] yup they are in the hadoop logs, running out of time today but we should have data to look at tomorrow