[10:13:27] <dcausse>	 lunch
[13:11:33] <ebernhardson>	 \o
[13:16:47] <inflatador>	 .o/
[13:18:28] <dcausse>	 o/
[13:32:28] <ebernhardson>	 dcausse: i was working up the bit about loosening types to Collection, but it makes me wonder. Should the RegExpInfo still require an ImmutableList?  It can be a Collection, and we could then return LinkedHashSet or something where appropriate, but i wonder about communicating intent
[13:33:55] <ebernhardson>	 i suppose the thought is Collection might be more efficient, but ImmutableList would make it clear we expect them to not be mutated
[13:33:57] <dcausse>	 ebernhardson: it's where I did not get the need for LinkedHashSet, in what case do you actually need to keep the order?
[13:34:38] <dcausse>	 the type does not have to be "ImmutableList" imo, it's just that it's constructed to be immutable, if it gets mutated this should be covered by the tests
[13:37:12] <ebernhardson>	 hmm, i suppose it could probably be a plain HashSet instead of linked, the purpose is deduplicating. 
[13:38:33] <dcausse>	 ok I think this is why I was to use Collection<String> so that you can feed List<> or Set<>
[13:38:43] <dcausse>	 s/was/was suggesting/
[13:39:00] <ebernhardson>	 i'll have to double check how the sorting bits work, but on quick review it looks like in the source algo sorting was primarily about them not using a real set implementation and needing a dedup
[13:42:50] <dcausse>	 if the purpose is extracting trigrams that will end up in a boolean expression I don't think the order can matter? but I could be completely wrong...
[13:43:05] <ebernhardson>	 no you're right, for the output the order doesn't really atter
[14:12:32] <ebernhardson>	 some activity on https://phabricator.wikimedia.org/T239950 i suspect it isn't too hard to do, but we have to pin down what exactly is to be done
[14:12:42] <ebernhardson>	 (ticket from '19)
[14:21:14] <Trey314159>	 dcausse: thanks for getting the dwim click data so quickly! The much bigger skew is very interesting... maybe dwim will make hiwiki a little more usable.
[14:21:47] <dcausse>	 Trey314159: indeed!
[14:22:20] <dcausse>	 ebernhardson: yes just discovered recently we index all statements even deprecated ones and was a bit surprised
[14:23:26] <ebernhardson>	 i'll move it back to the inbox
[14:24:45] <dcausse>	 sure
[14:36:52] <ebernhardson>	 pulled the sorting out and ran 50k rounds of the soundness test, it at least looks like should work
[16:01:00] <atsukoito>	 dcausse: ebernhardson: gonna make another small fix in the translate plugin, because ttmserver-export.php is still not working, https://phabricator.wikimedia.org/P92612
[16:01:07] <atsukoito>	 hiii
[16:02:05] <dcausse>	 atsukoito: hi!
[16:04:35] <dcausse>	 mw-config has replicas => 1, that's odd
[16:05:06] <atsukoito>	 yeah, it has this number for all the clusters as well
[16:05:25] <dcausse>	 has been like that since 2014
[16:05:34] <dcausse>	 1 is a bit dangerous
[16:05:42] <dcausse>	 should be 2 I think
[16:06:43] <ebernhardson>	 has to be '2' i think?
[16:06:49] <dcausse>	 yes
[16:06:51] <ebernhardson>	 as a string, although i haven't fully traced it
[16:07:15] <dcausse>	 prod has 2 replicas so we must have changed that manually on the live index
[16:07:34] <atsukoito>	 tl;dr getReplicaCount does config['replicas'] ?? '0-2'
[16:08:08] <dcausse>	 so it's fine to drop 'replicas' entirely or set it to to '2'
[16:08:47] <atsukoito>	 dcausse:  and the field seems to be used only here, so the config change would be enough
[16:08:53] <dcausse>	 also wondering if bumping the number of shards might not help with translate recurring issues... but that's completely unrelated
[16:09:04] <dcausse>	 atsukoito: yes
[16:10:25] <dcausse>	 atsukoito: unrelated to the replica count but ultimately you should not use the discovery endpoint for writing, the "writable" translate services must point to the actual endpoints
[16:13:57] <ebernhardson>	 hmm, i broke the test `(aa|ab|ac|ad|ae|af|ag|ah|ai|aj|ak|al|am|an|ao|ap|aq|ar|as|at|au)bc` -> abc...but mostly there :)
[16:13:58] <dcausse>	 very similar to what's done now, the default non-writable hits the dnsdisc endpoints and the writable ones 'eqiad' and 'codfw' hit their direct service endpoint
[16:14:12] <dcausse>	 :)
[16:44:06] <atsukoito>	 dcausse: thanks, I'll add dnsdisc and direct endpoints
[17:08:32] <inflatador>	 atsukoito I believe the operator-created clusters default to 1x replica per shard, we want 2. But there's no way to set it in the charts, that is a design choice by OpenSearch. It's something we need to be able to solve at scale, ref T424860 among other tasks
[17:08:33] <stashbot>	 T424860: Consider managing OpenSearch cluster dynamic settings with terraform - https://phabricator.wikimedia.org/T424860
[17:29:13] <dcausse>	 I'm undecided with how haswbstatement should handle *, it's currently only valid at then end of the query to enable a prefix query, should I simply run wildcard query when * is found somewhere or continue with some special cases and allow haswbstatement:P735=Q12344159[P1545=*] to run a prefix query...
[17:29:28] <ebernhardson>	 hmm
[17:29:51] <ebernhardson>	 both are going to have a problem with term expansion? Although hopefullly this use case wont run it it
[17:31:10] <dcausse>	 yes... it's definitely far from ideal but not sure we want to have a real prefix index on this field
[17:32:32] <ebernhardson>	 i suppose i don't have a strong feeling. My suspicion is wildcard is probably enough. A quick google suggests the limit is the BooleanQuery max clauses but not 100% sure
[17:33:25] <ebernhardson>	 i suppose one worry about wildcard is allowing "too much",  we wouldn't want users issuing haswbstatement:P7* 
[17:33:30] <dcausse>	 yes both prefix and wilcard will have the same limit, just wondering if the term expansion is much more optimized on the prefix query than the wildcard
[17:34:16] <dcausse>	 they'll hit the 1024 term limit silently
[17:34:41] <dcausse>	 it's also not working at the moment so I suspect no one is relying on this
[17:34:52] <dcausse>	 you have to lowercase manually your query to make it work
[17:34:59] <ebernhardson>	 hmm
[17:35:00] <dcausse>	 https://www.wikidata.org/w/index.php?search=haswbstatement%3Ap735%3Dq123441*&title=Special%3ASearch&profile=advanced&fulltext=1&ns0=1
[17:36:30] <ebernhardson>	 oh i hadn't thought about how prefix skips analysis,  certainly a bit awkward although maybe in this usecase it can work
[17:37:24] <dcausse>	 I have a patch to switch to a normalizer that should make it work as expected
[17:37:34] <dcausse>	 but the term expansion problem remains
[17:37:36] <ebernhardson>	 i suppose i don't know how to estimate the cost of prefix vs wildcard. Naievly i would probably try and generate some of the more-expensive variants and run them on prod clusters to get an idea
[17:37:51] <dcausse>	 sure
[17:40:51] <ebernhardson>	 dcausse: it looks like if we swap rewrite to constant_score_boolean instead of top_terms_1024 we get a fail instead of a response.  not sure if that's really better though
[17:41:46] <dcausse>	 yes, was tempted to add a warning everytime a prefix/wildcard is used but that seems odd to have a feature that always emits a warning :/
[17:42:36] <ebernhardson>	 can we use the "degraded query" functionality to re-run the query after a fail and include a warning?
[17:42:49] <ebernhardson>	 i don't remember what that really does, we haven't touched it in a long long time
[17:43:10] <ebernhardson>	 it should fail pretty fast since it's in the query building faze, although it might be per-shard
[17:44:52] <dcausse>	 yes... I'll take a look, this degraded query mode is not something I've looked in a while :)
[17:57:37] <dcausse>	 there's constant_score it has no limit, does not seem to collect all terms in one go (does it on a per segment basis)
[17:58:14] <dcausse>	 probably very slow on pathological wildcards or highly ambiguous prefixes
[17:58:46] <dcausse>	 will test a bit
[17:59:21] <dcausse>	 dinner
[18:53:19] <ebernhardson>	 surprised to say, claude is also reasonable at finding minor perf issues. Found a few unnecessary allocations, things that can be short-circuited (if (a.isEmpty) { return b } style things), etc.
[20:42:55] * ebernhardson notes while checking for x-forwarded-for logs after the deployment, that someone is pasting nginx configs into the search box :P
[20:43:18] <ebernhardson>	 oddly, not seeing the x-forwarded-for added to testwiki logs :S  I would have expected them for most things
[21:00:02] <ebernhardson>	 will double check in a bit, see what i find in the hourly logs in hadoop
[21:55:53] <ebernhardson>	 yup they are in the hadoop logs, running out of time today but we should have data to look at tomorrow