[10:16:28] lunch
[14:01:50] \o
[14:26:45] o/
[15:05:56] dcausse would you like me to take T406656 and do reloads on the categories graph, or would you prefer to keep troubleshooting?
[15:05:56] T406656: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656
[15:07:05] surprising to me...take the last 100k events from update_pipeline.update.v1, `grep -v TAGS_UPDATE`, how many should be left? I have 2
[15:07:06] inflatador: o/ I'm done troubleshooting, if you could fix the puppet discrepancies and do a reload, that would be great :)
[15:08:10] ebernhardson: this is totally unexpected? did you scan the codfw topic?
[15:08:41] eqiad topic should be almost empty
[15:08:48] dcausse: oh!! eqiad has 2, codfw has 99,390.
[15:08:51] hmm
[15:09:22] oldest event in eqiad is 2025-10-17T00:14
[15:09:36] sorry, newest.
[15:09:39] well, both
[15:10:07] hmm, i don't know if this is related to the existing weighted_tags problem...but maybe another problem :P I guess some batch update shipped through eqiad?
[15:10:09] mw primary db is in codfw so all writes should happen from mw@codfw
[15:10:39] yes, I think we forcibly push weighted_tags for image_rec to kafka-main@eqiad
[15:11:03] should happen weekly on fridays
[15:11:03] ahh, yea that would make sense
[15:11:14] and the 17th was a friday
[15:11:19] should be it
[15:13:13] hm.. the truncate filter we use to fix extremely long keyword fields can't be used in a normalizer apparently... an option would be to use ignore_above but that might throw away some data rather than keeping part of it...
[15:13:39] hmm, it ignores the whole thing?
[15:13:56] yea... "If a string’s length exceeds the specified threshold, the value is stored with the document but is not indexed."
[15:14:25] well I could check what the impact would be
[15:14:50] i guess a pattern_replace char filter? But stuffing regex everywhere seems meh
[15:15:26] actually I need to understand what they use to determine whether a filter qualifies for use in a normalizer or not, I'm not super clear on this
[15:16:19] initially I thought they forced you to use char_filters but it looks like you can use a token_filter as you said, just not all of them apparently :/
[15:19:00] random claim: a normalizer accepts only filters that are instances of either NormalizingTokenFilterFactory or NormalizingCharFilterFactory
[15:21:35] sigh...
[15:22:43] as far as i can tell...not having an equivalent truncate is perhaps an oversight...if we don't like pattern_replace i suppose a custom normalizing char filter could work, but i dunno if it's necessary
[15:23:35] ah cool, IcuFoldingTokenFilterFactory is a NormalizingTokenFilterFactory; that was the one i was mainly afraid of
[15:24:34] I think I can live with a pattern_replace, but could add a quick custom truncate or upstream a change
[15:28:36] actually IIRC we have a fix from Trey for the regex highlighter to deploy, might be a good time to add that custom truncate?
[15:31:00] yea, there is already a .deb update waiting in the wings, we can add to it
[15:31:15] i don't think it's made it to apt.wikimedia.org yet, just sitting in gitlab CI
[15:32:24] ok, pushing a patch hopefully won't take long
[15:35:14] ah, and the "length" filter we use to remove empty tokens will have to be supported as well
[15:35:59] dcausse ACK, have taken the ticket
[15:36:07] thanks!
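(A minimal sketch of the pattern_replace workaround discussed above, assuming a local OpenSearch on :9200; the index name, field name, and 5,000-character cap are made-up placeholders, not the production config or the eventual patch.)

```python
# Sketch: emulate `truncate` inside a normalizer with a pattern_replace
# char filter, since the truncate token filter isn't accepted there.
# Endpoint, index name, field name, and the 5000-char cap are all
# hypothetical placeholders.
import requests

settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "truncate_5000": {
                    "type": "pattern_replace",
                    # (?s) so '.' also matches newlines; keep the first
                    # 5000 chars and drop the rest. Shorter values don't
                    # match the pattern and pass through unchanged.
                    "pattern": "(?s)^(.{5000}).*$",
                    "replacement": "$1",
                }
            },
            "normalizer": {
                "keyword_trunc": {
                    "type": "custom",
                    "char_filter": ["truncate_5000"],
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "some_keyword_field": {
                "type": "keyword",
                "normalizer": "keyword_trunc",
            }
        }
    },
}

requests.put("http://localhost:9200/trunc-test", json=settings).raise_for_status()
```

Unlike ignore_above, this keeps the first part of an oversized value indexed instead of dropping the whole thing.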
[17:43:58] dinner
[20:08:39] I'm trying to load some test data into the new opensearch-on-k8s instance, specifically enwikibooks. I was wondering, can I just create the index with the same mapping as the live enwikibooks, or is there stuff I need to add/remove first?
[20:10:25] sigh, finally...spent the last two hours trying to figure out why jsondiff.py was blowing up...and the answer is: it doesn't know about namespaces and dislikes getting the same title from multiple namespaces
[20:10:47] on the upside, refactored parts of it so it's easier to test :P
[20:17:52] that's annoying. Although I will admit, I didn't know json did namespacing
[20:19:09] it's not json namespacing, jsondiff is comparing search results from two json dumps. the results themselves can have either the title or the page id as the primary key; if it's the title, then two pages from different namespaces look the same
[20:19:19] (we can fix that by using the prefixed title, it just doesn't happen to do that)
[20:37:59] I got the mapping, asking ChatGPT to remove all the non-standard analyzers isn't too bad ;)
[23:10:43] First data in opensearch on k8s! https://phabricator.wikimedia.org/T404907#11296571
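(A sketch of the prefixed-title fix mentioned at 20:19:19, assuming a hypothetical result shape with `namespace` and `title` fields rather than jsondiff.py's actual schema: keying results by the namespace-prefixed title keeps identical titles from different namespaces distinct.)

```python
# Sketch of the prefixed-title fix: key search results by
# "<namespace>:<title>" instead of the bare title, so identical titles
# in different namespaces no longer collide. The result dict shape
# (namespace/title fields) is an assumption, not jsondiff.py's schema.
def result_key(result: dict) -> str:
    ns = result.get("namespace", 0)
    title = result["title"]
    # ns 0 (main namespace) conventionally has no prefix.
    return title if ns == 0 else f"{ns}:{title}"

def index_results(results: list[dict]) -> dict[str, dict]:
    indexed = {}
    for r in results:
        key = result_key(r)
        if key in indexed:
            raise ValueError(f"duplicate key {key!r}")
        indexed[key] = r
    return indexed
```

Dumps keyed by page id don't need this, since page ids are already unique across namespaces.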