[06:00:41] o/
[06:35:48] o/
[07:06:43] pfischer: o/ when you have a moment could you take a look at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1183769 (to unblock CI)
[07:57:15] dcausse: sure, looking
[07:57:27] thx!
[08:39:15] dcausse: done. Is NonNormalizedAnnotation wmf-internal codesniffer magic? Code search does not bring up any implementation details.
[09:02:59] dcausse: 1:1?
[09:03:07] gehel: sorry, a bit late
[09:03:12] np
[09:30:15] pfischer: I think so? It's in https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/codesniffer/+/refs/heads/master/MediaWiki/Sniffs/Commenting/FunctionAnnotationsSniff.php#70
[09:52:44] pfischer: would you have time this afternoon to discuss https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1644? The more I look at what's already scheduled, the more I believe there's a fundamental problem here
[10:00:26] lunch
[12:01:46] dcausse: sure, I'll schedule a meeting
[12:25:59] pfischer: thanks!
[12:35:03] pfischer: sent an invite, please feel free to change the time
[13:30:30] Welcome back gehel!
[13:30:56] o/
[13:34:30] o/
[13:49:42] \o
[13:50:12] .o/
[14:00:25] o/
[14:05:27] Just a heads-up that I created T403301 last week to talk about Search and DPE SRE maybe taking a more active role in maintaining the opensearch deb pkgs. If y'all have any feedback, feel free to add it to the ticket
[14:05:28] T403301: Discuss OpenSearch 3 roadmap/future improvements - https://phabricator.wikimedia.org/T403301
[15:46:21] looking at the phab board... realized we shipped the changed Japanese analysis chain with last week's train, kicked off another reindex (but it should skip most wikis that don't have the ja analysis chain)
[15:46:52] decided not to wait for commons/wikidata since this fixes a problem with bad content failing the indexing pipeline
[15:47:10] (usually we would just do commons/wikidata a couple times a year and let changes build up)
[15:51:06] sure
[15:54:19] Thanks, Erik!
[15:54:19] also had a ponder on auto_expand_replicas; I don't really want the code to vary between number_of_replicas and auto_expand_replicas, seems like unnecessary extra paths. Think I'm going to simply configure wmf-config with ranges like "2-2"
[15:54:38] could have cirrus magic 2 into "2-2", but not sure that's worthwhile either. Can just configure it "properly"
[15:55:55] ebernhardson: perhaps not even worth the effort to make them "2-2"? I'm not even sure that opensearch has an optimization for this
[15:57:50] dcausse: well, it rejects a plain `2`, so we have to provide a range
[15:58:16] dcausse: otherwise we'd have to maintain code paths for number_of_replicas and auto_expand_replicas (for example when the reindexer sets 0 replicas while reindexing)
[15:59:48] ebernhardson: sure, but I wonder what "2-2" would bring vs "0-2"
[16:00:07] * dcausse looks at opensearch codebase
[16:01:02] dcausse: ahh, well I suppose the concern is that "0-2" allows it to remove replicas. It probably won't, but we never want it to have only 1 replica
[16:02:00] my main idea was to skip the "adaptAutoExpandReplicas" step in opensearch
[16:02:15] workout, back in ~40
[16:03:37] hmm, so essentially it loops over all the indexes, decides the desired number of replicas, then changes as necessary.
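A rough sketch of the trade-off being discussed: with a zero-width range like "2-2", the periodic auto-expand recalculation always lands on the same replica count (a no-op), whereas "0-2" can shrink replicas as eligible nodes disappear. This is an illustrative model, not OpenSearch's actual code; the function names and the clamp-to-range behavior are assumptions.

```python
def parse_auto_expand(setting):
    """Parse an auto_expand_replicas setting like "0-2" or "2-2".
    OpenSearch rejects a bare number, so a fixed count is written "2-2"."""
    low, high = setting.split("-")
    return int(low), int(high)

def desired_replicas(setting, eligible_data_nodes):
    """Hypothetical model of the auto-expand decision: clamp
    (eligible nodes - 1 primary copy) into the configured [min, max]."""
    low, high = parse_auto_expand(setting)
    return max(low, min(high, eligible_data_nodes - 1))

# A zero-width range always resolves to the same count, so the periodic
# recalculation never produces a settings change.
assert desired_replicas("2-2", eligible_data_nodes=30) == 2
assert desired_replicas("2-2", eligible_data_nodes=2) == 2

# "0-2" can drop below 2 replicas when few eligible (non-banned) nodes remain.
assert desired_replicas("0-2", eligible_data_nodes=2) == 1
```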
Indeed this would continue to regularly check
[16:04:15] but it seems to run OK if we don't have too many banned nodes
[16:05:00] but if the range is a zero-width range it will always return no change; it's only that it will have to regularly re-calculate the no-change
[16:06:40] IIRC the slow part was the regex on the banned node list, wondering when this happens
[16:11:31] hmm, not entirely sure, but poking around it does indeed seem like it would be part of the cluster deciders
[16:11:55] in DiscoveryNodeFilters.match
[16:12:26] yes, I think it loops over all the nodes before looking at min/max
[16:12:40] so indeed, if we want to skip that we would need to move away from auto_expand_replicas; it will still regex-match all nodes * all indices
[16:12:55] didn't realize that was the goal, we can do that in cirrus I guess
[16:13:26] ebernhardson: or we can ignore it, I guess we need to find where the sweet spot is in terms of the number of banned nodes
[16:14:12] it was a list of 39 nodes, not huge but particularly high for "normal" operations
[16:15:05] hmm, it's a regex, maybe we just need to use 1 regex instead of 39?
[16:15:39] indeed, most likely it's the compilation of the regex that causes the slowness
[16:15:59] yea, I was a little surprised to not see a regex cache
[16:16:01] it's building a new lucene automaton on every call to this function
[16:16:05] just compiles it on-site
[16:16:11] yes...
[16:18:07] current main branch is no better, basically the same
[16:18:34] I looked at the cirrus code to see how to dynamically switch between number_of_replicas and auto_expand and I agree with you, found this particularly tedious and probably quite error-prone
[16:19:16] it's not super hard, but it seems like unnecessary complication and opportunity for error
[16:19:32] yes
[16:20:26] maybe just an alert when we detect more than 5 nodes in the ban list?
[16:22:41] seems plausible.
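The "1 regex instead of 39" idea above can be sketched as follows. Python's `re` stands in for the Lucene automata the conversation mentions, and the node names are hypothetical; the point is the same: compile one alternation once and reuse it, instead of rebuilding every pattern on each match call.

```python
import re

# Hypothetical banned node names standing in for the 39-entry list.
banned = [f"elastic10{n:02d}" for n in range(39)]

def is_banned_naive(node):
    """Per-call compilation, as described for DiscoveryNodeFilters.match:
    every invocation rebuilds all 39 patterns."""
    return any(re.compile(p).fullmatch(node) for p in banned)

# Combine all patterns into one alternation, compiled once and cached.
BANNED_RE = re.compile("|".join(banned))

def is_banned_cached(node):
    """Single pre-compiled regex; each call is just a match."""
    return BANNED_RE.fullmatch(node) is not None

assert is_banned_naive("elastic1005") and is_banned_cached("elastic1005")
assert not is_banned_naive("elastic2001") and not is_banned_cached("elastic2001")
```

The behavioral results are identical; the cached version simply moves the compilation cost out of the hot loop, which is what a regex cache inside OpenSearch would also achieve.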
Using automation to make them all into 1 regex seems tedious as well (example: updating the existing list)
[16:26:15] yes
[16:29:41] Brian filed T399900, perhaps that would be just enough? With this we would have flagged these 39 banned nodes a lot earlier, I guess
[16:29:41] T399900: OpenSearch: Create verification scripts for common operations/fight config drift - https://phabricator.wikimedia.org/T399900
[16:30:26] hmm, yea that will work too
[16:32:13] anyways, totally fine by me to simply decline (or move back to the backlog) T402627
[16:32:14] T402627: Stop using auto_expand_replicas on indices hosted by the cirrussearch cluster - https://phabricator.wikimedia.org/T402627
[16:55:18] hmm, should the dumps filenames have the snapshot date in them?
[16:55:39] it's in the directory path, but maybe it should make it into the filenames as well to make them more individually identifiable
[16:58:59] hmm, yes almost certainly
[17:12:06] back
[17:21:24] dinner
[18:07:16] lunch, back in ~40
[19:15:05] back
[19:15:45] hmm, I can make the prod enwiki_content solve `tayps of wlding difacts`, but the query (with the main query as match_none) takes 450 ms :S
[19:16:19] prefix length from 2 to 1 seems the most impactful to timing, but also required for fixes like wlding -> welding
[19:17:23] * ebernhardson is separately unclear on the best way to thread through the profiles, it's a little tedious to duplicate 3 prod profiles into 9 to test variations
[19:31:05] I think `tayps of wlding difacts` might deserve a t-shirt :)
[20:58:45] ryankemper: I was thinking we could look at T403534 today, but if you have anything else LMK
[20:58:45] T403534: Add ipoid-opensearch namespace to dse-k8s Kubernetes clusters - https://phabricator.wikimedia.org/T403534
[21:18:44] I got the official prometheus opensearch exporter working, here's a list of its metrics in case anyone wants to compare/contrast with what we currently get: https://phabricator.wikimedia.org/P82431
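On the earlier point about putting the snapshot date into the dump filenames, not just the directory path: a file named only `enwiki-cirrussearch-content.json.gz` stops being identifiable the moment it is copied out of its dated directory. A minimal sketch of a self-identifying naming scheme; the layout, field order, and extension here are illustrative assumptions, not the actual pipeline's conventions.

```python
from pathlib import Path

def dump_path(base, wiki, snapshot_date, suffix="json.gz"):
    """Build a dump path where the snapshot date appears in both the
    directory and the filename, so each file identifies itself even
    after being moved or downloaded on its own."""
    filename = f"{wiki}-{snapshot_date}-cirrussearch-content.{suffix}"
    return Path(base) / snapshot_date / filename

p = dump_path("/dumps", "enwiki", "20250908")
# The date survives in the filename independently of the directory.
assert p.name == "enwiki-20250908-cirrussearch-content.json.gz"
assert p.parent.name == "20250908"
```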