[00:08:54] restart on second batch, all quiet so far
[01:27:56] yup codfw completed without issue
[01:29:48] But yeah, so summarizing the day: we proved what we already suspected, which is that the issue is only occurring in eqiad
[01:30:01] I'm hoping with some digging we can find an obvious explanation, like cross-connectivity issues between masters or something
[13:09:56] ryankemper agreed, one thing I remembered last night is that we did have some network issues with rows E and F in eqiad, ref T393911 . I was wondering if that could be a factor
[13:14:09] T393911: Figure out why OpenSearch operational scripts frequently fail to connect - https://phabricator.wikimedia.org/T393911
[13:36:48] \o
[13:39:01] first draft of autocomplete ab test, might need a little polishing but should be mostly there: https://people.wikimedia.org/~ebernhardson/T397732-completion-auto-fuzziness.html
[13:43:34] just repooled CODFW
[13:43:37] .o/
[13:56:17] inflatador: (re rows E/F) yeah I think that's a good thread for us to pull on
[13:56:33] I'm in for the first half of today then will be out on pto 2nd half, also out all friday
[13:59:23] ryankemper ACK, looks like cirrussearch1094 and cirrussearch1100 are in E/F. But then again, we also have omega/psi masters in E/F, so it couldn't be just a network issue
[14:00:37] although I guess it could be rack-specific
[14:18:29] hmm, can we do something like put a script on all the hosts that attempts to open the inter-node ipc port with netcat or similar on all machines? Seems like it might not be too hard of a script, and can be run with cumin or something
[14:18:59] essentially directly test all the connections between all the hosts (n x m)
[14:23:20] yeah, that seems doable. I still don't think it's a network issue, but we should definitely rule it out
[14:23:56] the omega/psi thing is curious...i agree it's not a strong contender
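(Editor's sketch of the n x m check proposed above: from each host, attempt a TCP connection to the inter-node transport port on every other host. The port, timeout, and host list here are assumptions rather than values from the chat; chi/omega/psi each listen on their own transport port, so it would need one pass per cluster.)

```python
#!/usr/bin/env python3
"""Full-mesh connectivity probe: attempt a TCP connection to the
OpenSearch inter-node (transport) port on every peer.

Sketch only -- PORT and HOSTS are placeholder assumptions; the real
host list would come from the puppet/cumin inventory."""
import socket
import sys

PORT = 9300      # assumed chi transport port; omega/psi use other ports
TIMEOUT = 3      # seconds before a connect attempt counts as a failure
HOSTS = [        # hypothetical inventory
    "cirrussearch1094.eqiad.wmnet",
    "cirrussearch1100.eqiad.wmnet",
]

def probe(host: str, port: int) -> bool:
    """Return True if host:port accepts a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=TIMEOUT):
            return True
    except OSError:
        return False

def main() -> int:
    me = socket.gethostname()
    failures = 0
    for host in HOSTS:
        ok = probe(host, PORT)
        print(f"{me} -> {host}:{PORT} {'ok' if ok else 'FAILED'}")
        failures += 0 if ok else 1
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())
```

(Run from every node, e.g. via something like `cumin 'A:cirrus-eqiad' 'python3 /tmp/probe.py'` with a hypothetical alias, that covers all n x m pairs; failures clustering on rows E/F hosts would support the network theory from T393911.)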
[14:31:23] I guess I should be keeping better track of which host is the master before and after a master restart
[14:31:43] * inflatador wonders if our exporters keep track of that
[14:33:42] hmm, i imagine it must
[14:34:52] although, in a quick ctrl-f on the elasticsearch_exporter repo, i'm only seeing 'master is stable'
[14:34:53] https://github.com/opensearch-project/opensearch-prometheus-exporter here's the docs for the official opensearch exporter. Looks like we use https://github.com/prometheus-community/elasticsearch_exporter
[14:35:32] i guess a separate, mildly silly, method would be to look at metrics only the master reports. They probably have a hostname field
[14:35:42] and the version we use is from 2019 ;( https://github.com/prometheus-community/elasticsearch_exporter/releases?q=1.1.0&expanded=true
[14:35:58] inflatador: are we meeting?
[14:36:06] Trey314159 damn, sorry! BRT
[14:36:06] i don't know that prometheus_exporter does that, but we have a custom python collector for additional metrics i wrote that does a few master-only metrics
[14:36:17] i'll stop distracting you :)
[14:46:06] * ebernhardson re-reads ab report...and realizes the copy needs plenty more work :P maybe being a bit too harsh, calling everything practically unchanged. A success rate increase of 0.4% is something like 4M queries a year
[14:46:31] s/queries/autocomplete sessions/
[15:07:19] inflatador: sre retro starting now https://meet.google.com/rcb-kbfo-rfx
[15:07:34] there will be only 4 of us counting me and you, so not sure if it will actually happen or not, but figured i'd give ya a heads up
[15:09:13] ryankemper brt
[15:13:31] ebernhardson: should I read the A/B report now or wait for an update?
[15:13:47] Trey314159: it should be mostly there, but the conclusions probably need some work
[15:13:58] the graphs and numbers should all be right, probably
[15:14:18] will look today
[15:22:34] you could make two different versions of the conclusion and then A/B test them to decide which is better :P
[15:36:02] OK, so it looks like the stock elasticsearch exporter is at :9108/metrics for chi, and the custom exporter is at :9120/v1/metrics
[16:09:04] Also created T400389 so we can look into the OpenSearch exporter at some point in the future
[16:09:05] T400389: Consider replacing Elasticsearch exporter with stock OpenSearch exporter - https://phabricator.wikimedia.org/T400389
[16:10:42] FWIW, as Erik said, there are labels that only appear on master-eligibles, but I haven't found any so far that only appear on the active master. Still checking
[16:25:20] lunch, back in ~60
[17:49:58] back
[18:20:40] Trey314159 according to https://github.com/prometheus-community/elasticsearch_exporter, the Elasticsearch exporter we're currently using comes from justwatch.com!
[18:30:10] inflatador: neat! small world
[18:30:55] Yeah, fun coincidence
[18:31:20] not directly related, but https://github.com/prometheus-community/elasticsearch_exporter/blob/master/collector/health_report.go#L200 suggests there must be a way to figure out recent masters
[18:48:08] ah, apparently this is a newer elastic-only feature https://github.com/prometheus-community/elasticsearch_exporter/pull/1002
[18:50:16] Worst-case scenario, we could probably add something to our custom exporter with this info, although I dunno if that's out of scope for what it currently does
[20:00:15] do we have a functioning logstash dashboard? All I've found is https://logstash.wikimedia.org/goto/d840f0f212bf1eddd80f15f64ba67360 and it doesn't work. I've opened T395571 but if I'm missing something LMK
[20:00:16] T395571: Verify/fix Logstash pipeline for Search Platform-owned OpenSearch clusters - https://phabricator.wikimedia.org/T395571
[20:04:35] looks like `elected-as-master` appears in the logs when there's a master change, or maybe just an election
[20:21:44] sigh...comcast decided i didn't need internet for like 4 hours. fun :)
[20:23:03] (╯°□°)╯︵ ┻━┻
[20:25:36] on cirrussearch1100 at least, there seems to be a day's gap between log.2.gz and log.1? log.2.gz ends at 2025-07-22T23:17:20 and log.1 starts at 2025-07-23T23:01:46
[20:34:19] https://www.elastic.co/docs/troubleshoot/elasticsearch/discovery-troubleshooting
[21:14:52] * ebernhardson realizes an `Autocomplete Redirect Rate` that measures go-effectiveness might be interesting, but the data is awkward
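(Editor's sketch following the `elected-as-master` observation at 20:04: reconstruct a master-election timeline by scanning the rotated server logs, gzipped rotations included, for election events. The log glob is a guess at the layout on these hosts, not a confirmed path.)

```python
#!/usr/bin/env python3
"""Print master-election events from rotated OpenSearch server logs.

Sketch only -- LOG_GLOB is an assumed path; adjust to wherever the
chi/omega/psi server logs actually live."""
import glob
import gzip

LOG_GLOB = "/var/log/opensearch/*.log*"  # assumption; matches .gz rotations too

def open_log(path: str):
    """Transparently open plain or gzip-compressed log files as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, errors="replace")

for path in sorted(glob.glob(LOG_GLOB)):
    with open_log(path) as fh:
        for line in fh:
            if "elected-as-master" in line:
                # the raw log line carries the timestamp and the winning node
                print(f"{path}: {line.rstrip()}")
```

(Fanned out across the master-eligibles with cumin, this would give the before/after master for each restart; for live state, `curl -s localhost:9200/_cat/master`, or `_cat/cluster_manager` on newer OpenSearch, reports the currently elected master, assuming chi's HTTP port.)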