[13:04:37] \o
[13:07:23] we still owe y'all a cluster restart...since it blew up on Monday, we're going to proactively depool eqiad. Should be done after the Weds mtg
[13:14:17] it's alright, things sometimes take a bit
[13:14:31] i'm still writing a darned autocomplete A/B test report...mostly done, but i kinda started over at one point
[13:14:47] Ryan and I did some log diving yesterday; we're still not sure exactly why the split brain is happening
[13:15:49] my theory from the 7/7 outage (that it was lingering master state from before we removed all the elastic hosts) is shot
[13:16:48] If it keeps happening, we may need to update the cookbook to exclude some hosts from voting before we restart them or something
[13:17:28] hmm, indeed it's not something we want to allow to happen regularly. My intuition is it has to come from the masters config somehow...but not sure
[13:22:44] yeah, we noticed some typos in opensearch.yaml, which gets translated into FW rules. But none of the nodes with typos were masters (ref https://gerrit.wikimedia.org/r/c/operations/puppet/+/1171307/1/hieradata/role/eqiad/cirrus/opensearch.yaml)
[13:50:10] lol, asked claude what it thought of my report. "established pattern of explaining the metric's importance, the desired direction for improvement, and concluding that the observed changes are practically meaningless despite statistical significance"
[13:50:59] LOL
[14:06:25] that SearchDigest thing is curious...i have no clue how their dataset isn't completely polluted with garbage. They basically just push all full text search queries into a database: https://github.com/weirdgloop/mediawiki-extensions-SearchDigest/blob/master/src/SearchDigestHooks.php#L33
[14:06:52] i guess they limit to Special:Search and ignore the api, which might help a little, but we have lots of browser automation happening too
[14:08:41] they also simply throw away a few languages, but i guess that's not the end of the world: if ( ! ( $lang == 'ja' || $lang == 'lzh' || preg_match( '/^zh/', $lang ) ) ...
[14:10:48] i suppose one other limit is they only log queries that could be valid mediawiki titles, but that's most things
[14:16:28] I was also shocked at how clean the data was. I'm putting it down to many fewer & more targeted users, fewer vandals, fewer accidental queries, etc.
[14:32:48] hmm, clearly i don't fully understand what it's doing though. It seemed like if i repeat the same query a few times it should show up, but i'm not seeing it...
[14:39:11] ohh, they also only consider "autocomplete submits", not full text. If you provide `fulltext=1` (which Special:Search always does) it won't log
[14:40:04] basically they are limiting to things that invoked the `go` feature but failed
[14:43:06] ok yea that's it, you can inject arbitrary queries into it if you know how it works, so there isn't some hidden filtering: https://balatrowiki.org/w/Special:SearchDigest?prefix=dev&from=1752671492
[14:59:31] oh, no! now we're the hackers!
[15:49:58] workout/lunch, back in ~2h
[16:50:34] inflatador: ping me when you want to start the rolling restart
[18:32:44] Dog walk
[18:51:19] ryankemper just got back from lunch, will start shortly
[19:02:14] OK, I've depooled eqiad and I'm starting the rolling operation again. We have 12 more hosts (all masters) to restart
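(Editor's note: a minimal sketch of the 13:16:48 idea of having the cookbook pull a master out of voting before restarting it, revisited again at 23:04:04. It assumes the standard voting-config-exclusions API on the same HTTPS endpoint quoted elsewhere in this log; the function names, flow, and any auth handling are assumptions, not the actual cookbook.)

```python
# Hypothetical sketch, not the real cookbook: exclude a master-eligible node
# from voting before restarting it, then clear the exclusion afterwards.
import requests

CLUSTER = "https://search.svc.eqiad.wmnet:9243"  # endpoint quoted in the log

def exclude_from_voting(node_name: str) -> None:
    # Ask the cluster to reconfigure the voting set without this node.
    resp = requests.post(
        f"{CLUSTER}/_cluster/voting_config_exclusions",
        params={"node_names": node_name, "timeout": "60s"},
    )
    resp.raise_for_status()

def clear_voting_exclusions() -> None:
    # Must run once the restarted node is back, or the exclusion lingers.
    resp = requests.delete(
        f"{CLUSTER}/_cluster/voting_config_exclusions",
        params={"wait_for_removal": "true"},
    )
    resp.raise_for_status()

# Example flow around a single master restart (node name is a placeholder):
# exclude_from_voting("cirrussearch1122-production-search-eqiad")
# ... restart the opensearch unit on the host ...
# clear_voting_exclusions()
```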
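(Editor's note: and a rough Python paraphrase of the SearchDigest logging behaviour pieced together between 14:06 and 14:43 — only `go` submissions without `fulltext=1`, in a non-skipped language, that look like valid titles but found no page. This is a summary of the discussion above, not the extension's actual PHP code.)

```python
import re

SKIPPED_LANG = re.compile(r"^(ja|lzh|zh)")  # per the snippet quoted at 14:08:41

def would_be_logged(lang: str, fulltext: bool, is_valid_title: bool, hit_found: bool) -> bool:
    """Rough paraphrase of the condition discussed above; the real checks
    live in SearchDigestHooks.php and may differ in detail."""
    if fulltext:                  # Special:Search full text sets fulltext=1 and is skipped
        return False
    if SKIPPED_LANG.match(lang):  # ja / lzh / zh* wikis are ignored
        return False
    if not is_valid_title:        # only queries that could be valid MediaWiki titles
        return False
    return not hit_found          # only 'go' submissions that failed to find a page
```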
[19:14:01] @#$@#! I forgot to update the packages after cloudelastic. Which means we're doing this again ;(
[19:14:49] starting with eqiad this time since it's already depooled
[19:22:28] at least one of the servers (1074) wasn't able to start after applying the package, checking it out now
[19:22:38] I've stopped the cookbook for now
[19:22:41] :S
[19:22:57] * ebernhardson wonders how long until i stop trying to ssh to elastic1074
[19:24:08] [2025-07-23T19:13:23,688][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [cirrussearch1074-production-search-eqiad] fatal error in thread [opensearch[cirrussearch1074-production-search
[19:24:10] -eqiad][clusterApplierService#updateTask][T#1]], exiting
[19:24:12] java.lang.ExceptionInInitializerError: null
[19:25:18] yeah, that's odd...no problem with the omega service
[19:25:34] I was able to start it up manually
[19:25:47] err...I was able to start opensearch_1@production-search-eqiad.service manually
[19:26:27] Looking at the code, it's a weird error. I'm glad it works cuz i don't have ideas :P
[19:27:21] It's essentially creating pre-configured Ukrainian analysis components. That component has a function that converts the token stream to a new token stream. For some reason the token stream that was passed in was null
[19:27:46] or maybe something inside the stream was null, the NPE is from java.io.Reader on the input stream
[19:28:21] maybe some oddness reading stopwords.txt ...i dunno, glad it fixed itself :P
[19:28:33] We've definitely deployed in relforge and cloudelastic without seeing this error
[19:28:51] anyway, I'm gonna pick it back up again
[20:11:33] hmm, looks like we collect `--collector.systemd.enable-restarts-metrics` from node exporter, maybe we can use that to verify which hosts have been restarted
[21:11:03] * ebernhardson maybe should make the binomials use a direct test instead of bootstrapping...it does actually take quite some time when i ask it to calculate them all in sequence...
[22:10:34] sigh...printing the report to html shows some of my graphs...but not all of them :S
[22:10:56] it's weird that the integer distribution renders fine, but the ci distributions don't
[22:18:08] ryankemper looks like cirrus eqiad is unhappy again, good thing we depooled ;)
[22:21:32] stopping 1081 again, let's see if that's the one that does the trick
[22:23:41] nope, trying 1100 now
[22:26:45] cluster's in recovery
[22:27:06] really wish we understood how it gets there :S
[22:27:11] but at least we can fix it :)
[22:28:58] It was triggered by the restart of cirrussearch1122.eqiad.wmnet...just a single master host
[22:29:47] * inflatador wonders if our exporters have metrics about which host was the active master
[22:30:37] when this is done I can try and restart the active master, maybe that will trigger it again
[22:32:46] +1 to testing the restart
[22:33:13] ebernhardson ryankemper I have to hit the drugstore before it closes, can y'all keep an eye on the cluster? eqiad is depooled so there should be no impact. Be back in ~30
[22:33:30] kk
[22:33:38] thanks, brb
[22:33:38] inflatador: np
[22:42:12] wow, way better than i expected. pasted my functions using bokeh into claude and asked for seaborn and it kinda works. Well, it draws some graphs, but it wrecked my analysis :P Now i have to review and understand how it broke it...
[22:42:31] it somehow wrecked the table, which doesn't involve any graphs and shouldn't have changed...
[22:47:57] back
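(Editor's note: a sketch of the 20:11:33 idea of using the node_exporter systemd restart metrics to check which hosts the cookbook actually bounced. It assumes the collector exports `node_systemd_service_restart_total` and that the Prometheus URL below is a placeholder.)

```python
# Sketch: print the opensearch_1 unit restart counter per host; comparing the
# values before/after the roll (or wrapping the query in increase()) shows
# which hosts were actually restarted. PROM_URL is a placeholder.
import requests

PROM_URL = "http://prometheus.example.internal/api/v1/query"  # placeholder
QUERY = 'node_systemd_service_restart_total{name="opensearch_1@production-search-eqiad.service"}'

resp = requests.get(PROM_URL, params={"query": QUERY})
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("instance"), series["value"][1])
```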
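(Editor's note: on the 21:11:03 aside about replacing bootstrapped binomial numbers with a direct test — a minimal sketch of the direct version using scipy; the bucket names and counts are made-up placeholders, not real A/B test data.)

```python
# Sketch: direct binomial CIs instead of bootstrapped ones.
from scipy.stats import binomtest

buckets = {"control": (4210, 10000), "test": (4345, 10000)}  # (successes, trials), placeholders

for name, (k, n) in buckets.items():
    result = binomtest(k, n)
    ci = result.proportion_ci(confidence_level=0.95, method="wilson")
    print(f"{name}: phat={k / n:.4f} ci=({ci.low:.4f}, {ci.high:.4f})")

# The control-vs-test comparison itself can also skip the bootstrap with a
# two-proportion test, e.g. statsmodels.stats.proportion.proportions_ztest:
#   stat, pvalue = proportions_ztest([4345, 4210], [10000, 10000])
```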
[22:48:26] looks like we had another quorum failure about 10m ago? Did y'all do anything to fix it?
[22:49:01] sorry, didn't see :S
[22:49:47] i guess i was assuming `watch curl -s https://search.svc.eqiad.wmnet:9243/_cat/master` would error
[22:51:37] well...I could be jumping to conclusions
[22:52:30] I saw a ton of alerts in operations, including `HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9243): Read timed out`, which generally means the service is hard down
[22:52:41] hmm, indeed it does
[22:53:06] but I don't see a ton of timeouts in the cookbook output like I did the last few times it happened
[22:55:02] looks like it recovered on its own within 5 mins, which is an interesting piece of data in itself
[22:55:07] The stuff coming thru operations has been a bunch of alerts resolving
[22:56:01] the last set of timeout alerts started at 22:39:01
[22:56:18] and started clearing at 22:43:55
[22:57:24] Hmm, we should eventually look at whether there's a way to collate alerts a bit, because we're really flooding the channel when we have these
[22:57:54] yeah, I think alertmanager can deduplicate, but I haven't looked into how
[22:58:23] inflatador: with eqiad done shall I proceed to codfw? (after pooling)
[22:59:01] ryankemper let's try restarting the active master first
[22:59:23] that'd be `cirrussearch1122`. Starting now...
[23:00:49] sure enough that did it
[23:02:59] hmm, can we force an election prior to restarting the current master? Shouldn't have to though
[23:03:25] starting the service again didn't help either
[23:04:04] I don't think we can force an election, but we can remove a host from voting
[23:05:38] all right, I just stopped 1100, the cluster is coming back now
[23:08:24] * inflatador belatedly downtimes the eqiad hosts
[23:16:52] ryankemper I downtimed eqiad for an hour, but it can be removed as I just repooled it. I'm out of time for today, so feel free to do CODFW. I added some observations to T400160, feel free to add anything I missed there
[23:16:53] T400160: Investigate eqiad cluster quorum failure issues - https://phabricator.wikimedia.org/T400160
[23:22:46] Cool, proceeding to codfw in 10 mins
[23:36:17] heading out
[23:44:51] .o/
[23:45:27] alright gearing up
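(Editor's note: tied to the 22:29:47 question about tracking which host was the active master and the `_cat/master` watch at 22:49:47 — a small polling sketch that records master changes over time. It uses the endpoint already quoted in the log; the interval, output format, and lack of TLS/auth handling are assumptions.)

```python
# Sketch: poll the cluster for the elected master and log whenever it changes.
# May need verify=/path/to/ca or credentials depending on the cluster setup.
import time
import requests

CLUSTER = "https://search.svc.eqiad.wmnet:9243"

last_master = None
while True:
    try:
        resp = requests.get(f"{CLUSTER}/_cat/master", params={"format": "json"}, timeout=10)
        resp.raise_for_status()
        master = resp.json()[0]["node"]
    except Exception as exc:
        master = f"<no response: {exc}>"  # e.g. during a quorum failure
    if master != last_master:
        print(f"{time.strftime('%H:%M:%S')} active master: {master}")
        last_master = master
    time.sleep(15)
```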