[07:24:20] cirrus-streaming-updater consumer-search is borked in eqiad since yesterday 21:00 :/
[07:27:39] it's still running an old version of the job
[07:28:34] Caused by: org.elasticsearch.client.ResponseException: method [POST], host [https://search.svc.eqiad.wmnet:9443], URI [/_bulk?timeout=120000ms], status line [HTTP/1.1 504 Gateway Timeout]
[07:41:05] deploying the new version seems to help
[07:42:43] weird, there's a constant low rate (~0.5 rps) of cirrus codfw.rejected errors since May 22...
[08:07:45] still having issues with the updater in eqiad...
[08:08:31] need to double-check but the timeout seems to always involve :9443
[08:52:28] unsure if that's it, but cirrussearch1110 is running psi but is in the omega LVS pool
[09:17:08] psi & omega are red in eqiad
[09:33:04] I have deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1152020 and now pooled cirrussearch1110 for psi.
[09:34:28] thanks!
[09:34:46] deleted the red indices, let's see if this helps the update pipeline stabilize
[09:37:01] clusters are now green
[09:37:16] going to recover these indices now
[09:55:20] filed T395546 to track progress
[09:55:20] T395546: opensearch psi and omega clusters red in eqiad - https://phabricator.wikimedia.org/T395546
[09:58:47] good news is that the update pipeline is working again and lag is starting to decrease, bad news is that it'll take some time to recover these indices, I seem to have lost my script to copy from one cluster to another :(
[10:17:05] we'll have some '_first' indices, it's been a while, I hope it's not going to cause issues in some tooling :)
[10:17:21] lunch
[13:11:03] sounds like we had some issues with EQIAD, anything I can do to help?
[13:12:59] o/
[13:13:41] inflatador: currently finishing restoring/checking the lost indices, but trying to understand what caused this would be helpful
[13:14:29] something happened yesterday around 21:00 UTC causing psi & omega to lose primary shards on ~60 indices
[13:15:37] I was about to check the masters' logs around that time to understand what could have happened
[13:17:04] I know we started some decom work around that time
[13:17:33] but the hosts we removed had already been banned for some time, and the clusters were green when I left
[13:17:42] dropping nodes could explain it
[13:18:24] the logs on the master should tell us when we went red
[13:19:42] there was a small discrepancy in node declaration, I initially thought that was what caused the update pipeline to misbehave, but it might just have been a red herring
[13:19:54] Not sure if it would help, but we can put those nodes back into rotation
[13:19:54] (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1152020)
[13:20:13] inflatador: no I don't think so
[13:22:41] then I think that's the root cause, we pulled too many nodes out before the shards could recover
[13:23:10] btullis dcausse Steve and I are in https://meet.google.com/xfs-pkcw-mbu if you want to discuss more
[13:23:21] sure
[13:23:32] we could also restore the indices from DFW via snapshot if that helps
[13:39:21] \o
[13:40:21] curious, i imported enwiki with opening_text copied to suggest, it only added 6gb to enwiki_content
[13:40:58] .o/
[13:46:33] o/
[13:49:09] results are middling :P Well, i have to come up with an actual test suite, but the single test case of tayps of wlding difacts only gets to `taps of welding defects` on enwiki_content (w/opening_text), curiously simplewiki_content gets the full fix to types of welding defects
[13:49:29] suggests there is some weight to the idea of constructing corpuses to suggest from, instead of whatever we have, but i dunno how to do that :P
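
The copy_to experiment above (opening_text copied into the suggest field, then exercised with the "tayps of wlding difacts" test case) could be reproduced with a request along these lines. This is a minimal sketch only: the host placeholder, the did_you_mean label, and the direct_generator settings are illustrative assumptions; only the suggest field and the test query come from the log itself.

  # Sketch: run a "did you mean" phrase-suggester query against an index whose
  # opening_text is copy_to'd into the suggest field. Host and index are
  # placeholders, not values taken from this log.
  import requests

  HOST = "http://localhost:9200"   # replace with the relforge/test cluster endpoint
  INDEX = "enwiki_content"

  query = {
      "size": 0,
      "suggest": {
          "text": "tayps of wlding difacts",
          "did_you_mean": {
              "phrase": {
                  "field": "suggest",  # populated via copy_to from opening_text
                  "size": 1,
                  "direct_generator": [
                      {"field": "suggest", "suggest_mode": "always"}
                  ],
              }
          },
      },
  }

  resp = requests.post(f"{HOST}/{INDEX}/_search", json=query, timeout=30)
  resp.raise_for_status()
  # Print each suggested correction with its score.
  for option in resp.json()["suggest"]["did_you_mean"][0]["options"]:
      print(option["text"], option.get("score"))
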
[13:59:56] "taps of welding defects" is already far better :P
[14:13:00] I was wondering if a dedicated index per language, possibly smaller with a single shard and high replication, could be interesting to fine-tune the language model and possibly get better perf
[14:14:14] that's definitely more work (need some more tooling), and I have no clue how we could "fine-tune" this, so maybe not worth it
[14:45:31] dcausse incident report is up at https://wikitech.wikimedia.org/wiki/Incidents/2025-05-29_OpenSearch_clusters_unavailable , feel free to add/change anything
[14:45:41] thanks!
[14:45:58] I also just tested my ban script against cloudelastic and it worked, so I don't think that was the issue
[14:46:11] if you want to check it out though LMK
[14:48:32] break, back in ~20
[15:01:49] dcausse: yea i've been pondering that idea, switching the copy_to is super easy and gets some of the way there, a dedicated suggest index gives more opportunity to construct a better corpus, but i wonder how much we could do that outside 2 or 3 languages
[15:04:18] but it would mean we have the opportunity to fix things, via adjusting statistics. With the copy_to we get whatever we get
[15:06:04] yes not sure that's worth it, with copy_to you at least get suggestions that are supposed to be in your index, with a generic index it's not so guaranteed
[15:30:47] i'm also curious what it looks like to suggest straight from queries... while we probably can't deploy it, I could put together a test that dumps a bunch of queries that resulted in clickthroughs into relforge
[15:39:08] curious: https://docs.opensearch.org/docs/latest/api-reference/document-apis/bulk-streaming/
[15:40:42] interesting
[15:41:07] true that batch_size estimation has always been kind of a problem
[15:41:11] sorry, been back
[15:41:33] at that point they almost need to dump HTTP, but i kinda understand why they don't
[15:59:09] heading out
[16:03:37] .o/
[18:26:36] ryankemper ebernhardson I've borrowed Observability's OpenSearch dashboard to create a more SRE-focused dashboard at https://grafana.wikimedia.org/goto/IS6r51fHg?orgId=1 . Obviously we'll need to pull in a lot more panels from our existing dashboards, feel free to add panels, offer feedback etc. Ref T392222
[18:26:37] T392222: Create ops-focused OpenSearch dashboard - https://phabricator.wikimedia.org/T392222
[18:32:38] nifty!
[19:04:43] Needs a lot of work, but hopefully a good start
[20:50:45] taking cat to vet, back in ~1h
[21:29:34] back
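
For reference, the dedicated per-language suggest index floated in the 14:13 and 15:01 messages (single shard, high replication, its own suggestion corpus) would boil down to an index creation along these lines. A minimal sketch only: the index name, replica count, and bare text mapping are assumptions for illustration, not anything decided in the log.

  # Sketch: create a small, single-shard, highly replicated per-language index
  # that holds only the text a phrase suggester would read from. All names and
  # numbers here are placeholders.
  import requests

  HOST = "http://localhost:9200"   # placeholder for the target cluster
  INDEX = "enwiki_suggest"         # hypothetical per-language suggest index

  body = {
      "settings": {
          "number_of_shards": 1,    # single shard, as floated in the discussion
          "number_of_replicas": 8,  # high replication so many nodes can serve it
      },
      "mappings": {
          "properties": {
              # The field the suggester would query; which corpus feeds it
              # (opening_text, query logs, ...) is the open question above.
              "suggest": {"type": "text"}
          }
      },
  }

  resp = requests.put(f"{HOST}/{INDEX}", json=body, timeout=30)
  resp.raise_for_status()
  print(resp.json())
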