[09:33:22] errand+lunch
[13:12:10] dcausse looks like we forgot this one: https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/1110833 Are we still OK to merge?
[13:15:45] o/
[13:15:50] inflatador: looking
[13:16:50] inflatador: I think so? I might have to rebase perhaps, looking
[13:17:06] no it rebased cleanly
[13:17:40] cool, I think it requires a grizzly deploy, will work on that with Ryan at our pairing today
[13:24:25] sure, thanks!
[13:55:31] pfischer: when you have a moment could you take a look at https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/1120205 ? it worked well when I backfilled articlecountry, it's not ideal but it's the easiest solution I found at the time
[14:09:56] \o
[14:12:55] * ebernhardson sometimes wishes opensearch would say "I see you've tried to use lucene syntax and i've switched to using it" instead of "Bad human, no lucene for you!"
[14:13:41] there is a toggle, it could, but instead it just complains :P
[14:38:17] o/
[14:44:41] dcausse: looking
[14:44:46] thx!
[14:45:48] dcausse: +2 done
[14:46:29] thanks!
[14:50:06] .o/
[14:54:23] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133429 yet another puppet patch for fixing var paths
[14:54:35] ebernhardson: you mean the query_string failing to parse?
[14:56:23] dcausse: i mean when you use lucene syntax it pops up a box in the bottom right that says you tried to use lucene syntax but it's set to dashboard query language, and links the DQL docs
[14:56:47] but it never remembers that i set it to lucene syntax, it reverts back to DQL every time i visit
[14:57:04] oh in opensearch dashboard, yes I stumble on this frequently
[14:57:31] i just feel like better UI would be switching it for you, instead of pushing DQL
[14:57:56] yes, tbh I'm often lost in opensearch dashboard...
[14:58:36] i suppose i haven't looked closely enough, maybe DQL has real bools or some such that makes it better
[15:44:40] gehel: anything interesting at https://app.asana.com/0/0/1209864449777614 ? I don't have access in asana, but it's related to making the search bar more prominent
[15:57:00] workout, back in ~40
[15:59:21] also, on the plugin errors d-causse and I were seeing, there are differences in `plugin.mandatory` vs relforge. For example `extra-analysis-esperanto` is listed as mandatory on cirrussearch, but not on relforge. We have `opensearch-extra-analysis-esperanto` instead
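(Editorial note on the `plugin.mandatory` mismatch at 15:59:21: one quick way to see this kind of difference is to diff the plugin lists the two clusters report. A minimal sketch, assuming port 9200 is reachable on one node of each cluster; the host names are placeholders, and `plugin.mandatory` itself lives in the node config rather than in any API response.)

```bash
#!/bin/bash
# Sketch: compare installed plugins on a cirrussearch node vs a relforge node.
# Host names below are placeholders; substitute real nodes.
CIRRUS_HOST=cirrussearch-node.example
RELFORGE_HOST=relforge-node.example

# _cat/plugins returns one row per node/plugin; keep only the plugin name,
# de-duplicate across nodes, and diff the two clusters' plugin sets.
diff \
  <(curl -s "http://${CIRRUS_HOST}:9200/_cat/plugins?h=component" | sort -u) \
  <(curl -s "http://${RELFORGE_HOST}:9200/_cat/plugins?h=component" | sort -u)
```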
[16:53:50] ryankemper another CR to fix some vars: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133481
[17:05:36] ^ self-merged that guy
[17:11:35] opensearch is up and running, but not joining the cluster. Probably due to firewall rules...checking
[17:30:31] nice! any luck with joining the cluster? I can help look if not
[17:32:26] feast your eyes on this! https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133487
[17:33:16] makes sense
[17:34:54] sadly, it doesn't look like the envoy changes made much difference in the connection failures :( They aren't high rate, but they continue
[17:35:04] ryankemper did you end up reimaging cirrussearch2055 last night? I just wanna make sure we do at least one complete teardown/rebuild before starting the migration
[17:35:41] ah, that's too bad...do you know if they deployed mw yet? Maybe it hasn't taken effect?
[17:36:01] i suppose i was assuming they had but hadn't double checked. They should have had a deployment window for EU
[17:36:43] yea there was a scap sync-world at 10:53 UTC today
[17:37:15] we also had a very large spike, ~12k over 30s. Lines up with Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs2013.*,lvs1019.*} and A:lvs
[17:37:26] so sounds like that restart isn't too graceful :(
[17:37:34] LOL
[17:38:13] i suppose i'll also poke at the cirrus retry logic. This isn't super important, but it bugs me that we fail user requests :P
[17:39:14] OK, looks like the firewall rules will work...running puppet in codfw
[17:39:24] this reminds me that we need new cumin aliases for cirrussearch
[17:39:38] i kinda wish there was a nice way to remove spikes in the opensearch dashboard, it's hard to tell if there was a reduction between yesterday and today since that spike causes the rest of the graph to be almost nothing
[17:42:02] ryankemper come to think of it, we probably don't want to tear down cirrussearch2055...at least not if it has primary shards. Might have to do another one
[17:42:55] `bking@cirrussearch2055:~$ curl -s http://0:9200/_cat/nodes | grep cirrus
[17:42:55] 10.192.23.21 1 21 6 1.12 0.50 0.34 dir - cirrussearch2055-production-search-codfw`
[17:44:44] actually the rate might be reduced: looking at 08:00-18:00, the 30th had ~900, the 31st had ~2k, the 1st had ~1k, the 2nd (today) had ~270. But with that kind of variance it's hard to say
[17:46:38] the 27th had 500, the 26th (last Wednesday) had 400. So all over the place. Today is lowest, but maybe let it run for a week and hope it stays lower
[17:49:23] yeah, it doesn't seem to hurt anything at least
[17:50:30] lunch, back in ~45
[18:03:55] inflatador: didn't reimage
[18:37:16] back
[18:40:00] just fyi: we're getting more traffic on WDQS again... https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wdqs&var-graph_type=%289102%7C919%5B35%5D%29&viewPanel=44
[18:41:29] gehel did I miss an alert, or what tipped you off to this?
[18:42:13] btw also noticed (via sukhe) that cloudelastic1008 is back up and joined the cluster/has shards, but isn't pooled
[18:42:51] it's basically doing 90%+ of the expected work if it's in the cluster, being pooled is minor
[18:44:31] yeah, I just saw that. DC Ops worked some magic
[18:44:44] 2.5 TB of data in chi, so definitely in use
[18:51:19] ryankemper ACK, let's find another canary we can use to test the cookbook
[18:59:20] relatedly, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133495 is the patch that should configure envoy to auto-retry connection issues
[19:05:27] perhaps interesting paper - https://irrj.org/article/view/19625 - "Don't Use LLMs to Make Relevance Judgements", which presents the keynote from ACM SIGIR 2024 in paper form
[19:06:27] :eyes
[19:10:16] and...merged
[19:13:39] thanks!
[19:14:30] inflatador: we were looking at those graphs with ryankemper. I don't think anything is broken yet, so I don't think we should have had an alert. But we might soon if the trend continues :)
[19:19:41] ACK
[20:35:46] yea it's interesting, also probably relevant to keep in mind that he keeps comparing it to TREC (and others), but TREC is a bit unique in that they hire a team of people to do the grading there
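
(Editorial note on the teardown question at 17:42:02: a minimal sketch of checking whether a node still holds primary shards before tearing it down, run locally on the node like the `_cat/nodes` check at 17:42:55. The node name is taken from that output; adjust as needed.)

```bash
#!/bin/bash
# Sketch: list primary shards currently allocated to cirrussearch2055.
# An empty result means the node only holds replicas (or nothing) and is a
# safer candidate for a full teardown/rebuild.
NODE=cirrussearch2055-production-search-codfw
curl -s "http://localhost:9200/_cat/shards?h=index,shard,prirep,state,node" \
  | awk -v node="$NODE" '$3 == "p" && $5 == node'
```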
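(Editorial note on the envoy auto-retry patch at 18:59:20: once it is deployed, envoy's admin interface exposes per-cluster retry counters, which is one way to confirm the policy is actually kicking in rather than the failures still reaching callers. A minimal sketch, run on an affected host; the admin port used here is an assumption, adjust to the local envoy config.)

```bash
#!/bin/bash
# Sketch: watch envoy's retry counters to confirm the new retry policy fires.
# upstream_rq_retry / upstream_rq_retry_success / upstream_rq_retry_overflow
# are standard envoy cluster stats; growing values mean connection failures
# are being retried instead of surfacing directly to the caller.
ADMIN_PORT=9631   # assumption: local envoy admin/telemetry port
curl -s "http://localhost:${ADMIN_PORT}/stats" \
  | grep -E 'upstream_rq_retry(_success|_overflow)?:'
```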