[08:12:53] inflatador oof... saw the graph.
[08:12:56] o/
[09:34:23] ryankemper: BulkByScrollResponse seems to indeed be about indexing documents. Once servers are overloaded, it is reasonable to assume that write operations will also take more time. Or we might have a job that does a bunch of updates at a specific time? Or even an external bot updating a bunch of pages at 8pm, generating a bunch of page reindexes?
[11:47:04] Trey314159 thanks for the write-up. Let's touch base Monday or Wednesday? jawiki puzzles me a bit, but right now we have no LTR enabled on it. That's one case we'll need to handle with care.
[11:48:04] I'm on the same page re not deploying the 2025-02 models. I wonder if it would be worth planning these A/B tests at least once a quarter, in the hope of catching some seasonality effect (if it makes sense)?
[13:45:12] gmodena: not clear to me: is there more work needed on T386068?
[13:45:12] T386068: Implement articlecountry a new CirrusSearch keyword - https://phabricator.wikimedia.org/T386068
[13:53:22] gehel based on the latest comments in phab we should be ready (modulo testing). I had a small f/up patch to limit the number of terms in a search, but it might not be needed after all.
[13:58:08] gehel i'll add a comment in that thread
[13:58:20] inflatador o/
[13:58:47] gmodena: thx!
[15:11:21] going in to the office, back in ~30
[16:12:56] gmodena: to be clear, I'd be okay with deploying the 2025-02 models. The Japanese model is weird, but the new one is clearly better than the old one. I was just using that as an example where one is better than another and thinking about how to automate that decision.
[16:13:03] Running quarterly A/B tests would be interesting. I'd also like to compare a much older model with a newer one to try to gauge whether changes from model to model are random fluctuations, or if there is real drift in a consistent direction—i.e., older models don't perform as well because something has actually changed over time, either in our data or our users' search behavior.
[16:21:29] Trey314159 ack
[16:22:34] re older models. It would be interesting, and I think feasible. AFAIK we do have the full history of training models.
[16:24:22] what would be nice to have IMHO is a way to persist the results of the A/B tests. Maybe in some Iceberg table? I've been toying a bit with the idea today (to support some doc in scope for T385972)
[16:24:22] T385972: Deploy and test new MLR models - https://phabricator.wikimedia.org/T385972
[16:27:25] or maybe the ML folks already have some form of tracking we could piggyback on. There's no shortage of tooling for these use cases :)
[16:30:05] Trey314159 I do like your suggestion regarding improving the notebook's readability. I'll give it a try on Monday to see how much work it takes to implement the changes.
[16:31:04] but for today, I'm calling it a day :). Happy Friday - enjoy the weekend!
[16:35:13] Have a good weekend!
[16:36:40] .o/
[16:50:51] time to start the weekend!
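On the 16:24:22 idea of persisting A/B test results in an Iceberg table: a minimal sketch of what that could look like from a notebook, assuming a SparkSession with an Iceberg catalog is available. The catalog, table name, schema, model names, and values below are hypothetical placeholders, not an existing dataset.

```python
# Hypothetical sketch only: persist MLR A/B test summary metrics to an Iceberg table.
# Assumes a SparkSession configured with an Iceberg catalog named "analytics";
# the table name, schema, and row values are illustrative, not an existing dataset.
import datetime

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.search.mlr_ab_test_results (
        test_id   STRING,
        wiki      STRING,
        model_old STRING,
        model_new STRING,
        metric    STRING,
        value_old DOUBLE,
        value_new DOUBLE,
        run_date  DATE
    ) USING iceberg
    PARTITIONED BY (run_date)
""")

# Placeholder row: one record per test / wiki / metric pair.
results = spark.createDataFrame([
    Row(test_id="mlr-2025-02", wiki="jawiki", model_old="2024-11",
        model_new="2025-02", metric="ndcg@10", value_old=0.0,
        value_new=0.0, run_date=datetime.date(2025, 2, 28)),
])

# DataFrameWriterV2 append into the Iceberg table.
results.writeTo("analytics.search.mlr_ab_test_results").append()
```

A per-test, per-wiki, per-metric row layout like this would also make it straightforward to later compare a much older model against a newer one, as discussed at 16:13:03.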
[16:51:15] And I have friends coming over for a raclette - https://en.wikipedia.org/wiki/Raclette
[19:13:34] Ahh, the Power of Cheese!™ https://www.youtube.com/watch?v=-f_d7JBIwMA
[19:48:06] latency alerts again ;(
[19:59:27] Good news though, this one looks pretty clear https://logstash.wikimedia.org/goto/8f2da3b11aae385ac635965dd69eb76a
[20:00:13] `org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@28be706e on QueueResizingEsThreadPoolExecutor[name = elastic1066-production-search-eqiad/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 310.3ms, adjustment amount = 50,
[20:00:13] org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@4e773165[Running, pool size = 61, active threads = 61, queued tasks = 1000, completed tasks = 62763343]]`
[20:46:36] So this does seem capacity-related, but also constrained by the thread pool write queue size, which we set with https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/elasticsearch/cirrus.yaml#73 . Would it be safe to experiment with raising this value? It seems we haven't changed it in a while
[21:20:07] need to look at re-re-purposing those Relforge hosts too
[21:25:38] Hmm yeah I could go either way on the write queue
[21:26:15] Feels like given the writes are taking multiple seconds that bumping the threadpool probably won’t help a ton, but it’s hard to say
[21:26:36] I really want to figure out where these requests are coming from in the first place… quite difficult though
[21:33:03] Yeah, agreed on all counts
[22:10:08] still a lot of `BulkByScrollResponse` messages in logstash... the Bulk API uses the write thread pool (ref https://www.elastic.co/guide/en/elasticsearch/reference/7.17/modules-threadpool.html ). Docs imply the default write thread pool settings are a "size of # of allocated processors, queue_size of 10000"
[22:10:35] whereas ours is set to 6, queue size of 1000
[22:10:58] There's probably a very good reason for that, but we still might want to look into it
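A sketch of how the thread pool pressure and the by-scroll traffic discussed above could be inspected, using standard Elasticsearch 7.x APIs (`_cat/thread_pool` and `_tasks`); the endpoint URL below is a placeholder, not a production host.

```python
# Sketch: check write/search thread pool pressure and list running
# update_by_query / delete_by_query / reindex tasks (the operations that
# produce BulkByScrollResponse). The endpoint is a placeholder.
import requests

ES = "http://localhost:9200"  # placeholder; point at any cluster node

# Per-node thread pool stats: active threads, queue depth, rejection counts.
pools = requests.get(
    f"{ES}/_cat/thread_pool/write,search",
    params={"v": "true", "h": "node_name,name,active,queue,rejected"},
)
print(pools.text)

# Currently running by-scroll and reindex tasks, with their descriptions,
# to help trace where the bulk update traffic is coming from.
tasks = requests.get(
    f"{ES}/_tasks",
    params={"detailed": "true", "actions": "*byquery*,*reindex*"},
)
for node in tasks.json().get("nodes", {}).values():
    for task_id, task in node.get("tasks", {}).items():
        print(task_id, task.get("action"), task.get("description", ""))
```

If experimenting with a larger queue, keep in mind that thread pool settings such as `thread_pool.write.queue_size` are static node settings in 7.x (here coming from the cirrus.yaml hieradata linked at 20:46:36), so a change would have to be rolled out with node restarts.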