[11:36:43] lunch
[14:22:53] o/
[14:29:27] inflatador: there are some WDQS issues that are paging. Can you follow up in -operations or in #wikimedia-sre ? (cc: brouberol / stevemunene, just in case)
[14:30:14] gehel ACK, checking...will follow up in #SRE
[14:35:11] Trey314159 gehel are we doing retro today?
[15:01:05] gmodena & gehel: I'm happy enough to skip the retro.
[15:26:04] +1
[16:00:08] The consensus seems to be no retro
[16:00:32] ack
[16:04:13] Sorry for late reply, I was in another retrospective :)
[16:04:32] Ok, let's cancel retro for today. We'll do the next one with David and Erik!
[16:04:40] Sounds good
[16:19:32] Trey314159 thanks for the review on the articlecountry patch
[16:19:55] and good point re multi-language support for the terms.
[16:19:59] gmodena: happy to help!
[16:27:59] break/workout, back in ~40
[17:02:39] dinner+kids
[17:09:37] back
[17:50:09] ryankemper FYI, we had a WDQS incident this morning, report here: https://wikitech.wikimedia.org/wiki/Incidents/2025-02-27_wdqs_500_errors
[17:55:29] gmodena: I'm going to comment on the A/B tests on T385972. If that's not the right place, I can move my comments to the right place—just let me know. (It'll be a while yet, I have to write down my thoughts.)
[17:55:30] T385972: Deploy and test new MLR models - https://phabricator.wikimedia.org/T385972
[17:57:23] inflatador: thanks a lot for the report!
[18:37:51] Trey314159 thanks! T385972 looks like a good place.
[18:37:51] T385972: Deploy and test new MLR models - https://phabricator.wikimedia.org/T385972
[18:42:49] Trey314159 I'm looking forward to reading your take! So far my analysis has been limited to verifying that we did not hit any major regression (control vs 2025-02 and control vs the previous November A/B test)
[18:44:06] but I lack the expertise to understand/explain nuances :)
[18:52:01] It can also be hard to distinguish nuance from noise. :(
[18:52:59] eh
[18:53:45] i've been hacking on plots today in an attempt to make them more readable, but so far i think i made them worse :|
[18:57:11] lunch, back in ~40
[19:07:05] time to log out! Have fun!
[19:15:45] * gmodena waves good night
[19:44:54] I'm not feeling too good...going to take the rest of the day off
[19:45:50] ...as soon as I look into that Elastic alert
[19:46:13] ryankemper heads up on https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh ... I'm checking it out now
[19:47:54] gmodena looks like we're seeing the latency spike up again, ref https://grafana.wikimedia.org/goto/g7C2IStHR?orgId=1
[19:53:42] don't have time to look closely now, but it looks like we are getting some plugin-related errors in Relforge: https://logstash.wikimedia.org/goto/1aefbb8c972b3888d0cef4e672ecd0f3 Probably should investigate before we migrate cloudelastic
[20:10:04] taking a brief look at the latency spike before i have to take the dog out
[20:18:56] https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1&from=1740682735860&to=1740685660159&viewPanel=53 this graph is very telling. threadpool practically empty and then it sprints up to 400 and then 700 within a couple minutes
[20:19:28] this definitely would line up with a job being run causing the issue
[20:22:46] ryankemper nice find! I thought we stopped running those AB tests, though. I wonder what other jobs might be involved
[20:23:14] I'd note that the p50 seems completely fine as well. That makes me think that it's the job-related requests themselves taking far too long (~seconds) to execute, as opposed to the whole eqiad cluster getting backed up in a way that impacts all requests roughly equally
[20:23:37] inflatador: yeah, that's where i'm lacking context. i don't know what actual type of job this would be if those ab tests aren't enabled
[20:25:13] ryankemper I'll ask in Slack and CC you. in the meantime I'll update T387176 with your findings
[20:25:13] T387176: Investigate eqiad Elastic cluster latency - https://phabricator.wikimedia.org/T387176
[20:26:00] we're also getting pool counter rejections https://grafana.wikimedia.org/goto/3ZwBvItHg?orgId=1
[20:26:49] oof. alright strike what i said about other requests likely not being affected then :)
[20:29:30] that actually would explain why i didn't see qps significantly rising despite the increase in the threadpool. we're seeing qps on the elastic side but not the total requests hitting mediawiki in the first place
[20:32:12] https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?orgId=1&from=1740679810987&to=1740688191286 actually if i sum the rejections and successful requests with my eyeballs it still doesn't seem like a big increase in total query count. curious
[20:32:27] I'm not seeing anything too exciting in Logstash so far https://logstash.wikimedia.org/goto/8804bb485936682eac7cd52f2921a3ed
[20:34:57] So the log entries there for queries taking multiple seconds are `BulkByScrollResponse`, which is apparently `Response used for actions that index many documents using a scroll request`
[20:35:42] That language confuses me somewhat, I'm familiar with scroll requests as a way to paginate a query response with many matching documents...but why does it say it *indexes* many documents? index to me sounds like a write, not a read
[20:36:17] it might just be a way they use the word index that i'm unfamiliar with tho, i.e. maybe index here means "return a document matching the request"
[20:37:14] I could also be barking up the wrong tree entirely, kibana is hard to make sense of sometimes :)
[20:38:18] pool counter rejections have recovered btw
[20:40:05] For later, here's a view of all the eqiad p95 spikes. we had some initial very small spikes a few weeks ago, and then a few bigger incidents like this one across the last few weeks https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1&from=1739457916360&to=1740688645250&viewPanel=19
[20:41:24] p95 has recovered now. I think whatever was bogging us down has ceased.
[20:44:25] We'll have david back on monday & erik on tuesday, so they should be able to help us figure this out. last note: the p95 issues across the last few weeks seem to start around when https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1118785 was merged; it's not clear though why the issue would still be present with that change having been reverted on feb 24 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1120534
[20:44:49] unless the mlr stuff gets scheduled in advance, so when we revert it it takes a week or so for the revert to propagate. seems a bit farfetched tho
[20:45:00] anyway search cluster's in a good state for now, going to take the dog out
[20:45:27] I gotta go too, taking the rest of the day off
[22:33:08] gmodena: I read.. okay, I skimmed all the reports (and read a couple more closely).. notes are on Phab: https://phabricator.wikimedia.org/T385972#10589233
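
A note on the `BulkByScrollResponse` question at [20:34:57]: in Elasticsearch that response class belongs to the update-by-query, delete-by-query, and reindex APIs, which scroll through the matching documents internally and then bulk-index (write) them back, which is why its description reads "index many documents using a scroll request" even when no client issued an explicit write. The sketch below contrasts the read-side scroll with a by-query call that yields a BulkByScrollResponse; the endpoint, index name, and queries are placeholders, not the actual jobs behind the spike.

# Minimal sketch, assuming a reachable Elasticsearch cluster; host, index, and
# field names are placeholders, not the production configuration.
import requests

ES = "http://localhost:9200"   # placeholder endpoint
INDEX = "my_index"             # placeholder index

# 1) Plain scroll: the read-side use, paginating a large result set.
page = requests.post(
    f"{ES}/{INDEX}/_search",
    params={"scroll": "1m"},
    json={"size": 1000, "query": {"match_all": {}}},
).json()
scroll_id = page["_scroll_id"]  # feed to POST /_search/scroll for the next pages

# 2) _update_by_query (like _delete_by_query and _reindex) also scrolls internally,
#    but then bulk-indexes the matching documents back, i.e. it writes. Its Java
#    response class is BulkByScrollResponse, so a slow "read-looking" entry can
#    show up in the logs under that name.
result = requests.post(
    f"{ES}/{INDEX}/_update_by_query",
    params={"conflicts": "proceed"},
    json={"query": {"term": {"some_field": "some_value"}}},
).json()
print(result.get("took"), result.get("updated"), result.get("batches"))

# 3) Long-running by-query jobs are visible in the task management API, which can
#    help tie a threadpool or latency spike to a specific background job.
tasks = requests.get(
    f"{ES}/_tasks", params={"detailed": "true", "actions": "*byquery"}
).json()
print(list(tasks.get("nodes", {}).keys()))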