[11:36:43] lunch
[14:22:53] o/
[14:29:27] inflatador: there are some WDQS issues that are paging. Can you follow up in -operations or in #wikimedia-sre ? (cc: brouberol / stevemunene, just in case)
[14:30:14] gehel ACK, checking...will follow up in #SRE
[14:35:11] Trey314159 gehel are we doing retro today?
[15:01:05] gmodena & gehel: I'm happy enough to skip the retro.
[15:26:04] +1
[16:00:08] The consensus seems to be no retro
[16:00:32] ack
[16:04:13] Sorry for late reply, I was in another retrospective :)
[16:04:32] Ok, let's cancel retro for today. We'll do the next one with David and Erik!
[16:04:40] Sounds good
[16:19:32] Trey314159 thanks for the review on the articlecountry patch
[16:19:55] and good point re multi-language support for the terms.
[16:19:59] gmodena: happy to help!
[16:27:59] break/workout, back in ~40
[17:02:39] dinner+kids
[17:09:37] back
[17:50:09] ryankemper FYI, we had a WDQS incident this morning, report here: https://wikitech.wikimedia.org/wiki/Incidents/2025-02-27_wdqs_500_errors
[17:55:29] gmodena: I'm going to comment on the A/B tests on T385972. If that's not the right place, I can move my comments to the right place—just let me know. (It'll be a while yet, I have to write down my thoughts.)
[17:55:30] T385972: Deploy and test new MLR models - https://phabricator.wikimedia.org/T385972
[17:57:23] inflatador: thanks a lot for the report!
[18:37:51] Trey314159 thanks! T385972 looks like a good place.
[18:37:51] T385972: Deploy and test new MLR models - https://phabricator.wikimedia.org/T385972
[18:42:49] Trey314159 I'm looking forward to reading your take! So far my analysis has been limited to verifying that we did not hit any major regression (control vs 2025-02 and control vs the previous November A/B test)
[18:44:06] but I lack the expertise to understand/explain nuances :)
[18:52:01] It can also be hard to distinguish nuance from noise. :(
[18:52:59] eh
[18:53:45] i've been hacking on plots today in an attempt to make them more readable, but so far i think i made them worse :|
[18:57:11] lunch, back in ~40
[19:07:05] time to log out! Have fun!
[19:15:45] * gmodena waves good night
[19:44:54] I'm not feeling too good...going to take the rest of the day off
[19:45:50] ...as soon as I look into that Elastic alert
[19:46:13] ryankemper heads up on https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh ... I'm checking it out now
[19:47:54] gmodena looks like we're seeing the latency spike up again, ref https://grafana.wikimedia.org/goto/g7C2IStHR?orgId=1
[19:53:42] don't have time to look closely now, but it looks like we are getting some plugin-related errors in Relforge: https://logstash.wikimedia.org/goto/1aefbb8c972b3888d0cef4e672ecd0f3 Probably should investigate before we migrate cloudelastic
[20:10:04] taking a brief look at the latency spike before i have to take the dog out
[20:18:56] https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1&from=1740682735860&to=1740685660159&viewPanel=53 this graph is very telling. threadpool practically empty and then it sprints up to 400 and then 700 within a couple minutes
[20:19:28] this definitely would line up with a job being run causing the issue
[20:22:46] ryankemper nice find! I thought we stopped running those AB tests, though. I wonder what other jobs might be involved
[20:23:14] I'd note that the p50 seems completely fine as well. That makes me think that it's the job-related requests themselves taking far too long (~seconds) to execute, as opposed to the whole eqiad cluster getting backed up in a way that impacts all requests roughly equally
[20:23:37] inflatador: yeah, that's where i'm lacking context. i don't know what actual type of job this would be if those ab tests aren't enabled
[20:25:13] ryankemper I'll ask in Slack and CC you. in the meantime I'll update T387176 with your findings
[20:25:13] T387176: Investigate eqiad Elastic cluster latency - https://phabricator.wikimedia.org/T387176
[20:26:00] we're also getting pool counter rejections https://grafana.wikimedia.org/goto/3ZwBvItHg?orgId=1
[20:26:49] oof. alright strike what i said about other requests likely not being affected then :)
[20:29:30] that actually would explain why i didn't see qps significantly rising despite the increase in the threadpool. we're seeing qps on the elastic side but not the total requests hitting mediawiki in the first place
[20:32:12] https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?orgId=1&from=1740679810987&to=1740688191286 actually if i sum the rejections and successful requests with my eyeballs it still doesn't seem like a big increase in total query count. curious
[20:32:27] I'm not seeing anything too exciting in Logstash so far https://logstash.wikimedia.org/goto/8804bb485936682eac7cd52f2921a3ed
[20:34:57] So the log entries there for queries taking multiple seconds are `BulkByScrollResponse`, which is apparently `Response used for actions that index many documents using a scroll request`
[20:35:42] That language confuses me somewhat, I'm familiar with scroll requests as a way to paginate a query response with many matching documents...but why does it say it *indexes* many documents? index to me sounds like a write, not a read
[20:36:17] it might just be a way they use the word index that i'm unfamiliar with tho, i.e. maybe index here means "return a document matching the request"
[20:37:14] I could also be barking up the wrong tree entirely, kibana is hard to make sense of sometimes :)
[20:38:18] pool counter rejections have recovered btw
[20:40:05] For later, here's a view of all the eqiad p95 spikes. we had some initial very small spikes a few weeks ago, and then a few bigger incidents like this one across the last few weeks https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1&from=1739457916360&to=1740688645250&viewPanel=19
[20:41:24] p95 has recovered now. I think whatever was bogging us down has ceased.
[20:44:25] We'll have david back on monday & erik on tuesday, so they should be able to help us figure this out. last note: the p95 issues across the last few weeks seem to start around when https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1118785 was merged; it's not clear though why the issue would still be present with that change having been reverted on feb 24 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1120534
[20:44:49] unless the mlr stuff gets scheduled in advance, so when we revert it it takes a week or so for the revert to propagate. seems a bit farfetched tho
[20:45:00] anyway search cluster's in a good state for now, going to take the dog out
[20:45:27] I gotta go too, taking the rest of the day off
[22:33:08] gmodena: I read.. okay, I skimmed all the reports (and read a couple more closely).. notes are on Phab: https://phabricator.wikimedia.org/T385972#10589233
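
A note on the `BulkByScrollResponse` question at [20:34:57]: in Elasticsearch that response class belongs to the update-by-query, delete-by-query, and reindex APIs, which scroll through the matching documents internally and then bulk-index (write) them back, which is why its description reads "index many documents using a scroll request" even when no client issued an explicit write. The sketch below contrasts the read-side scroll with a by-query call that yields a BulkByScrollResponse; the endpoint, index name, and queries are placeholders, not the actual jobs behind the spike.

# Minimal sketch, assuming a reachable Elasticsearch cluster; host, index, and
# field names are placeholders, not the production configuration.
import requests

ES = "http://localhost:9200"   # placeholder endpoint
INDEX = "my_index"             # placeholder index

# 1) Plain scroll: the read-side use, paginating a large result set.
page = requests.post(
    f"{ES}/{INDEX}/_search",
    params={"scroll": "1m"},
    json={"size": 1000, "query": {"match_all": {}}},
).json()
scroll_id = page["_scroll_id"]  # feed to POST /_search/scroll for the next pages

# 2) _update_by_query (like _delete_by_query and _reindex) also scrolls internally,
#    but then bulk-indexes the matching documents back, i.e. it writes. Its Java
#    response class is BulkByScrollResponse, so a slow "read-looking" entry can
#    show up in the logs under that name.
result = requests.post(
    f"{ES}/{INDEX}/_update_by_query",
    params={"conflicts": "proceed"},
    json={"query": {"term": {"some_field": "some_value"}}},
).json()
print(result.get("took"), result.get("updated"), result.get("batches"))

# 3) Long-running by-query jobs are visible in the task management API, which can
#    help tie a threadpool or latency spike to a specific background job.
tasks = requests.get(
    f"{ES}/_tasks", params={"detailed": "true", "actions": "*byquery"}
).json()
print(list(tasks.get("nodes", {}).keys()))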