[07:01:38] sigh... p95 latencies are still quite bad even with morelike out of codfw :/
[07:03:31] if we think this is related to T402048, we can still rollback!
[07:03:32] T402048: Enable empty search recommendations to English Wikipedia - https://phabricator.wikimedia.org/T402048
[07:03:52] Even if it is only temporary, to validate that it has no impact.
[07:04:59] Oh, latencies of full text search are still pretty bad, and MoreLike is on a different cluster. So rolling back T402048 is unlikely to help
[07:05:02] gehel: can't be related since codfw (where we see latency issues) is not serving these recommendations
[07:05:12] yep, I just realized that :)
[07:05:12] yes :/
[07:06:42] I'm leaning towards suboptimal shard allocation (https://grafana-rw.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=now-12h&to=now&timezone=utc&var-site=codfw&var-cluster=elasticsearch&var-instance=$__all)
[07:07:58] with some hosts close to 90% cpu utilization
[07:09:34] we could restart some of those busier nodes
[07:12:14] gehel: Sorry for the delay, I am around now.
[07:12:38] perhaps, will try to get a list of those and take a snapshot of the shard placement
[07:43:56] also perhaps time to drop these old glent indices...
[07:52:35] deleted them, will cause some shard movement I guess
[08:35:51] we dropped: ~700 searches on enwiki vs 700k searches on all other wikis in the last 15h
[08:36:09] something's not right in the poolcounter settings
[08:36:51] 1000x fewer drops on enwiki? which is still like half our search traffic?
[08:38:23] yes... tho not sure when we last re-evaluated this 50% split between enwiki vs the rest
[08:38:54] whatever the split, 1000x fewer for enwiki is very unexpected
[08:39:04] true
[08:39:27] wikidata being the most affect at 155k queries dropped in 15h
[08:39:33] *affected
[08:41:21] isn't the pool counter global to all wikis? I would expect a similar ratio of dropped queries on all wikis, even if queries on some wikis are more expensive.
[09:11:03] we have a dedicated pool for enwiki, that's why
[09:11:11] but seems like it's rather unfair
[09:17:20] Oh, that makes more sense.
[10:00:06] lunch
[12:25:45] hawwiki gets 5x more full_text searches than frwiki
[12:26:18] and our assumptions about the 50/50 split between enwiki and the rest seem completely wrong
[12:59:34] hm, not sure we can trust request_time_ms in cirrus backend logs... some p99s are like 3,165,155ms
[12:59:53] well I hope it's a glitch
[13:32:35] \o
[13:36:16] indeed the latencies are still off :S
[13:38:24] we could maybe turn down the ltr rescore window; if that is effective we could ponder having a lower default (with some flag/profile that turns it back up to "normal") on api requests, preferring to spend compute on interactive requests
[13:38:32] that might also not do much though :P unsure
[13:46:33] also was pondering IO since that's what i was thinking yesterday... but on balance it's probably not io. I think i just like that answer because it was a problem in the past :P But looking at things, this feels more like cpu saturation than io saturation
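[Editor's note: a minimal sketch of what the 13:38:24 idea of "turning down the ltr rescore window" amounts to at the query level. The field names, weights and window sizes below are illustrative assumptions; CirrusSearch actually assembles the real query through its rescore profiles and the LTR plugin, not hand-written bodies like this.]

```python
# Sketch, not CirrusSearch code: in a standard Elasticsearch rescore the
# expensive (LTR) ranking query only runs on the top `window_size` hits per
# shard, so shrinking the window trades ranking depth for CPU.
def fulltext_search_body(query_string: str, window_size: int) -> dict:
    return {
        "query": {"match": {"text": query_string}},  # cheap first-pass retrieval
        "rescore": {
            "window_size": window_size,              # large for interactive, smaller for api/bot traffic
            "query": {
                "query_weight": 1.0,
                "rescore_query_weight": 10.0,
                # in production this slot holds the LTR model query
                "rescore_query": {"match_phrase": {"text": query_string}},
            },
        },
    }

# Hypothetical split: interactive users keep the full window, api requests a reduced one.
interactive = fulltext_search_body("example search", window_size=8192)
api_request = fulltext_search_body("example search", window_size=1024)
```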
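[Editor's note: and a rough sketch of the 07:12:38 plan to snapshot shard placement alongside the busy nodes, using the standard `_cat/shards` and `_cat/nodes` APIs. The endpoint URL and output path are assumptions.]

```python
# Snapshot which shards live on which nodes, plus per-node load, so a
# before/after comparison is possible once shards get moved around.
import json
import time
import requests

HOST = "https://search.svc.codfw.wmnet:9243"  # hypothetical endpoint

def snapshot_shard_placement(path: str) -> None:
    # one row per shard: index, shard number, primary/replica, size, node
    shards = requests.get(
        f"{HOST}/_cat/shards",
        params={"format": "json", "h": "index,shard,prirep,state,store,node"},
    ).json()
    # per-node cpu/load so busy nodes can be correlated with what they host
    nodes = requests.get(
        f"{HOST}/_cat/nodes",
        params={"format": "json", "h": "name,cpu,load_5m,heap.percent"},
    ).json()
    with open(path, "w") as f:
        json.dump({"taken_at": time.time(), "shards": shards, "nodes": nodes}, f)

snapshot_shard_placement("shard-placement-before.json")
```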
[14:01:44] shards were reshuffled a bit this morning after removing old glent indices, but probably not enough
[14:01:58] I still get the same ~10 nodes with high cpu usage
[14:02:36] https://grafana-rw.wikimedia.org/explore?schemaVersion=1&panes=%7B%22k3k%22%3A%7B%22datasource%22%3A%22000000005%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22instance_mode%3Acpu%3Arate5m%7Binstance%3D%7E%5C%22cirrussearch.*%5C%22%2Cmode%3D%5C%22user%5C%22%7D+%3E+0.7%22%2C%22range%22%3Atrue%2C%22instant%22%3Atrue%2C%22datasource%22%3A%7B%22type%22%3A%22prometheus%22%2
[14:02:38] C%22uid%22%3A%22000000005%22%7D%2C%22editorMode%22%3A%22code%22%2C%22legendFormat%22%3A%22%7B%7Binstance%7D%7D%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%22now-6h%22%2C%22to%22%3A%22now%22%7D%7D%7D&orgId=1
[14:02:42] too long :/
[14:03:47] pasting the query instead (codfw/ops): instance_mode:cpu:rate5m{instance=~"cirrussearch.*",mode="user"} > 0.7
[14:05:40] looking at percentiles of other wikis I see something suspicious with hawwiki/tumwiki/eewiki & test2wiki which have surprisingly high p50s compared to other wikis
[14:06:20] between 600 & 700ms for the p50, up to 3s and more for the p95
[14:07:41] for instance today hawwiki is the 6th wiki sorting backend logs on query_type="full_text" and summing the request_time_ms of the first elastic request
[14:08:43] tumwiki the 7th
[14:10:22] hmm, that is odd
[14:10:39] they're 2nd & 3rd sorting by query count, just behind enwiki
[14:10:54] still puzzled by all this
[14:11:45] i don't even know what hawwiki is :P
[14:12:05] it's small, 6k pages
[14:24:32] my brain is probably not functioning properly, but in https://phabricator.wikimedia.org/P83463
[14:26:00] how is it that hawwiki, which is top3 sorting by cnt, is only top6 in cumulative time while its percentiles are way worse than say ruwiki
[14:27:52] hmm
[14:28:38] dcausse: at first glance, that seems reasonable
[14:28:50] At what point do we need to communicate more broadly on our issue? We're dropping queries, so we're impacting users. But not a lot at the moment?
[14:28:55] dcausse: maybe try summing elasticsearch_requests.request_time_ms?
[14:29:22] dcausse: i guess we don't do the super cheap lookups in elastic anymore, but it's possible every request is failing over and hitting a second try or something?
[14:29:49] i guess they don't have language detection though, so it's not like they would fail over to an enwiki request
[14:30:45] ah stupid me, should be F.sum("first_request_time_ms").alias("cumulative_time"), not F.sum("request_time_ms").alias("cumulative_time"),
[14:31:05] I don't trust top-level request_time_ms, it has insane values
[14:31:31] interesting, i think we get that from mediawiki timing apis. no clue how it could be off :S
[14:31:54] gehel: I don't, I experienced some failures myself
[14:31:55] maybe the php 5.4 and 5.3 fallbacks? it's time to remove those anyways
[14:31:58] *I don't know
[14:32:16] yes the code relies on some globals I don't much about
[14:32:23] *know
[14:33:11] ok now sorting on first request cumulative time hawwiki is first, then tumwiki, then enwiki
[14:33:26] i don't know either... we are dropping something like 15-30 req/s, out of ~900 compsuggest and ~550 fulltext
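[Editor's note: a hedged reconstruction of the P83463-style aggregation being debugged above (14:24:32 through 14:33:11). The chat only confirms query_type, request_time_ms, elasticsearch_requests and the first_request_time_ms fix; the event table name and the exact schema of the backend log are assumptions.]

```python
# Per-wiki full_text load from cirrus backend logs: count, cumulative time of
# the *first* elastic request (per the 14:30:45 fix), and latency percentiles.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

logs = (
    spark.read.table("event.mediawiki_cirrussearch_request")  # assumed table name
    .where(F.col("query_type") == "full_text")
    # assumes elasticsearch_requests is an array of structs with request_time_ms
    .withColumn(
        "first_request_time_ms",
        F.col("elasticsearch_requests").getItem(0).getField("request_time_ms"),
    )
)

per_wiki = logs.groupBy("wiki").agg(
    F.count(F.lit(1)).alias("cnt"),
    # not the top-level request_time_ms, which has implausible values
    F.sum("first_request_time_ms").alias("cumulative_time"),
    F.expr("percentile_approx(first_request_time_ms, 0.5)").alias("p50"),
    F.expr("percentile_approx(first_request_time_ms, 0.95)").alias("p95"),
)

per_wiki.orderBy(F.desc("cnt")).show(10)              # hawwiki/tumwiki near the top
per_wiki.orderBy(F.desc("cumulative_time")).show(10)  # ranking that changed after the fix
```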
[14:34:38] compsuggest are likely to go unnoticed, fulltext you get an error
[14:34:38] i suspect we could cut the LTR window and free up some compute, but not sure how much
[14:34:57] anyways i gotta do a school run real quick, back in ~20
[14:46:36] bah, hawwiki requests are a bot faking a UA
[14:47:05] params are like: offset -> 4550, profile -> all, title -> Papa_nui:Huli, limit -> 20, ns0 -> 1, search -> REDACTED
[14:47:49] and this hits commons
[14:57:27] sigh, ok that explains something. Do we send the offset to commons too?
[14:57:35] seems like we shouldn't paginate secondary searches
[14:57:35] 1310664 searches with no cookies and an offset requested
[14:57:48] vs a couple thousand with a cookie
[14:57:59] it's not secondary searches sadly
[14:58:08] oh, it's NS_FILE?
[14:58:12] yes...
[14:58:33] hmm... do we poolcounter pagination separately, into the expensive bucket with regex?
[14:58:39] maybe pagination > result 250 or something
[14:58:42] I guess bots now target small wikis hoping they're less visible?
[14:58:54] ah possibly
[14:59:42] checking if it's the same for other "small" wikis with surprisingly high request counts
[14:59:45] shouldn't be too hard, i could work that up today
[17:32:49] cirrussearch full text latency alert going off
[17:32:59] cooking food rn but will investigate in 10'
[17:33:31] ryankemper: it's been going off since the switchover, we think it's https://phabricator.wikimedia.org/T405482
[17:33:47] i'm working up a patch today to move those into either the automated of expensivefulltext pool counters
[17:33:59] s/of/or/
[17:34:56] the short version is, someone is sending us 70 requests per second that hit commonswiki_file and ask for pagination (which is more expensive)
[17:35:05] so most expensive index, and queries more expensive than normal
[17:35:39] unfortunately they seem to be coming through either a bot or proxy network, peak requests per ip was like 150/day
[17:36:00] ebernhardson: are the requests being made via POST, or other?
[17:36:34] cdanis: hmm, not sure but probably GET requests, at least that's how search is usually submitted
[17:37:05] if there are distinctive patterns in them as compared to 'normal' kinds of traffic, we could potentially write a requestctl rule in the interim
[17:38:15] MW action API?
[17:38:22] cdanis: the heuristic i'm implementing is requests that hit special:search on web (not sure if requestctl can know, because it gets translated to local language), have no cookies (bot signal, because every normal browser gets WMF-Last-Access), and ask for offset > 0
[17:39:01] i think i'm also going to limit it to requests that hit commonswiki_file, but again requestctl would have a hard time knowing that.
[17:39:30] (all wikis query commonswiki_file if they have the right parameters, because mediawiki does an "instant commons" thing where commons files are usable and look local on all wikis)
[17:39:54] right
[18:19:15] * ebernhardson is not thinking of a great generic way to add this heuristic... it might just end up being wmf specific code
[18:41:27] although on further thought... maybe everything with a large offset+limit should go into the expensive bucket regardless
[18:41:34] they are, as described, expensive
[18:53:24] interesting theory, i found some mention that you can flag invalid user agents by looking at associated headers. For example a request claiming a chrome user agent but not submitting Sec-Ch-Ua is probably not chrome
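[Editor's note: a language-neutral sketch of the routing heuristic described at 17:38:22 and broadened at 18:41:27; the actual patch lives in CirrusSearch (PHP). The pool names and the 250 threshold (taken from the 14:58:39 suggestion) are illustrative, not the real configuration keys.]

```python
# Decision logic only: route deep-pagination and cookieless commons-file
# searches into a stricter pool counter bucket.
from dataclasses import dataclass, field

@dataclass
class SearchRequest:
    offset: int
    limit: int
    cookies: dict = field(default_factory=dict)
    hits_commonswiki_file: bool = False  # cross-wiki NS_FILE search against commons

def choose_pool(req: SearchRequest) -> str:
    # No cookies is a strong bot signal: every normal browser carries
    # WMF-Last-Access after its first response.
    looks_automated = not req.cookies
    # Deep pagination is expensive regardless of who asks (18:41:27).
    deep_pagination = req.offset + req.limit > 250  # threshold is an assumption
    if deep_pagination or (looks_automated and req.offset > 0 and req.hits_commonswiki_file):
        return "expensive-fulltext-pool"  # smaller, stricter pool
    if looks_automated:
        return "automated-pool"
    return "fulltext-pool"

# e.g. the bot traffic seen via hawwiki: offset=4550, no cookies, hits commons
print(choose_pool(SearchRequest(offset=4550, limit=20, hits_commonswiki_file=True)))
```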
[18:53:45] (probably not getting into that today though, but interesting)
[19:24:47] * ebernhardson realizes he has no great way to test this function :P
[20:07:38] That X-Is-Browser thing Traffic is working on looks great, will want to keep an eye on that and integrate it into the poolcounter selection when available: https://docs.google.com/document/d/1HX5Gan19il5jLZjUmTxhVjR-Dbb-TWcKYoGTDxelvuo/edit?tab=t.0#heading=h.rpk5i4ous8vg
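[Editor's note: a tiny illustration of the header-consistency idea from 18:53:24. Chromium-based browsers send the Sec-CH-UA client-hint header, so a Chrome User-Agent without it is a reasonable spoofing signal. This is only the general idea, not how the X-Is-Browser work is specified.]

```python
# Flag requests whose User-Agent claims a Chromium-based browser but which
# lack the client hints those browsers always send.
def ua_looks_spoofed(headers: dict) -> bool:
    ua = headers.get("User-Agent", "")
    claims_chromium = any(token in ua for token in ("Chrome/", "Chromium/", "Edg/"))
    has_client_hints = "Sec-CH-UA" in headers
    return claims_chromium and not has_client_hints

# Chrome UA without client hints -> likely spoofed
print(ua_looks_spoofed({"User-Agent": "Mozilla/5.0 ... Chrome/119.0 Safari/537.36"}))  # True
# Same UA with Sec-CH-UA present -> looks like a real browser
print(ua_looks_spoofed({
    "User-Agent": "Mozilla/5.0 ... Chrome/119.0 Safari/537.36",
    "Sec-CH-UA": '"Google Chrome";v="119", "Chromium";v="119"',
}))  # False
```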