[13:42:08] bpirkle: regarding Kask, looking for perf insight. Is there a Grafana dash yet with e.g. req rates, response codes, latency, etc.?
[13:43:18] Krinkle: https://grafana.wikimedia.org/d/000001590/sessionstore?refresh=1m&orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=sessionstore and https://grafana.wikimedia.org/d/000000418/cassandra?panelId=1&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fservices&var-cluster=sessionstore&var-keyspace=sessions&var-table=values&var-quantile=99p&from=1564581156297&to=1564667556297
[13:43:39] Eric Evans shared those. There may be more that he's looking at that I'm unaware of.
[13:44:08] Thus far, Kask is just being used for testwiki, but next week we plan to start rolling it out to a wider set of wikis.
[13:51:43] bpirkle: ok, making a few minor edits to the first one. Checking now if there are any metrics missing.
[14:23:29] bpirkle: I've added a row that displays the latency using the buckets from Prometheus (the percentile graph wasn't very meaningful because we don't have the information to generate percentiles accurately).
[14:23:29] https://grafana.wikimedia.org/d/000001590/sessionstore?refresh=1m&orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=sessionstore&var-method=All&from=1564749883897&to=1564757083897
[14:23:54] grouping and colors based on T211721
[14:23:55] T211721: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721
[14:24:06] Krinkle: 👍
[14:24:14] urandom: fyi
[14:24:22] assuming this doesn't include network latency, looks like those numbers aren't as low as we expected?
[14:24:38] but we'll see I guess, maybe the test load isn't representative
[14:55:52] Krinkle: no, they aren't
[14:56:48] Krinkle: what's noteworthy is that about half of them are in single-digit milliseconds, and the other half in tens of milliseconds
[14:57:49] the distribution makes me think there is something that adds ~20-30ms to about half the requests
[14:58:04] * Krinkle splits "over 5ms" into "5-10" and "over 10"
[14:58:06] for GET
[15:02:03] right
[15:02:03] https://grafana.wikimedia.org/d/000001590/sessionstore?panelId=47&fullscreen&orgId=1
[15:02:05] Yeah
[15:02:32] It's about 50% < 2.5 ms and 50% over 10 ms; basically nothing in between.
[15:03:15] yeah, that is umm... known
[15:04:16] Krinkle: T221292
[15:04:17] T221292: Establish performance of the session storage service - https://phabricator.wikimedia.org/T221292
[15:05:46] Krinkle: it was decided at that time to defer this issue to a later time and avoid prematurely optimizing
[15:06:06] I've been meaning to raise whether now might be the time to consider revisiting
[16:29:28] urandom: I'd be curious to see what the p99 looks like over time. E.g. not for the whole dataset, but for each segment of 10 seconds or per minute, and then run for 15 minutes. If the p99 is like that for every segment of 10 seconds, that'd be pretty worrying, I think.
[16:30:20] Krinkle: not sure I follow
[16:30:59] anything lower than p99 isn't of interest to me at such a small scale, because that 1% is all users all the time, assuming even distribution, and not something unique or wrong with the request causing the slowdown.
[16:31:56] The way it scales up is also suspicious to me. Unless the lowest concurrency/size setting already saturated the CPU or network, why does it get slower immediately?
[16:32:18] 30 ms is a pretty significant amount of time for something so trivial.
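To illustrate the per-segment p99 Krinkle asks for above, here is a minimal Python sketch; the latency samples are invented to mimic the roughly half-fast/half-slow split seen on the GET panel, and none of the numbers come from the actual dashboards:

```python
import random
from statistics import quantiles

# Invented GET latency samples (seconds), one per second for 15 minutes,
# mimicking the ~50% < 2.5 ms / ~50% > 10 ms split described above.
random.seed(42)
samples = [
    random.uniform(0.001, 0.0025) if random.random() < 0.5
    else random.uniform(0.010, 0.035)
    for _ in range(15 * 60)
]

WINDOW = 10  # seconds per segment

# p99 per segment rather than over the whole dataset: if every segment
# shows a similarly high p99, the slow tail is systemic and hits all
# users, rather than being caused by occasional outlier requests.
for start in range(0, len(samples), WINDOW):
    segment = samples[start:start + WINDOW]
    p99 = quantiles(segment, n=100)[98]  # crude with 10 samples, roughly the segment max
    print(f"{start:>4}s  p99 = {p99 * 1000:.1f} ms")
```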
[16:32:53] I feel like I'm missing some context
[16:34:31] This is for retrieving the session object in MW, for every web request a user makes. So 1 or 2 roundtrips of this is the minimum before anything meaningful has happened yet. That means the minimum response time for MW is bounded by 30-300 ms, depending on what the typical concurrency/size will be, and on top of that come 1 or 2 DB roundtrips to get MW started, and other PHP stuff.
[16:35:18] And then comes the actual thing the user wanted to do (read a page, query the history, save an edit, perform a search query, etc.) and its latency.
[16:36:06] The current status quo, which I have not confirmed (sorry), is that Redis typically responds in 1 ms, per the Phab task. If that's true, then that's a very major shift.
[16:36:07] I mean, what do you mean wrt "scales up" and "already get slower immediately"? It hasn't been scaled up yet afaik
[16:36:22] Krinkle: oh, it is
[16:36:30] urandom: T221292 shows that p99 goes from 39 to 300 ms with higher concurrency and size
[16:36:31] T221292: Establish performance of the session storage service - https://phabricator.wikimedia.org/T221292
[16:36:50] oh, there's the context
[16:36:56] :)
[16:37:08] sorry, I went a bit fast there. Had something else in mind.
[16:37:40] that p99 goes up as it nears saturation though
[16:38:05] based on the numbers I've been given, we shouldn't be anywhere near that, either
[16:38:24] which one gets saturated first? kask network or cpu, or cassandra disk?
[16:38:25] of course... we're already above that in production
[16:38:32] CPU
[16:38:41] interesting.
[16:39:05] Yeah, when it's below saturation at the lowest concurrency/size, it shouldn't be taking 30 ms even at the max/p99.
[16:39:51] I also forgot to follow up on the original task where Evan wrote 1% error rate; that's unacceptable. Any service with a 1% error rate isn't worth putting in production (and indeed, it's nowhere near that, so we're good, but that definitely needs to be changed).
[16:40:45] Krinkle: that whole ticket is bunk
[16:41:18] it's based on numbers I gave originally, which were a troll of Ian
[16:41:41] right
[16:41:46] ah, I see the Redis dash now
[16:41:50] I didn't know we had that, cool.
[16:41:54] I was being told that the latency didn't matter, and so I was trying to elicit a response
[16:41:54] * Krinkle goes and adds latency buckets to that dash
[16:42:47] urandom: thanks, that makes sense (I support that method :) )
[16:43:39] btw, I wrote more about why I focus on p99 etc. and how it scales to users at https://phabricator.wikimedia.org/phame/post/view/83/measuring_wikipedia_page_load_times/ - maybe not for you, but it can help in trying to explain it to others :)
[16:46:02] hm.. no _buckets for the redis metric
[19:42:06] AaronSchulz & James_F, looks like to unbreak https://integration.wikimedia.org/ci/job/mwext-php72-phan-docker/5971/console we would need to document the return type for wfGetDB() as MaintainableDBConnRef
[19:45:53] OK; did that change recently?
[19:50:57] It used to be Database (the kitchen sink) instead of IDatabase. In order to be able to avoid duplicate connections in prod, we made it return a ConnRef, which required picking whether it would be a DBConnRef (regular queries) or MaintainableDBConnRef (maintenance scripts, installer, etc.)
[19:51:00] and picked the former.
[19:51:26] Ah, and that broke phan in anything that used it like this?
[19:52:35] It can only affect maintenance scripts, and specifically maintenance scripts that use a method that uses the very rare typehint of MaintainableDBConnRef
[19:52:39] which even CentralAuth doesn't use
[19:52:53] * James_F nods.
[19:53:03] but.. CentralAuth's maintenance script actually extends another maintenance script in core, which Aaron helpfully added the MaintainableDBConnRef typehint to.
[19:53:13] *that* is what upset Phan, as it now expects the subclass in CentralAuth to match it.
[19:53:15] which I missed in CR.
[19:53:21] Ah.
[19:53:42] dbarrat just fixed that
[19:53:48] * James_F nods.
[19:54:11] I didn't think to look for subclasses on a maintenance script, admittedly fairly rare, but noted to look out for :)
[22:31:17] davidwbarratt: sorry about the mess, should be good now.
[22:47:25] Krinkle no problem! thanks for your help! :)
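The kind of parent/subclass mismatch that upset Phan can be sketched as follows; this is a Python analogue with made-up class names, not MediaWiki's actual PHP code, but the rule being enforced is the same: an override may not widen the declared return type.

```python
class DBConnRef:
    """Stand-in for the regular connection-reference class."""


class MaintainableDBConnRef(DBConnRef):
    """Stand-in for the maintenance-capable subclass."""


class CoreMaintenanceScript:
    # The core script declares the narrower return type...
    def get_db(self) -> MaintainableDBConnRef:
        return MaintainableDBConnRef()


class ExtensionMaintenanceScript(CoreMaintenanceScript):
    # ...so an override that widens it back to DBConnRef is flagged by a
    # static checker (mypy here, analogous to Phan's complaint), because
    # code written against the parent may rely on the narrower type.
    def get_db(self) -> DBConnRef:  # error: incompatible return type in override
        return DBConnRef()
```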