[17:14:29] dinner
[17:33:01] hmm, i reworked my stats on natural language queries with a different approach...and got a completely different answer :P Not sure which way is wrong
[17:40:53] doh, no, the first round was better. I cast a value that was supposed to be a float in [0,1] into an int, and that wrecked everything
[17:41:02] all better now :)
[18:04:34] back, decided to delay lunch a little
[18:05:05] uh oh, getting latency alerts again
[18:06:18] from the graphs, latencies never really recovered. They've varied a bit but are still rough
[18:08:47] IO looks plenty reasonable, which suggests it's CPU limited. None of the hosts look completely maxed (we've seen servers sit at 100% for some time before when it was a bug), but several are at 80%+, and we know the second half of the cpu usage is worth less than the first half (hyperthreading)
[18:09:29] 2087 is the least happy, but not sure if dumping it would help anything
[18:09:51] still getting a large number of pool counter rejections from the search bucket too
[18:10:32] almost suggests our pool counter allows too much? Ideally the pool counter should reject before the cluster gets into a mode where it has reduced throughput due to load
[18:11:04] the problem is pool counter counts each query as equal, but a commonswiki query uses 30+ threads, and a small wiki query uses 1
[18:11:39] worst case, we can manually point morelike at the eqiad cluster until it stabilizes
[18:11:59] (with a mediawiki deploy, if we wanted to do that from etcd we would have to define a cluster per use case)
[18:12:32] I'm not sure how to decide if we should pull the trigger on that though :S
[18:15:58] wow, I thought we had at least some hosts with the performance governor turned on, but it looks like no. Zero hosts in codfw have the performance governor
[18:16:54] I'd be in favor of repooling eqiad at the moment, but does that get us anything if none of the traffic actually gets routed there?
[18:17:31] we have a way in cirrus to decide which cluster to send traffic to based on query tags, so we can tell it to send all more_like queries to eqiad
[18:17:47] working up a config patch
[18:17:58] cool, I'm putting together the performance governor ticket now
[18:20:10] inflatador: is there a ticket i should attach this to? maybe the switchover one?
[18:23:01] ebernhardson One sec, I'll get one started
[18:23:32] patch is reasonably simple: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1190737
[18:25:42] ebernhardson ACK, created T405394
[18:25:43] T405394: Point cirrussearch morelike queries to EQIAD - https://phabricator.wikimedia.org/T405394
[18:26:17] looks like you've already got a +1 from c-danis
[18:34:07] patch is rolling out now
[18:37:12] can see queries starting to arrive in eqiad
[18:38:45] oh wow, that was fast!
[18:39:05] I created T405396 and pinged DC Ops in their IRC room
[18:39:06] T405396: Re-enable performance governor on Cirrussearch hosts - https://phabricator.wikimedia.org/T405396
[18:39:47] patch is fully deployed. graphs lag a little, but can see the traffic swapping
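(Aside on the governor question above: on Linux the active frequency governor is exposed per core under /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor, so confirming the "zero hosts in codfw" observation is just a matter of reading those files on each host. A minimal sketch, assuming plain Python run per host; it is not the actual puppet change and the sysfs paths are the standard cpufreq ones:)

```python
#!/usr/bin/env python3
"""Report which CPU frequency governor each core on this host is using.

Reads the standard cpufreq sysfs files; run it on each host (e.g. over ssh)
to count how many cores are not on the 'performance' governor.
"""
from collections import Counter
from pathlib import Path


def governor_counts() -> Counter:
    counts = Counter()
    for gov_file in sorted(
        Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq/scaling_governor")
    ):
        counts[gov_file.read_text().strip()] += 1
    return counts


if __name__ == "__main__":
    counts = governor_counts()
    if not counts:
        print("no cpufreq sysfs entries found on this host")
    for governor, ncpus in counts.most_common():
        print(f"{governor}: {ncpus} cores")
```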
[18:41:52] looks like we've already shed load from 60% to 50% cpu in the codfw cluster
[18:42:47] fulltext qps rising, likely because pool counter has more space now
[18:44:54] * ebernhardson ponders if cirrusDumpQuery or cirrusDumpResult should report what cluster will be/was queried
[18:45:04] nice, I'm seeing the alerts clear now
[18:45:13] right now it's a bit indirect, getting the result dump and checking which cluster has the exact index name
[18:51:41] seems pool counter rejections are still going :S
[18:53:29] there are a few hosts that report 20-40% disk utilization, although the total throughput is low. 2062 spikes to 30% disk utilization, but throughput never climbs past 10MB/s. Something seems suspicious there
[19:30:29] I'm going to miss pairing yet again this week...sorry, in the past someone else was doing afternoon school runs but lately i've needed to
[19:38:12] NP, we can move earlier if that's better for you. I'm working on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1190756 ...DC Ops says we're OK to re-enable the governor on select hosts
[19:38:39] basically everything here https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131775 , I'm guessing they will eventually let us turn it on everywhere
[19:39:40] lunch
[19:59:44] latencies seem much more under control
[20:03:20] i would be curious to see if the madvise we've been doing for years now is still useful
[20:03:39] one suspicion is we are doing too many tiny io operations, and we've turned off read-ahead, which addresses that issue
[20:05:44] although watching `iostat -x 1` on 2062, it's less clear. %iowait stays very low
[20:31:06] i guess my thoughts: try setting readahead "properly" and disabling the madvise script, or try running the madvise script much more frequently (5m? right now it's 30m)
[20:36:21] also curious, we have `relatime` set on /srv
[20:36:35] that means ext4 still updates the access-time metadata when a file is read (with relatime only when the previous atime is older than the mtime/ctime or more than a day old, but it's still extra metadata writes)
[20:43:09] random suggestion that in some cases XFS does better: https://discuss.elastic.co/t/100-io-utilization-after-migrating-to-xfs-from-ext4/353404
[20:45:01] alternatively i also found someone saying (back in 2017): Ext4 filesystem is stable and just works. Avoid XFS like the plague, even though it's slightly faster than ext4 it's got bugs/deadlocks/livelocks which surface on very fast NVMe SSD disks under heavy load (our regular use case). Symptoms are the kernel just busy waiting on IO spinlocks
[20:47:21] (separate question of, since we care about IO, we should be considering NVMe disks instead of SSDs. But it's not clear to me it would make a difference with our low iops and throughput)
[20:47:40] i don't know how it is in the enterprise space, but for consumer disks the prices are pretty similar these days
[20:54:22] If you wanna try the madvise stuff LMK. I'm game to look at NVMe stuff too. I'm just looking at the PSI stuff and not seeing any resource contention (although admittedly I have not looked nearly as closely)
[20:56:23] How much of the increased latency (if any) could be explained purely by the network?
[20:56:24] NVMe would just be about future orders, a slow upgrade cycle. I guess i would say we should at least quote those on the next upgrade. I suspect the difference between SSD and NVMe might only be 5% or less of the total server price
[20:57:23] for the more_like queries being routed to eqiad, that's 30ms per round trip. I think we got that down to 1 round trip, but it's possible there are two
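(On the readahead and `relatime` questions above: both settings can be inspected directly on a host through standard Linux interfaces, which might help confirm what the 2017 investigation is being reused for. A minimal sketch; the device and mount point names below are examples only, not the real hosts' layout:)

```python
#!/usr/bin/env python3
"""Inspect block-device readahead and mount options on a host.

Uses the standard /sys/block and /proc/mounts interfaces; the device and
mount point names in the example are placeholders.
"""
from pathlib import Path


def readahead_kb(device: str) -> int:
    """Current readahead window for a block device, in KiB."""
    return int(Path(f"/sys/block/{device}/queue/read_ahead_kb").read_text())


def mount_options(mount_point: str) -> str:
    """Mount options for a given mount point, taken from /proc/mounts."""
    for line in Path("/proc/mounts").read_text().splitlines():
        fields = line.split()
        if fields[1] == mount_point:
            return fields[3]
    raise ValueError(f"{mount_point} not found in /proc/mounts")


if __name__ == "__main__":
    # Example device/mount point; adjust for the actual hosts.
    print("readahead:", readahead_kb("sda"), "KiB")
    print("/srv options:", mount_options("/srv"))
```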
[20:57:40] still, that's not a huge amount
[20:59:30] i suppose as the indicator of overload though, i'm looking at how consistent the latency numbers are. When things are running well the p50's are pretty flat, when things are having trouble the latencies vary a lot more
[21:00:18] (also most of the numbers are looking a lot better in the last hour)
[21:00:33] not perfect, but better
[21:00:46] at least there's that ;)
[21:01:46] i dunno...i gotta do a school run now but will ponder. Trying madvise every 5 minutes instead of every 30 is probably reasonable to test, but i'd like to see if i can reuse the stuff from the original 2017 investigation to see what readahead is happening on the servers
[21:02:02] basically get some proof that we are seeing mixed read-ahead
[21:02:04] obviously we have a resource shortage somewhere. I guess we need to put together a plan for benchmarking
[21:02:59] inflatador: 3’
[21:04:49] Nothing obvious seems to jump out, but I'm still gonna pull the string on the perf governor stuff
[21:04:53] ryankemper np, see ya then!
[21:05:03] working on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1190756
[21:48:12] back
[21:48:54] (some genius decided we should put a community college, a high school, and a middle school all on the same 1/4 mile stretch of road, with not a single stop light, just stop signs, around the area....so traffic is abysmal)
[21:49:31] adds up to just about 20k students in the area :S
[21:52:37] ouch
[21:53:00] we merged the perf governor patch, still confirming that it's applied correctly
[21:53:13] 22 out of the 55 hosts should have it enabled once it works
[21:53:27] nice
[21:57:38] yeah, it's applied. So I guess we hope latency gets better? You mentioned P50 is more jittery when the cluster is overloaded
[22:08:11] looking at the last hours, it's better but it's hard to say. QPS is also declining as we get into a less busy part of the day
[22:08:19] I suppose the telling part will be how it reacts with tomorrow's peak load
[22:10:09] yeah, I'm not seeing much of a signal either ;( . Also begs the question, what should we expect with only 40% of the fleet changed? I guess I should lay the groundwork for enabling the whole of CODFW
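(On the "p50s are flat when healthy, jittery when overloaded" point: one way to turn that into a single number to watch against tomorrow's peak is the relative spread of a window of p50 samples. A rough illustrative sketch; the sample values and any threshold you'd alert on are made up, not taken from the real dashboards:)

```python
"""Quantify how 'jittery' a series of p50 latency samples is.

Purely illustrative: the example values are placeholders, not numbers
pulled from the actual latency graphs.
"""
from statistics import mean, pstdev
from typing import Sequence


def p50_jitter(p50_samples: Sequence[float]) -> float:
    """Coefficient of variation of the p50 samples (stdev / mean).

    Near 0 when the p50 is flat; grows as latencies start bouncing around.
    """
    avg = mean(p50_samples)
    return pstdev(p50_samples) / avg if avg else 0.0


if __name__ == "__main__":
    quiet = [52, 54, 53, 55, 52, 53]      # ms, flat p50s during a healthy hour
    rough = [60, 110, 75, 160, 90, 140]   # ms, bouncing around under load
    print(f"quiet hour jitter: {p50_jitter(quiet):.2f}")
    print(f"rough hour jitter: {p50_jitter(rough):.2f}")
```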