[09:50:50] 10DBA, 10Patch-For-Review: Refactor transfer.py - https://phabricator.wikimedia.org/T252172 (10jcrespo) p:05Triage→03Medium Great job here! I looked at every line of the change, and tested it on several runs and it worked nicely. This change, I think, will make further development much easier. You did a lo...
[10:49:53] looking at pc1 after the migration, some things seem a bit odd
[10:49:59] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=pc1007&var-port=9104&from=now-7d&to=now
[10:50:12] the monitoring query latency has gone waaay up
[10:50:39] and the "Total InnoDB Memory" line has disappeared from the Memory Usage graph
[10:51:32] yeah, this is a known issue
[10:51:34] disk latency is also much worse
[10:51:37] ah, ok
[10:51:42] that is not ok, though
[10:51:46] let me see
[10:53:09] not only disk latency, latency in general is really bad
[10:53:54] ah, you're also looking at "InnoDB Wait Time"?
[10:55:19] what is the current pooling status?
[10:55:46] I am looking at monitoring latency time, plus other tools to look at latency
[10:55:46] i'm not sure what that means
[10:55:53] lots of connection errors
[10:56:16] which come from excessive latency
[10:56:37] what is the current pooling state?
[10:56:53] i don't know what that means
[10:56:56] which hosts are live?
[10:57:07] which are configured to serve requests?
[10:57:09] pc1007 is the pc1 master for media-wiki
[10:57:15] ok, and the others?
[10:57:27] 2 and 3, which ones?
[10:57:29] they're just replicas, aiui
[10:57:47] pc2 -> pc1008. pc3 -> pc1009
[10:57:47] no, I mean who serves the pc2 and pc3 sections?
[10:57:49] ok
[10:57:54] let me compare them
[10:58:17] pc2 has been upgraded, pc3 has not
[10:58:31] pc1008 is down?
[10:59:01] it's up, i'm ssh'd in
[10:59:05] oh, I see
[10:59:12] it is 10.4
[10:59:19] what version are the others?
[10:59:52] pc1007 & pc1008 are on buster with mariadb 10.4
[11:00:00] pc1009 is stretch, with mariadb 10.1
[11:01:54] query latency looks normal
[11:02:01] compared to the other active hosts
[11:03:56] buffer pool hit ratio is not great, but a normal 95%
[11:04:39] so yesterday pc1007 was reimaged?
[11:04:50] and today pc1008?
[11:05:09] pc1007 was reimaged yesterday, yes
[11:05:13] i haven't touched anything today
[11:05:16] oh
[11:05:19] pc1008 was reimaged some weeks ago
[11:05:49] it was put back as pc2 master on 2020-04-16
[11:05:51] do you remember when it was pooled back?
[11:05:55] pc1
[11:06:02] 1007
[11:06:09] which time?
[11:06:37] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/596173
[11:06:52] i think that's 10:16 UTC
[11:06:53] ok, 12:06
[11:07:02] yeah, -2 for utc, right
[11:07:18] yeah, it is concerning
[11:07:32] it lines up with extra errors that didn't happen before
[11:07:50] they are not "site is broken" errors
[11:07:56] but "wait and debug" errors
[11:08:15] lots of aborted clients on metrics
[11:08:26] * kormat nods
[11:08:31] https://grafana.wikimedia.org/d/000000273/mysql?panelId=10&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=pc1007&var-port=9104&from=1589344477331&to=1589454389699
[11:08:32] i'm glad i checked
[11:08:38] thanks indeed
[11:09:56] "InnoDB Row Lock Waits" is up a lot in the same time period. but i don't know if that's cause or effect
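The "lots of aborted clients" and "InnoDB Row Lock Waits" observations above come straight from server-side status counters, so they can be cross-checked without Grafana. A minimal sketch, assuming pymysql and a monitoring account are available on some observer host; the hostname, credentials and 60-second interval below are placeholders, not the actual WMF tooling:

```python
import time

import pymysql  # assumption: pymysql is installed on the observer host

COUNTERS = ("Aborted_clients", "Aborted_connects", "Innodb_row_lock_waits")


def snapshot(host, user, password):
    """Read the relevant counters via SHOW GLOBAL STATUS."""
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL STATUS")
            status = dict(cur.fetchall())
    finally:
        conn.close()
    return {name: int(status[name]) for name in COUNTERS}


if __name__ == "__main__":
    # Placeholder host/credentials; the counters are cumulative, so print deltas.
    prev = snapshot("pc1007.eqiad.wmnet", "watchdog", "secret")
    while True:
        time.sleep(60)
        now = snapshot("pc1007.eqiad.wmnet", "watchdog", "secret")
        print({name: now[name] - prev[name] for name in COUNTERS})
        prev = now
```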
[11:09:58] strangely, those errors don't appear on mw logs
[11:10:15] well, that is not too concerning, it just means there is "lots of activity"
[11:10:24] ack
[11:10:25] more activity == more locks
[11:10:32] that is "normal"
[11:10:57] but the aborts, not normal
[11:11:26] at 19:33 yesterday we had a dip in hit rate
[11:11:53] so I wonder if the current issue is not so much the upgrade as a potential app behaviour change
[11:12:00] hmm, right
[11:12:07] could be both
[11:12:37] there's a "train deploy" line at 19:00 yesterday
[11:12:42] look: https://grafana.wikimedia.org/d/000000106/parser-cache?panelId=1&fullscreen&orgId=1&from=1588849955310&to=1589454755311&var-contentModel=wikitext
[11:13:02] big dip in performance since around that time
[11:13:14] and we know there is an ongoing issue with caching
[11:13:42] https://phabricator.wikimedia.org/T247028
[11:13:46] which is an UBN
[11:14:06] let's check if we had an issue with the metadata server, too
[11:14:41] independently of what the cause is, I strongly suggest pausing reimages
[11:14:48] of parsercaches
[11:14:57] to avoid more factors
[11:15:02] define:UBN?
[11:15:06] sorry
[11:15:07] and agreed re: pausing reimages
[11:15:25] Unbreak Now, the highest level of priority for a ticket
[11:15:38] which means stop what you are doing and fix it
[11:15:46] also means "stop the train deployment"
[11:15:54] ah hah
[11:16:07] this is normally related to releng and mediawiki deploys
[11:16:24] although sres end up quite familiar with it :-D
[11:17:42] right :)
[11:20:40] so the concurrency seems normal
[11:20:45] on pc1007
[11:21:41] query latency seems normal too
[11:22:43] but disk latency seems all over the place
[11:23:02] even if the throughput and iops are normal
[11:23:36] the last line on the dashboard you sent tells the most important story
[11:23:56] it is normal to have an increase in iops and throughput after a restart
[11:24:03] caches are cold
[11:24:25] but latency went from 1-2 seconds to 20 seconds!
[11:24:34] yeah. and stayed there
[11:24:51] let's compare to the other upgrade
[11:24:55] pc1010?
[11:25:37] could it be our io scheduler changed for buster?
[11:26:03] let's also check other upgraded hosts
[11:26:21] pc1008 is the other upgraded _master_
[11:26:28] db1107 for example
[11:26:37] which has lots of queries
[11:27:39] its disk i/o is 10% of pc1007's
[11:27:55] pc1008?
[11:28:04] no, db1107
[11:28:14] well, I didn't want to compare them 1:1
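The io scheduler question raised above ("could it be our io scheduler changed for buster?") gets checked by hand just below with `cat /sys/block/sda/queue/scheduler`. For reference, here is a small sketch of the same check across all block devices on a host; the only assumption is the standard sysfs format where the kernel brackets the active scheduler:

```python
from pathlib import Path


def active_schedulers():
    """Return {device: active scheduler} by parsing sysfs.

    The kernel brackets the active scheduler, e.g. "noop [deadline] cfq"
    on stretch or "[mq-deadline] none" on buster.
    """
    result = {}
    for sched_file in Path("/sys/block").glob("*/queue/scheduler"):
        tokens = sched_file.read_text().split()
        active = next((t.strip("[]") for t in tokens if t.startswith("[")), None)
        result[sched_file.parent.parent.name] = active
    return result


if __name__ == "__main__":
    for dev, sched in sorted(active_schedulers().items()):
        print(f"{dev}: {sched}")
```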
[11:28:27] because it has a higher hit ratio and more memory, less data
[11:28:34] but if there was a change from before to now
[11:28:50] pc1008 also has that change
[11:29:25] pc1008 has fairly comparable disk latency and i/o stats to pc1007
[11:29:25] https://grafana.wikimedia.org/d/000000273/mysql?panelId=32&fullscreen&orgId=1&from=1581679760446&to=1589455760446&var-dc=eqiad%20prometheus%2Fops&var-server=pc1008&var-port=9104
[11:29:36] I think the upgrade has to do with it
[11:29:47] yep
[11:30:51] mmm
[11:31:10] cat /sys/block/sda/queue/scheduler
[11:31:15] noop [deadline] cfq
[11:31:15] disk scheduler has changed, yes
[11:31:22] [mq-deadline] none
[11:31:32] I don't even know what mq-deadline is
[11:32:27] but we may prefer none over it
[11:32:57] mq-deadline is the multi-queue form of deadline (for what that's worth)
[11:34:31] it doesn't necessarily have to be the os, could be mysql
[11:36:35] it has some performance errors too, they are not that rare on pcs
[11:36:56] see pc1009
[11:37:51] I am going to test changing the scheduler on pc1008
[11:38:13] may i suggest we change it on pc1010 instead?
[11:38:29] it's a spare in pc1, and its disk latency is 5s+
[11:38:30] ok, but we will not see a difference there
[11:38:39] as it is all serial changes
[11:38:52] but ok to test there first
[11:39:04] ah i see. ok. let me try that
[11:39:15] you do it?
[11:39:22] sure
[11:39:37] log when done
[11:41:11] I will actually let you handle this
[11:41:25] ok :) i'll report back with any findings
[11:41:32] I will go for lunch, write everything we discovered on a ticket, either the upgrade one
[11:41:34] or a new one
[11:41:43] and test on non-live hosts
[11:42:03] we can test on live ones when I come back (better 2 eyes) :-D
[11:42:26] ok? this looks like an important thing - config change on os or db
[11:42:49] sure thing
[11:43:58] it could be a stupid thing like innodb_flush_log_at_trx_commit or something for pcs
[11:44:10] will be back later
[11:47:31] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Kormat)
[11:51:15] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Kormat)
[11:59:32] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Kormat)
[12:14:28] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Kormat)
[12:15:38] jynus: i don't think mq-deadline vs deadline makes a difference. and in any case the kernel doesn't offer us 'deadline' on buster for these devices, i suspect because it knows they are multi-queue
[12:19:52] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Kormat) Running a diff between the `/sys/block/sda/queue` dirs on stretch vs buster gives this: ` --- stretch-pc1009 2020-05-14 14:18:21.659819745 +0200 +++ buster-pc1007 2020-05-14 14:18:21.6558...
[12:42:17] I trust you
[12:42:23] let's revert the manual changes
[12:42:33] and I would like to test my second theory
[12:42:49] i've reverted my manual change on pc1010
[12:42:54] but we need 2 hosts that are identical, one on buster and one on stretch
[12:43:30] and i would like to run sysbench sql on them
[12:43:51] so ideally not pc masters?
[12:43:59] yeah, not that
[12:44:02] idle ones
[12:44:04] depooled
[12:44:22] were all the codfw ones upgraded already?
[12:45:40] in parsercache sections? pc2009 looks to be the only one not upgraded so far
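The "run sysbench sql on them" idea above would look roughly like the sketch below: the same write-only workload against one stretch/10.1 host and one buster/10.4 host, comparing the latency figures sysbench reports. It assumes sysbench 1.0+ (with the bundled oltp_write_only test) on the driver box; the hostnames, credentials and table sizes are placeholders, and the actual host selection is still being discussed in the lines that follow.

```python
import subprocess

# Placeholder stretch-vs-buster pair; not a statement about the real test hosts.
HOSTS = ("pc2009.codfw.wmnet", "pc2010.codfw.wmnet")


def sysbench(host, stage):
    """Run one sysbench oltp_write_only stage ("prepare", "run" or "cleanup")."""
    cmd = [
        "sysbench", "oltp_write_only",
        f"--mysql-host={host}",
        "--mysql-user=sbtest", "--mysql-password=sbtest", "--mysql-db=sbtest",
        "--tables=8", "--table-size=1000000",
        "--threads=16", "--time=300", "--report-interval=10",
        stage,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    for host in HOSTS:
        sysbench(host, "prepare")
        print(host)
        print(sysbench(host, "run"))  # average / 95th percentile latency is in this output
        sysbench(host, "cleanup")
```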
[12:45:40] pc1009 wasn't, we can play with that maybe?
[12:45:45] *2009, sorry
[12:45:45] sure
[12:46:28] and maybe pc2010 for buster? it's a spare, hanging off pc1 currently
[12:46:37] that looks right
[12:46:42] we would need to stop replication
[12:46:54] and downtime it
[12:47:08] do those 2 show the same issue on graphs? even if less loaded?
[12:47:41] i'll take a look
[12:49:27] pc2010 very clearly shows the same disk latency issues. it was upgraded on the 12th
[12:49:47] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1589103008275&to=1589460548391&var-dc=codfw%20prometheus%2Fops&var-server=pc2010&var-port=9104&fullscreen&panelId=32
[12:51:01] one thing is that pc hosts do use the os filesystem cache
[12:51:10] so writes will likely be larger
[12:51:17] and pc2009 looks like it's mostly fine (or at least not nearly as bad as pc2010)
[12:51:21] my question is if the metrics change
[12:51:32] has an effect on latency of writes
[12:51:49] maybe because of the lower buffer pool hit ratio
[12:52:02] and the less consistent configuration
[12:52:25] it really doesn't affect db write latency, even if the latency of disk metrics is worse
[12:52:31] I would like to test it
[12:53:06] basically knowing, for what we have on metrics - how does that really affect us
[12:53:17] what metrics change are you referring to?
[12:53:22] flushing algorithms tend to change from version to version
[12:53:43] and as long as the "user perceived latency" is the same, we don't care about io latency
[12:53:52] the io write latency
[12:53:58] is what seems to be worse
[12:54:17] I checked the aborts, and while they are worse than on normal servers
[12:54:24] ah. so you're saying the graphs look worse, but maybe the way the measurement is happening has changed, but the real performance hasn't changed?
[12:54:27] they are not that abnormal for pc hosts
[12:54:32] either that
[12:54:44] or imagine that now writes happen in larger chunks
[12:54:52] latency would show as worse
[12:54:59] right
[12:55:01] but also because more is being written at once
[12:55:04] I don't know
[12:55:13] I just want to test it on a more realistic case
[12:55:19] do we have a regression on 10.4?
[12:55:30] so running a write only benchmark
[12:55:34] that is my idea
[12:56:01] a write only benchmark on a low memory setup like the pc* hosts
[12:56:27] see if latency is as bad as it looks, then reevaluate
[12:56:57] we didn't see huge differences on other hosts
[12:57:11] but those others do mostly memory-only operations
[12:57:31] and don't use buffered writes
[12:57:51] makes sense
[12:58:15] i have an onboarding chat in 2 mins (k8s), so i'll be gone for a bit
[12:58:22] yeah, don't worry
[12:58:31] I think we should work on this with manuel
[12:58:57] although maybe we can prepare some benchmarking tomorrow
[13:08:48] 10DBA: Refactor transfer.py - https://phabricator.wikimedia.org/T252172 (10Privacybatm) > There are some things that I would like you to address before fully closing this ticket: > > * There were non-explicit dependencies still on wmfmariadbpy that should be moved under the RemoteExecution directory: > > ` >...
[13:16:39] 10DBA: Refactor transfer.py - https://phabricator.wikimedia.org/T252172 (10jcrespo) > I don't know if I understood this correctly. Basically, I suggest to modify: ` def __init__(self, remote_execution) ~~~~~> def __init__(self, target_host, remote_execution): + self.target_host = target_host...
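The review comment quoted above is cut off, but the pattern it suggests - binding the target host once in `__init__` instead of passing it to every call - looks roughly like this. The class and method names here are purely illustrative, not the actual wmfmariadbpy/transfer.py code:

```python
class TargetedExecution:
    """Illustrative wrapper only; not the real transfer.py class."""

    def __init__(self, target_host, remote_execution):
        # Store the host once at construction time, as the review suggests...
        self.target_host = target_host
        self.remote_execution = remote_execution

    def run(self, command):
        # ...so later calls no longer need the host passed in explicitly.
        return self.remote_execution.run(self.target_host, command)


# Usage sketch (hypothetical names):
#   executor = TargetedExecution("db1107.eqiad.wmnet", cumin_execution)
#   executor.run(["df", "-h"])
```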
[13:55:11] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Marostegui) Maybe this is metrics related? The write tests done on pc1008 showed the same performance as on the rest of the pc hosts once the raid was recreated at T247787#5978444 We've not seen any gene...
[14:07:44] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10jcrespo) From what I see we did io benchmarks. I would like to know if real sql queries are affected, maybe MariaDB, on memory-limited hosts with loose disk consistency (pc) now generate more io...
[14:13:55] jynus: is manuel always this bad at being on vacation? :)
[14:19:33] lol
[14:19:37] kormat: Welcome to Wikimedia
[14:28:09] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Marostegui) >>! In T252761#6137012, @jcrespo wrote: > From what I see we did io benchmarks. I would like to know if real sql queries are affected, maybe MariaDB, on memory-limited hosts with loos...
[14:57:41] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10Jclark-ctr) updated ticket with HP. Scheduled main board replacement.
[15:22:42] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Marostegui) Quick test does show a difference in performance on the disks but not on the latency: pc1007: ` root@pc1007:/srv/tmp# ioping -D -i 0.5 . -c 100 -q -L -WWW --- . (xfs /dev/dm-0) iopin...
[15:25:27] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10jcrespo) > Maybe the sysbench result can give us a better picture of how this is affecting mysql query latency itself (if it is really doing so) Yeah, my hope is that it is just a metrics/behavior cha...
[16:03:27] 10DBA, 10Patch-For-Review: Refactor transfer.py - https://phabricator.wikimedia.org/T252172 (10jcrespo) Not directly related to refactoring, but I thought this was very interesting for you in general: I talked to Riccardo (cumin maintainer), and he mentioned he has plans to fix the configurability of the output...
[16:13:04] 10DBA, 10Patch-For-Review: Automate the detection of netcat listen port in transfer.py - https://phabricator.wikimedia.org/T252171 (10jcrespo) Regarding the ss issue: I was able to reproduce this: ` # lsmod | wc -l 90 # netstat -tlpn > /dev/null # lsmod | wc -l 90 # ss -tlpn > /dev/null # lsmod | wc -l 92 `...
[17:30:17] 10DBA: Improve output message readability of transfer.py - https://phabricator.wikimedia.org/T252802 (10Privacybatm)
[17:32:36] 10DBA, 10Patch-For-Review: Refactor transfer.py - https://phabricator.wikimedia.org/T252172 (10Privacybatm) Glad to hear about the cumin ticket :D I have made a new ticket regarding output message improvement here: T252802. Thank you for supporting this issue :-)
[20:07:12] 10DBA, 10Patch-For-Review: Automate the detection of netcat listen port in transfer.py - https://phabricator.wikimedia.org/T252171 (10Privacybatm) Oh okay, Thank you for the update!
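On the netcat listen port question (T252171), one possible approach - not necessarily the one the ticket settled on - is to skip parsing ss/netstat output entirely and let the kernel hand out a free port, which also avoids the `ss -tlpn` module-loading side effect reproduced above. A minimal sketch using only the standard library:

```python
import socket


def find_free_port(host=""):
    """Ask the kernel for an unused TCP port by binding to port 0.

    There is still a small race window between closing this probe socket and
    netcat binding the port, so callers should be prepared to retry on failure.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


# Usage sketch: port = find_free_port(), then start the receiver with
# something like `nc -l -p <port>` and point the sender at it.
```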