[09:50:50] 10DBA, 10Patch-For-Review: Refactor transfer.py - https://phabricator.wikimedia.org/T252172 (10jcrespo) p:05Triage→03Medium Great job here! I looked at every line of the change, and tested it on several runs and it worked nicely. This change, I think, will make further development much easier. You did a lo...
[10:49:53] looking at pc1 after the migration, some things seem a bit odd
[10:49:59] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=pc1007&var-port=9104&from=now-7d&to=now
[10:50:12] the monitoring query latency has gone waaay up
[10:50:39] and the "Total InnoDB Memory" line has disappeared from the Memory Usage graph
[10:51:32] yeah, this is a known issue
[10:51:34] disk latency is also much worse
[10:51:37] ah, ok
[10:51:42] that is not ok, though
[10:51:46] let me see
[10:53:09] not only disk latency, latency in general is really bad
[10:53:54] ah, you're also looking at "InnoDB Wait Time"?
[10:55:19] what is the current pooling status?
[10:55:46] I am looking at monitoring latency time, plus other tools to look at latency
[10:55:46] i'm not sure what that means
[10:55:53] lots of connection errors
[10:56:16] which come from excessive latency
[10:56:37] what is the current pooling state?
[10:56:53] i don't know what that means
[10:56:56] which hosts are live?
[10:57:07] which are configured to serve requests?
[10:57:09] pc1007 is the pc1 master for media-wiki
[10:57:15] ok, and the others?
[10:57:27] 2 and 3, which ones?
[10:57:29] they're just replicas, aiui
[10:57:47] pc2 -> pc1008. pc3 -> pc1009
[10:57:47] no, I mean who serves the pc2 and pc3 sections?
[10:57:49] ok
[10:57:54] let me compare them
[10:58:17] pc2 has been upgraded, pc3 has not
[10:58:31] pc1008 is down?
[10:59:01] it's up, i'm ssh'd in
[10:59:05] oh, I see
[10:59:12] it is 10.4
[10:59:19] what version are the others?
[10:59:52] pc1007 & pc1008 are on buster with mariadb 10.4
[11:00:00] pc1009 is stretch, with mariadb 10.1
[11:01:54] query latency looks normal
[11:02:01] compared to the other active hosts
[11:03:56] buffer pool hit ratio is not great, but a normal 95%
[11:04:39] so yesterday pc1007 was reimaged?
[11:04:50] and today pc1008?
[11:05:09] pc1007 was reimaged yesterday, yes
[11:05:13] i haven't touched anything today
[11:05:16] oh
[11:05:19] pc1008 was reimaged some weeks ago
[11:05:49] it was put back as pc2 master on 2020-04-16
[11:05:51] do you remember when it was pooled back?
[11:05:55] pc1
[11:06:02] 1007
[11:06:09] which time?
[11:06:37] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/596173
[11:06:52] i think that's 10:16 UTC
[11:06:53] ok, 12:06
[11:07:02] yeah, -2 for utc, right
[11:07:18] yeah, it is concerning
[11:07:32] it lines up with extra errors that didn't happen before
[11:07:50] they are not "site is broken" errors
[11:07:56] but "wait and debug" errors
[11:08:15] lots of aborted clients on metrics
[11:08:26] * kormat nods
[11:08:31] https://grafana.wikimedia.org/d/000000273/mysql?panelId=10&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=pc1007&var-port=9104&from=1589344477331&to=1589454389699
[11:08:32] i'm glad i checked
[11:08:38] thanks indeed
[11:09:56] "InnoDB Row Lock Waits" is up a lot in the same time period. but i don't know if that's cause or effect
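The "lots of aborted clients" and "InnoDB Row Lock Waits" observations above come straight from server-side status counters, so they can be cross-checked without Grafana. A minimal sketch, assuming pymysql and a monitoring account are available on some observer host; the hostname, credentials and 60-second interval below are placeholders, not the actual WMF tooling:

```python
import time

import pymysql  # assumption: pymysql is installed on the observer host

COUNTERS = ("Aborted_clients", "Aborted_connects", "Innodb_row_lock_waits")


def snapshot(host, user, password):
    """Read the relevant counters via SHOW GLOBAL STATUS."""
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL STATUS")
            status = dict(cur.fetchall())
    finally:
        conn.close()
    return {name: int(status[name]) for name in COUNTERS}


if __name__ == "__main__":
    # Placeholder host/credentials; the counters are cumulative, so print deltas.
    prev = snapshot("pc1007.eqiad.wmnet", "watchdog", "secret")
    while True:
        time.sleep(60)
        now = snapshot("pc1007.eqiad.wmnet", "watchdog", "secret")
        print({name: now[name] - prev[name] for name in COUNTERS})
        prev = now
```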
[11:09:58] strangely, those errors don't appear on mw logs
[11:10:15] well, that is not too concerning, it just means there is "lots of activity"
[11:10:24] ack
[11:10:25] more activity == more locks
[11:10:32] that is "normal"
[11:10:57] but the aborts, not normal
[11:11:26] at 19:33 yesterday we had a dip in hit rate
[11:11:53] so I wonder if the current issue is not so much the upgrade as a potential app behaviour change
[11:12:00] hmm, right
[11:12:07] could be both
[11:12:37] there's a "train deploy" line at 19:00 yesterday
[11:12:42] look: https://grafana.wikimedia.org/d/000000106/parser-cache?panelId=1&fullscreen&orgId=1&from=1588849955310&to=1589454755311&var-contentModel=wikitext
[11:13:02] big dip in performance since around that time
[11:13:14] and we know there is an ongoing issue with caching
[11:13:42] https://phabricator.wikimedia.org/T247028
[11:13:46] which is an UBN
[11:14:06] let's check if we had an issue with the metadata server, too
[11:14:41] independently of what the cause is, I strongly suggest pausing reimages
[11:14:48] of parsercaches
[11:14:57] to avoid more factors
[11:15:02] define:UBN?
[11:15:06] sorry
[11:15:07] and agreed re: pausing reimages
[11:15:25] Unbreak Now, the highest level of priority for a ticket
[11:15:38] which means stop what you are doing and fix it
[11:15:46] also means "stop the train deployment"
[11:15:54] ah hah
[11:16:07] this is normally related to releng and mediawiki deploys
[11:16:24] although sres end up quite familiar with it :-D
[11:17:42] right :)
[11:20:40] so the concurrency seems normal
[11:20:45] on pc1007
[11:21:41] query latency seems normal too
[11:22:43] but disk latency seems all over the place
[11:23:02] even if the throughput and iops are normal
[11:23:36] the last line on the dashboard you sent tells the most important story
[11:23:56] it is normal to have an increase in iops and throughput after a restart
[11:24:03] caches are cold
[11:24:25] but latency went from 1-2 seconds to 20 seconds!
[11:24:34] yeah. and stayed there
[11:24:51] let's compare to the other upgrade
[11:24:55] pc1010?
[11:25:37] could it be our io scheduler changed for buster?
[11:26:03] let's also check other upgraded hosts
[11:26:21] pc1008 is the other upgraded _master_
[11:26:28] db1107 for example
[11:26:37] which has lots of queries
[11:27:39] its disk i/o is 10% of pc1007's
[11:27:55] pc1008?
[11:28:04] no, db1107
[11:28:14] well, I didn't want to compare them 1:1
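The io scheduler question raised above ("could it be our io scheduler changed for buster?") gets checked by hand just below with `cat /sys/block/sda/queue/scheduler`. For reference, here is a small sketch of the same check across all block devices on a host; the only assumption is the standard sysfs format where the kernel brackets the active scheduler:

```python
from pathlib import Path


def active_schedulers():
    """Return {device: active scheduler} by parsing sysfs.

    The kernel brackets the active scheduler, e.g. "noop [deadline] cfq"
    on stretch or "[mq-deadline] none" on buster.
    """
    result = {}
    for sched_file in Path("/sys/block").glob("*/queue/scheduler"):
        tokens = sched_file.read_text().split()
        active = next((t.strip("[]") for t in tokens if t.startswith("[")), None)
        result[sched_file.parent.parent.name] = active
    return result


if __name__ == "__main__":
    for dev, sched in sorted(active_schedulers().items()):
        print(f"{dev}: {sched}")
```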
[11:28:27] because it has a higher hit ratio and more memory, less data
[11:28:34] but if there was a change from before to now
[11:28:50] pc1008 also has that change
[11:29:25] pc1008 has fairly comparable disk latency and i/o stats to pc1007
[11:29:25] https://grafana.wikimedia.org/d/000000273/mysql?panelId=32&fullscreen&orgId=1&from=1581679760446&to=1589455760446&var-dc=eqiad%20prometheus%2Fops&var-server=pc1008&var-port=9104
[11:29:36] I think the upgrade has to do with it
[11:29:47] yep
[11:30:51] mmm
[11:31:10] cat /sys/block/sda/queue/scheduler
[11:31:15] noop [deadline] cfq
[11:31:15] disk scheduler has changed, yes
[11:31:22] [mq-deadline] none
[11:31:32] I don't even know what mq-deadline is
[11:32:27] but we may prefer none over it
[11:32:57] mq-deadline is the multi-queue form of deadline (for what that's worth)
[11:34:31] it doesn't necessarily have to be the os, could be mysql
[11:36:35] it has some performance errors too, they are not that rare on pcs
[11:36:56] see pc1009
[11:37:51] I am going to test changing the scheduler on pc1008
[11:38:13] may i suggest we change it on pc1010 instead?
[11:38:29] it's a spare in pc1, and its disk latency is 5s+
[11:38:30] ok, but we will not see a difference there
[11:38:39] as it is all serial changes
[11:38:52] but ok to test there first
[11:39:04] ah i see. ok. let me try that
[11:39:15] you do it?
[11:39:22] sure
[11:39:37] log when done
[11:41:11] I will actually let you handle this
[11:41:25] ok :) i'll report back with any findings
[11:41:32] I will go for lunch, write everything we discovered on a ticket, either the upgrade one
[11:41:34] or a new one
[11:41:43] and test on non-live hosts
[11:42:03] we can test on live ones when I come back (better 2 eyes) :-D
[11:42:26] ok? this looks like an important thing - config change on os or db
[11:42:49] sure thing
[11:43:58] it could be a stupid thing like innodb_flush_log_at_trx_commit or something for pcs
[11:44:10] will be back later
[11:47:31] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Kormat)
[11:51:15] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Kormat)
[11:59:32] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Kormat)
[12:14:28] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Kormat)
[12:15:38] jynus: i don't think mq-deadline vs deadline makes a difference. and in any case the kernel doesn't offer us 'deadline' on buster for these devices, i suspect because it knows they are multi-queue
[12:19:52] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Kormat) Running a diff between the `/sys/block/sda/queue` dirs on stretch vs buster gives this: ` --- stretch-pc1009 2020-05-14 14:18:21.659819745 +0200 +++ buster-pc1007 2020-05-14 14:18:21.6558...
[12:42:17] I trust you
[12:42:23] let's revert the manual changes
[12:42:33] and I would like to test my second theory
[12:42:49] i've reverted my manual change on pc1010
[12:42:54] but we need 2 hosts that are identical, one on buster and one on stretch
[12:43:30] and i would like to run sysbench sql on them
[12:43:51] so ideally not pc masters?
[12:43:59] yeah, not that
[12:44:02] idle ones
[12:44:04] depooled
[12:44:22] were all the codfw ones upgraded already?
[12:45:40] in parsercache sections? pc2009 looks to be the only one not upgraded so far
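The "run sysbench sql on them" idea above would look roughly like the sketch below: the same write-only workload against one stretch/10.1 host and one buster/10.4 host, comparing the latency figures sysbench reports. It assumes sysbench 1.0+ (with the bundled oltp_write_only test) on the driver box; the hostnames, credentials and table sizes are placeholders, and the actual host selection is still being discussed in the lines that follow.

```python
import subprocess

# Placeholder stretch-vs-buster pair; not a statement about the real test hosts.
HOSTS = ("pc2009.codfw.wmnet", "pc2010.codfw.wmnet")


def sysbench(host, stage):
    """Run one sysbench oltp_write_only stage ("prepare", "run" or "cleanup")."""
    cmd = [
        "sysbench", "oltp_write_only",
        f"--mysql-host={host}",
        "--mysql-user=sbtest", "--mysql-password=sbtest", "--mysql-db=sbtest",
        "--tables=8", "--table-size=1000000",
        "--threads=16", "--time=300", "--report-interval=10",
        stage,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    for host in HOSTS:
        sysbench(host, "prepare")
        print(host)
        print(sysbench(host, "run"))  # average / 95th percentile latency is in this output
        sysbench(host, "cleanup")
```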
[12:45:40] pc1009 wasn't, we can play with that maybe?
[12:45:45] *2009, sorry
[12:45:45] sure
[12:46:28] and maybe pc2010 for buster? it's a spare, hanging off pc1 currently
[12:46:37] that looks right
[12:46:42] we would need to stop replication
[12:46:54] and downtime it
[12:47:08] do those 2 show the same issue on graphs? even if less loaded?
[12:47:41] i'll take a look
[12:49:27] pc2010 very clearly shows the same disk latency issues. it was upgraded on the 12th
[12:49:47] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1589103008275&to=1589460548391&var-dc=codfw%20prometheus%2Fops&var-server=pc2010&var-port=9104&fullscreen&panelId=32
[12:51:01] one thing is that pc hosts do use the os filesystem cache
[12:51:10] so writes will likely be larger
[12:51:17] and pc2009 looks like it's mostly fine (or at least not nearly as bad as pc2010)
[12:51:21] my question is if the metrics change
[12:51:32] has an effect on latency of writes
[12:51:49] maybe because of the lower buffer pool hit ratio
[12:52:02] and the less consistent configuration
[12:52:25] it really doesn't affect db write latency, even if the latency of disk metrics is worse
[12:52:31] I would like to test it
[12:53:06] basically knowing, for what we have on metrics - how does that really affect us
[12:53:17] what metrics change are you referring to?
[12:53:22] flushing algorithms tend to change from version to version
[12:53:43] and as long as the "user perceived latency" is the same, we don't care about io latency
[12:53:52] the io write latency
[12:53:58] is what seems to be worse
[12:54:17] I checked the aborts, and while they are worse than on normal servers
[12:54:24] ah. so you're saying the graphs look worse, but maybe the way the measurement is happening has changed, but the real performance hasn't changed?
[12:54:27] they are not that abnormal for pc hosts
[12:54:32] either that
[12:54:44] or imagine that now writes happen in larger chunks
[12:54:52] latency would show as worse
[12:54:59] right
[12:55:01] but also because more is being written at once
[12:55:04] I don't know
[12:55:13] I just want to test it on a more realistic case
[12:55:19] do we have a regression on 10.4?
[12:55:30] so running a write only benchmark
[12:55:34] that is my idea
[12:56:01] a write only benchmark on a low memory setup like the pc* hosts
[12:56:27] see if latency is as bad as it looks, then reevaluate
[12:56:57] we didn't see huge differences on other hosts
[12:57:11] but those others do mostly memory-only operations
[12:57:31] and don't use buffered writes
[12:57:51] makes sense
[12:58:15] i have an onboarding chat in 2 mins (k8s), so i'll be gone for a bit
[12:58:22] yeah, don't worry
[12:58:31] I think we should work on this with manuel
[12:58:57] although maybe we can prepare some benchmarking tomorrow
[13:08:48] 10DBA: Refactor transfer.py - https://phabricator.wikimedia.org/T252172 (10Privacybatm) > There are some things that I would like you to address before fully closing this ticket: > > * There were non-explicit dependencies still on wmfmariadbpy that should be moved under the RemoteExecution directory: > > ` >...
[13:16:39] 10DBA: Refactor transfer.py - https://phabricator.wikimedia.org/T252172 (10jcrespo) > I don't know if I understood this correctly. Basically, I suggest to modify: ` def __init__(self, remote_execution) ~~~~~> def __init__(self, target_host, remote_execution): + self.target_host = target_host...
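The review comment quoted above is cut off, but the pattern it suggests - binding the target host once in `__init__` instead of passing it to every call - looks roughly like this. The class and method names here are purely illustrative, not the actual wmfmariadbpy/transfer.py code:

```python
class TargetedExecution:
    """Illustrative wrapper only; not the real transfer.py class."""

    def __init__(self, target_host, remote_execution):
        # Store the host once at construction time, as the review suggests...
        self.target_host = target_host
        self.remote_execution = remote_execution

    def run(self, command):
        # ...so later calls no longer need the host passed in explicitly.
        return self.remote_execution.run(self.target_host, command)


# Usage sketch (hypothetical names):
#   executor = TargetedExecution("db1107.eqiad.wmnet", cumin_execution)
#   executor.run(["df", "-h"])
```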
[13:55:11] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Marostegui) Maybe this is metrics related? The write tests done on pc1008 showed the same performance as on the rest of the pc hosts once the raid was recreated at T247787#5978444 We've not seen any gene...
[14:07:44] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10jcrespo) From what I see we did io benchmarks. I would like to know if real sql queries are affected, maybe MariaDB, on memory-limited hosts with loose disk consistency (pc) now generate more io...
[14:13:55] jynus: is manuel always this bad at being on vacation? :)
[14:19:33] lol
[14:19:37] kormat: Welcome to Wikimedia
[14:28:09] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Marostegui) >>! In T252761#6137012, @jcrespo wrote: > From what I see we did io benchmarks. I would like to know if real sql queries are affected, maybe MariaDB, on memory-limited hosts with loos...
[14:57:41] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10Jclark-ctr) updated ticket with HP. Scheduled main board replacement.
[15:22:42] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10Marostegui) Quick test does show a difference in performance on the disks but not on the latency: pc1007: ` root@pc1007:/srv/tmp# ioping -D -i 0.5 . -c 100 -q -L -WWW --- . (xfs /dev/dm-0) iopin...
[15:25:27] 10DBA: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 (10jcrespo) > Maybe the sysbench result can give us a better picture of how this is affecting mysql query latency itself (if it is really doing so) Yeah, my hope is that it is just a metrics/behavior cha...
[16:03:27] 10DBA, 10Patch-For-Review: Refactor transfer.py - https://phabricator.wikimedia.org/T252172 (10jcrespo) Not directly related to refactoring, but I thought this was very interesting for you in general: I talked to Riccardo (cumin maintainer), and he mentioned he has plans to fix the configurability of the output...
[16:13:04] 10DBA, 10Patch-For-Review: Automate the detection of netcat listen port in transfer.py - https://phabricator.wikimedia.org/T252171 (10jcrespo) Regarding the ss issue: I was able to reproduce this: ` # lsmod | wc -l 90 # netstat -tlpn > /dev/null # lsmod | wc -l 90 # ss -tlpn > /dev/null # lsmod | wc -l 92 `...
[17:30:17] 10DBA: Improve output message readability of transfer.py - https://phabricator.wikimedia.org/T252802 (10Privacybatm)
[17:32:36] 10DBA, 10Patch-For-Review: Refactor transfer.py - https://phabricator.wikimedia.org/T252172 (10Privacybatm) Glad to hear about the cumin ticket :D I have made a new ticket regarding output message improvement here: T252802. Thank you for supporting this issue :-)
[20:07:12] 10DBA, 10Patch-For-Review: Automate the detection of netcat listen port in transfer.py - https://phabricator.wikimedia.org/T252171 (10Privacybatm) Oh okay, Thank you for the update!
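On the netcat listen port question (T252171), one possible approach - not necessarily the one the ticket settled on - is to skip parsing ss/netstat output entirely and let the kernel hand out a free port, which also avoids the `ss -tlpn` module-loading side effect reproduced above. A minimal sketch using only the standard library:

```python
import socket


def find_free_port(host=""):
    """Ask the kernel for an unused TCP port by binding to port 0.

    There is still a small race window between closing this probe socket and
    netcat binding the port, so callers should be prepared to retry on failure.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


# Usage sketch: port = find_free_port(), then start the receiver with
# something like `nc -l -p <port>` and point the sender at it.
```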