[06:38:42] <wikibugs>	 DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui)
[06:40:22] <wikibugs>	 DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) MySQL config is the same: ` 12 config differences Variable                  pc1008                    pc1007 ====================...
[06:56:39] <wikibugs>	 DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) Raid configuration is the same (included the cache policy): ` --- pc1007.raid 2020-03-17 06:45:54.531009723 +0000 +++ pc1008.raid...
[07:07:12] <wikibugs>	 DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) From what I can see on the graphs, none of the hosts reached disk or CPU saturation, but pc1008 did:    {F31686354} {F31686356}...
[07:20:10] <wikibugs>	 DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) I have also checked that all the FS are mounted with the same options, and they are.
[07:30:21] <wikibugs>	 DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) Table fragmentation % is almost the same on pc1007 and pc1008
[07:44:29] <wikibugs>	 DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) I don't see any obvious issues on the host itself or its database itself. Ideas:  1) Upgrade to buster + 10.4 and start testing i...
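The host-to-host checks above (MySQL variables, RAID settings, mount options) share one pattern: collect the same key/value settings from both hosts and list what differs. A minimal Python sketch of that pattern follows. The function name `diff_settings` and the sample values are hypothetical and not taken from the task; how the settings are collected (for example from `SHOW GLOBAL VARIABLES`) is left out.

```python
from typing import Dict, Optional, Tuple


def diff_settings(
    a: Dict[str, str], b: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {name: (value_a, value_b)} for settings that differ.

    A setting present on only one side is reported with None for the
    side where it is missing.
    """
    diffs = {}
    for name in sorted(set(a) | set(b)):
        value_a, value_b = a.get(name), b.get(name)
        if value_a != value_b:
            diffs[name] = (value_a, value_b)
    return diffs


# Hypothetical placeholder values, for illustration only.
host_a = {"server_id": "1", "read_only": "OFF", "max_connections": "100"}
host_b = {"server_id": "2", "read_only": "OFF", "max_connections": "100"}

for name, (value_a, value_b) in diff_settings(host_a, host_b).items():
    print(f"{name:20} {value_a!s:10} {value_b!s:10}")
```

Differences that are expected per host (such as a server id) still show up in the output, so the result needs a human read rather than a simple empty/non-empty check.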
[09:50:45] <jynus>	 I am manually retrying backup on codfw s4
[09:51:04] <jynus>	 acking alert
[09:54:32] <jynus>	 marostegui: there is a first change of query patterns for pc at 17:58
[09:54:39] <jynus>	 but there is a second
[09:54:55] <jynus>	 at 19:10 or so
[09:56:16] <marostegui>	 yeah, at the second one there is a big increase in writes
[09:56:30] <marostegui>	 which is still there
[09:56:32] <jynus>	 did you see the memcache issue I sent to releng?
[09:56:41] <jynus>	 I saw those same queries on processlist
[09:56:50] <jynus>	 I also got a sample of processlist during the issue
[09:57:20] <jynus>	 it is on cumin1001:~/p1 and p2
[09:57:36] <marostegui>	 no, where did you send that
[09:57:40] <marostegui>	 checking that
[09:57:50] <jynus>	 they are full queries
[09:57:56] <marostegui>	 I see them now yeah
[09:58:48] <jynus>	 I will let you lead research on your own while I fix some extra stuff, will join later
[09:58:55] <marostegui>	 cool
[09:58:57] <marostegui>	 Thanks!
[09:59:00] <jynus>	 but please look at the memcache issue
[09:59:03] <marostegui>	 yeah
[09:59:23] <marostegui>	 https://phabricator.wikimedia.org/T247562 but that has totally different timestamps
[09:59:32] <jynus>	 the train of thought is memcache fails to write keys -> goes to disk parsercache -> overload
[09:59:49] <jynus>	 sure, that is when TTL plays a part
[09:59:56] <jynus>	 don't know memcache keys ttl, etc.
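The hypothesis above (memcache fails to write keys, so requests fall through to the disk-backed parsercache) can be illustrated with a toy two-tier cache. This is only a sketch of the reasoning: `TieredCache`, `simulate` and the request count are hypothetical, and the code does not reflect how MediaWiki, memcached or the parsercache hosts are actually wired together.

```python
class TieredCache:
    """Toy model: a fast in-memory tier in front of a slow disk-backed tier."""

    def __init__(self, fast_writes_ok: bool = True):
        self.fast = {}                      # stands in for memcache
        self.slow = {}                      # stands in for the disk-backed cache
        self.fast_writes_ok = fast_writes_ok
        self.slow_reads = 0                 # load that reaches the slow tier

    def _fast_set(self, key, value):
        # When fast-tier writes fail, the key is silently never stored.
        if self.fast_writes_ok:
            self.fast[key] = value

    def get(self, key, render):
        if key in self.fast:
            return self.fast[key]
        # Fast-tier miss: fall through to the slow tier.
        self.slow_reads += 1
        value = self.slow.get(key)
        if value is None:
            value = render(key)
            self.slow[key] = value
        self._fast_set(key, value)
        return value


def simulate(fast_writes_ok: bool, requests: int = 100) -> int:
    """Request the same key repeatedly and return how often the slow tier was hit."""
    cache = TieredCache(fast_writes_ok)
    for _ in range(requests):
        cache.get("page:1", lambda k: "html for " + k)
    return cache.slow_reads


print(simulate(fast_writes_ok=True))    # slow tier hit once, then served from the fast tier
print(simulate(fast_writes_ok=False))   # every request reaches the slow tier
```

In this model the slow tier goes from one read to one read per request as soon as fast-tier writes stop sticking, which is the kind of load shift the hypothesis describes. TTLs, which the discussion notes are unknown, are not modelled.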
[11:22:31] <marostegui>	 what do you think about https://phabricator.wikimedia.org/T247787#5974734?
[11:24:21] <jynus>	 did you compare host metrics?
[11:24:34] <jynus>	 with another pooled pc
[11:24:40] <jynus>	 or power settings?
[11:25:04] <jynus>	 it doesn't necessarily have to be the host, it could be the key distribution?
[11:25:09] <jynus>	 in any case + 1
[11:25:34] <jynus>	 although remember to set replication from pc1010 at some point, even if we keep it depooled
[11:27:07] <marostegui>	 Yeah, I have compared as many things as I could think of :(
[11:29:54] <jynus>	 at least get a feel of the initial cause
[11:30:02] <jynus>	 e.g. slow queries vs more queries
[11:30:15] <jynus>	 and feel free to reimage
[11:30:24] <jynus>	 what version was running there?
[11:30:28] <jynus>	 (mysql)
[11:31:09] <jynus>	 did you check syslog for a cron (e.g. purging items) starting or something?
[11:32:14] <marostegui>	 10.1.43
[11:32:24] <jynus>	 same as the other, I imagine?
[11:32:28] <marostegui>	 yeah :(
[11:32:38] <marostegui>	 And no, I didn't check cronjobs, even though I suggested it yesterday, going to check it
[11:32:44] <marostegui>	 thanks for the reminder
[11:32:57] <jynus>	 cron locally or on mwmaint
[11:33:30] <marostegui>	 yeah, the local ones I did check
[11:45:29] <marostegui>	 as expected...nothing really :(
[11:46:01] <marostegui>	 so, I am tempted to go for #2 at https://phabricator.wikimedia.org/T247787#5974734
[11:56:15] <jynus>	 as I said, +1
[12:17:16] <jynus>	 if you end up reimaging, we can start moving the incident documentation to wikitech
[12:17:41] <jynus>	 I will do it if you don't do it first when I finish the backup fixing
[12:19:57] <jynus>	 the other thing I am seeing is a huge increase of disk latency
[12:20:00] <jynus>	 maybe that is normal
[12:20:12] <jynus>	 but I would run a disk perf test compared to other idle hosts
[12:20:25] <jynus>	 to see if we have a controller or disk issue, but not identified by the hw
[12:22:12] <jynus>	 pc1007 compared to pc1008 increase is way different (not attributed to the mw pattern changes, and without io throughput increase)
[12:25:12] <wikibugs>	 DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (hashar) I have checked logstash again, the issue has not occurred since March 11.
[13:25:23] <marostegui>	 jynus: the disk performance test is a good idea
[13:25:28] <marostegui>	 I will do that now before reimaging
[13:39:27] <wikibugs>	 DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) @jcrespo has suggested to do a disk performance testing just in case there's some sort of performance degradation not revealed by...
[14:11:34] <wikibugs>	 DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) So from some tests, it looks like pc1008's disks do perform worse for some reason:  **Random reads pc1007 vs pc1008:** ` fio...
[14:11:40] <marostegui>	 jynus: ^
[14:12:00] <jynus>	 both good and bad news
[14:12:29] <jynus>	 compare to 2007, however
[14:12:38] <jynus>	 as 1007 may have production impact
[14:13:28] <marostegui>	 Don't worry, I was monitoring 1007 during the tests
[14:13:52] <jynus>	 also if mgmt is responsive, a cold restart may reveal hw issues on new restart
[14:14:04] <jynus>	 not the first time it happens
[14:14:43] <marostegui>	 Interesting
[14:14:53] <marostegui>	 The uptime isn't high though
[14:14:55] <marostegui>	 But worth trying
[14:15:06] <jynus>	 I mean, at this point- yeah
[14:15:27] <jynus>	 are those hds or ssds?
[14:15:45] <jynus>	 there were some announcements of very low performance after some time due to firmware
[14:16:16] <jynus>	 (bugs)
[14:16:32] <marostegui>	 ssds
[14:16:43] <marostegui>	 I have checked FWs
[14:16:47] <marostegui>	 they are the same
[14:20:13] <jynus>	 we should have a couple of db hosts ready in case we have to do a quick pool, on pc or elsewhere
[14:20:22] <jynus>	 the test dbs could be those
[14:20:51] <jynus>	 backup work finished
[14:20:55] <marostegui>	 nice!
[14:21:06] <jynus>	 I can support you or write the incident report
[14:22:22] <marostegui>	 sure, I was planning on starting it tomorrow. Let me do the majority of the stuff, and I will let you know once ready for review/modifications
[14:22:33] <marostegui>	 So you don't waste time on the initial stuff
[14:22:59] <jynus>	 you know about the existing document, right?
[14:23:05] <marostegui>	 yep
[14:23:07] <jynus>	 ok
[14:23:13] <marostegui>	 but that should go to wikitech yeah
[14:23:17] <marostegui>	 I will take care of that :)
[14:23:28] <jynus>	 I was only planning to copy and paste :-D
[14:23:43] <marostegui>	 hahah
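The fio comparison of pc1007 and pc1008 reported at 14:11 comes down to running the same job on both hosts and comparing the results. Below is a minimal Python sketch of that comparison. It assumes fio was run with `--output-format=json` and that each job entry exposes `read.iops`; the helper names and the sample numbers are hypothetical and are not the figures from the task.

```python
def total_read_iops(fio_result: dict) -> float:
    """Sum read IOPS over all jobs in a parsed fio JSON result."""
    return sum(job["read"]["iops"] for job in fio_result["jobs"])


def slowdown_pct(baseline: dict, candidate: dict) -> float:
    """Percentage by which the candidate's read IOPS fall below the baseline's."""
    base = total_read_iops(baseline)
    cand = total_read_iops(candidate)
    if base <= 0:
        raise ValueError("baseline has no read IOPS to compare against")
    return (base - cand) / base * 100.0


# Hypothetical sample results, shaped like fio's JSON output.
# In practice these would be loaded with json.load() from each host's output file.
healthy_host = {"jobs": [{"jobname": "randread", "read": {"iops": 1000.0}}]}
suspect_host = {"jobs": [{"jobname": "randread", "read": {"iops": 750.0}}]}

print(f"suspect host is {slowdown_pct(healthy_host, suspect_host):.1f}% slower on random reads")
```

Both hosts should run an identical job definition, ideally while depooled or closely monitored, since a benchmark on a pooled host competes with production traffic.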
[16:21:15] <wikibugs>	 DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (wiki_willy) Hi @Marostegui - can you create a dc-ops task for the raid controller replacement?  We'll have to pull some logs to send over to...
[16:57:31] <wikibugs>	 DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) @wiki_willy keep in mind that I haven't been able to find any logs that shows a RAID controller malfunction unfortunately, it is...
[17:00:33] <wikibugs>	 DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (wiki_willy) Hi @Marostegui - we could try RMA'ing it (tho Dell will probably give us a hard time), if all other possibilities have been exhau...
[17:04:00] <wikibugs>	 DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) >>! In T247787#5976451, @wiki_willy wrote: > Hi @Marostegui - we could try RMA'ing it (tho Dell will probably give us a hard time...
[19:09:55] <wikibugs>	 DBA, Performance-Team, WMF-JobQueue, Wikimedia-Rdbms, and 2 others: read only on mediawiki generates "LoadBalancer.php: Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T218692 (Krinkle) a:aaron→Krinkle
[19:11:55] <wikibugs>	 DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (wiki_willy) Sure, that works for me @Marostegui .  Feel free to shoot open a dc-ops task and assign to @Jclark-ctr .   Thanks, Willy
[19:24:25] <wikibugs>	 DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (Mike_Peel) Looking at CPU usage at https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=now-30d&to=now I can't see anything obvious that would explain thing...
[20:41:38] <wikibugs>	 DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (zhuyifei1999) > @zhuyifei1999 Perhaps there could be some sort of a trusted user set on quarry that can run things for longer?  How do you want such a list to be made? I obvio...
[21:03:01] <wikibugs>	 DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (Mike_Peel) >>! In T246970#5977568, @zhuyifei1999 wrote: >> @zhuyifei1999 Perhaps there could be some sort of a trusted user set on quarry that can run things for longer? >  >...
[21:53:48] <wikibugs>	 DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (zhuyifei1999) > a simple request system at https://www.mediawiki.org/wiki/Talk:Quarry  I don't like the idea of flooding a help page with access requests (or perhaps there wil...
[21:55:41] <wikibugs>	 DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (Mike_Peel) I expect that there would be few requests. Phab would also work.
[21:58:08] <wikibugs>	 DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (bd808) >>! In T246970#5977262, @Mike_Peel wrote: > 3. Is there a way to request more direct access to the replicas, ideally with an example of how to run a MySQL query and out...