[06:38:42] DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui)
[06:40:22] DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) MySQL config is the same: ` 12 config differences Variable pc1008 pc1007 ====================...
[06:56:39] DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) Raid configuration is the same (included the cache policy): ` --- pc1007.raid 2020-03-17 06:45:54.531009723 +0000 +++ pc1008.raid...
[07:07:12] DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) From what I can see on the graphs, none of the hosts reached disk or CPU saturation, but pc1008 did: {F31686354} {F31686356}...
[07:20:10] DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) I have also checked that all the FS are mounted with the same options, and they are.
[07:30:21] DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) Table fragmentation % is almost the same on pc1007 and pc1008
[07:44:29] DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) I don't see any obvious issues on the host itself or its database itself. Ideas: 1) Upgrade to buster + 10.4 and start testing i...
[09:50:45] I am manually retrying backup on codfw s4
[09:51:04] acking alert
[09:54:32] marostegui: there is a first change of query patterns for pc at 17:58
[09:54:39] but there is a second
[09:54:55] at 19:10 or so
[09:56:16] yeah, at second there is a big increase on writes
[09:56:30] which is still there
[09:56:32] did you see the memcache issue I sent to releng?
[09:56:41] I saw those same queries on processlist
[09:56:50] I also got a sample of processlist during the issue
[09:57:20] it is on cumin1001:~/p1 and p2
[09:57:36] no, where did you send that
[09:57:40] checking that
[09:57:50] they are full queries
[09:57:56] I see them now yeah
[09:58:48] I will let you lead research on your own while I fix some extra stuff, will join later
[09:58:55] cool
[09:58:57] Thanks!
[09:59:00] but please look at the memcache issue
[09:59:03] yeah
[09:59:23] https://phabricator.wikimedia.org/T247562 but that has totally different timestamps
[09:59:32] the train of thought is memcache fails to write keys -> goes to disk parsercache -> overload
[09:59:49] sure, that is when TTL plays a part
[09:59:56] don't know memcache keys ttl, etc.
[11:22:31] what do you think about https://phabricator.wikimedia.org/T247787#5974734?
[11:24:21] did you compare host metrics?
[11:24:34] with another pooled pc
[11:24:40] or power settings?
[11:25:04] it doesn't necessarily have to be the host, it could be the key distribution?
[11:25:09] in any case + 1
[11:25:34] although remember to set replication from pc1010 at some point, even if we keep it depooled
[11:27:07] Yeah, I have compared as many things as I could have thought :(
[11:29:54] at least get a feel of the initial cause
[11:30:02] e.g. slow queries vs more queries
[11:30:15] and feel free to reimage
[11:30:24] what version was running there?
[11:30:28] (mysql)
[11:31:09] did you check syslog for a cron (e.g. purging items) starting or something?
[11:32:14] 10.1.43
[11:32:24] same as the other, I imagine?
[11:32:28] yeah :(
[11:32:38] And no, I didn't check cronjobs, even though I suggested it yesterday, going to check it
[11:32:44] thanks for the reminder
[11:32:57] cron locally or on mwmaint
[11:33:30] yeah, the local ones I did check
[11:45:29] as expected...nothing really :(
[11:46:01] so, I am tempted to go for #2 at https://phabricator.wikimedia.org/T247787#5974734
[11:56:15] as I said, +1
[12:17:16] if you end up reimaging, we can start moving the incident documentation to wikitech
[12:17:41] I will do it if you don't do it first when I finish the backup fixing
[12:19:57] the other thing I am seeing is a huge increase of disk latency
[12:20:00] maybe that is normal
[12:20:12] but I would run a disk perf test compared to other idle hosts
[12:20:25] to see if we have a controller or disk issue, but not identified by the hw
[12:22:12] pc1007 compared to pc1008 increase is why different (not attributed to the mw pattern changes, and without io throughput increase)
[12:22:23] s/why/way/
[12:25:12] DBA, Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (hashar) I have checked logstash again, the issue has not occurred since March 11.
[13:25:23] jynus: the disk performance test is a good ida
[13:25:24] idea
[13:25:28] I will do that now before reimaging
[13:39:27] DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) @jcrespo has suggested to do a disk performance testing just in case there's some sort of performance degradation not revealed by...
[14:11:34] DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) So from some tests, it looks like that pc1008's disk do perform worse for some reason: **Random reads pc1007 vs pc1008:** ` fio...
[14:11:40] jynus: ^
[14:12:00] both good and bad news
[14:12:29] compare to 2007, however
[14:12:38] as 1008 may have production impact
[14:12:41] sorry
[14:12:44] 1007
[14:13:28] Don't worry, I was monitoring 1007 during the tests
[14:13:52] also if mgmt is responsive, a cold restart may reveal hw issues on new restart
[14:14:04] not the first time it happens
[14:14:43] Interesting
[14:14:53] The uptime isn't high though
[14:14:55] But worth trying
[14:15:06] I mean, at this point- yeah
[14:15:27] are those hds or ssds?
[14:15:45] there were some announcements of very low performance after some time due to firmware
[14:16:16] (bugs)
[14:16:32] ssds
[14:16:43] I have checked FWs
[14:16:47] they are the same
[14:20:13] we should have ready a couple of db hosts in case we have to do a quick pool, on pc or elsewhere
[14:20:22] the test dbs could be those
[14:20:51] backup work finished
[14:20:55] nice!
[14:21:06] I can support you or write the incident report
[14:22:22] sure, I was planning on starting it tomorrow. Let me do the majority of the stuff, and I will let you know once ready for review/modifications
[14:22:33] So you don't waste time on the initial stuff
[14:22:59] you know about the exiting document, right?
[14:23:05] yep
[14:23:06] *existing
[14:23:07] ok
[14:23:13] but that should go to wikitech yeah
[14:23:17] I will take care of that :)
[14:23:28] I was only planning to copy and paste :-D
[14:23:43] hahah
[16:21:15] DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (wiki_willy) Hi @Marostegui - can you create a dc-ops task for the raid controller replacement? We'll have to pull some logs to send over to...
[16:57:31] DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) @wiki_willy keep in mind that I haven't been able to find any logs that shows a RAID controller malfunction unfortunately, it is...
[17:00:33] DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (wiki_willy) Hi @Marostegui - we could try RMA'ing it (tho Dell will probably give us a hard time), if all other possibilities have been exhau...
[17:04:00] DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) >>! In T247787#5976451, @wiki_willy wrote: > Hi @Marostegui - we could try RMA'ing it (tho Dell will probably give us a hard time...
[19:09:55] DBA, Performance-Team, WMF-JobQueue, Wikimedia-Rdbms, and 2 others: read only on mediawiki generates "LoadBalancer.php: Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T218692 (Krinkle) a:aaron→Krinkle
[19:11:55] DBA, Operations, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (wiki_willy) Sure, that works for me @Marostegui . Feel free to shoot open a dc-ops task and assign to @Jclark-ctr . Thanks, Willy
[19:24:25] DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (Mike_Peel) Looking at CPU usage at https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=now-30d&to=now I can't see anything obvious that would explain thing...
[20:41:38] DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (zhuyifei1999) > @zhuyifei1999 Perhaps there could be some sort of a trusted user set on quarry that can run things for longer? How do you want such a list to be made? I obvio...
[21:03:01] DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (Mike_Peel) >>! In T246970#5977568, @zhuyifei1999 wrote: >> @zhuyifei1999 Perhaps there could be some sort of a trusted user set on quarry that can run things for longer? > >...
[21:53:48] DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (zhuyifei1999) > a simple request system at https://www.mediawiki.org/wiki/Talk:Quarry I don't like the idea of flooding a help page with access requests (or perhaps there wil...
[21:55:41] DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (Mike_Peel) I expect that there would be few requests. Phab would also work.
[21:58:08] DBA, Data-Services, Quarry: Quarry: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T246970 (bd808) >>! In T246970#5977262, @Mike_Peel wrote: > 3. Is there a way to request more direct access to the replicas, ideally with an example of how to run a MySQL query and out...