[00:57:03] DBA, Goal: Expand database provisioning/backup service to accomodate for growing capacity and high availability needs - https://phabricator.wikimedia.org/T257551 (Jclark-ctr)
[02:40:27] PROBLEM - MariaDB sustained replica lag on db1089 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1089&var-port=9104
[02:44:23] RECOVERY - MariaDB sustained replica lag on db1089 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1089&var-port=9104
[08:34:59] DBA, Operations, ops-codfw, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) a:Marostegui→Kormat
[08:35:07] DBA, Operations, ops-codfw, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) I'll take this over from the DBA side as manuel is on vacation.
[09:07:48] T260764 :-(
[09:07:48] T260764: backup2001 RAID controller failure - https://phabricator.wikimedia.org/T260764
[09:10:52] :(
[10:19:52] I've created the top memory usage on mysql-aggregated: https://grafana.wikimedia.org/d/000000278/mysql-aggregated
[10:20:36] mmm, I need to fix the links
[10:37:05] jynus: something weird is going on here: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2125&var-port=9104&from=1597824889052&to=1597832863736
[10:37:22] i started mariadb (with --skip-slave-start), ran 'start slave'
[10:37:26] replication started catching up
[10:37:36] aaand then it stopped running?
[10:38:30] db2125?
[10:38:35] let me see
[10:38:35] yes
[10:38:39] thanks
[10:39:40] the things to check are "Last_IO_Error:" and "Last_SQL_Error:"
[10:39:45] but there are none
[10:39:55] which means it was stopped "cleanly"
[10:40:04] by who or what is to be determined
[10:40:12] let me check logs
[10:41:03] kormat: what time was this? I believe it could have crashed again
[10:41:18] and you didn't notice it because it was quickly restarted
[10:41:46] start #1: Aug 19 08:31:45 db2125 systemd[1]: Starting mariadb database server
[10:41:58] that's when i started it
[10:42:06] replication stopped around 08:41
[10:42:11] crash: Aug 19 08:40:26 db2125 systemd[1]: mariadb.service: Main process exited, code=killed, status=6/ABRT
[10:42:23] start #2: Aug 19 08:40:31 db2125 systemd[1]: Starting mariadb database server...
[10:42:38] well shit
[10:42:57] I am guessing it wasn't you who "crashed" it :-D
[10:43:14] I will let you handle it
[10:43:23] thanks for spotting that :)
[10:43:24] my suggestion is to recover from backup
[10:43:35] just to be 100% sure it is not a data issue
[10:43:46] then restart, wait for crash
[10:44:00] also check hw logs, I believe this crashed last time due to hw issues
[10:44:04] check mgmt logs
[10:44:14] nothing in the idrac logs for this
[10:46:28] no new cpu reset like last time?
[10:46:54] correct
[10:47:58] one thing we have found out is that restarting the server sometimes uncovers hidden hw issues
[10:48:02] consider also doing that
[10:48:35] downtime the server for an extended period of time like I did with backup2001
[10:53:42] it's downtimed until the 25th
[10:54:05] cool then
[10:54:13] let me know if you need help for recovery
[10:54:46] cheers
[10:54:53] i'll reboot it now, and start recovery after lunch
[10:55:07] yeah, the server is not going anywhere :-D
[10:55:21] especially in that state
[11:09:12] when you come back, check also if mysql has been recently updated, to rule out mariadb version causes
[12:00:07] yes actually. 10.4.12-1 -> 10.4.13-1
[12:00:30] that was yesterday
[12:04:30] DBA, Operations, ops-codfw, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) I started mariadb, and started replication. Mariadb crashed after about 10 minutes. There's nothing in the idrac logs, so it could well be unrelated to the...
[13:09:07] BTW, I didn't have a deep look, but log looked strangely familiar to labsdb crashes
[13:22:37] hmm, right. `[ERROR] InnoDB: Unable to find a record to delete-mark`
[13:30:32] DBA, Operations, ops-codfw, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) Here's the log from the crash: {P12301} This looks similar to {T249188}
[16:34:09] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (RobH)
[16:34:16] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (RobH)
[16:43:57] DBA, DC-Ops, Operations, ops-codfw: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (RobH)
[16:44:21] DBA, DC-Ops, Operations, ops-codfw: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (RobH)
[16:44:31] DBA, DC-Ops, Operations, ops-codfw: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (RobH) a:Papaul
[17:58:45] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (wiki_willy) a:Cmjohnson→Jclark-ctr
[18:02:56] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (wiki_willy) a:Jclark-ctr
[18:40:51] DBA, Operations, Parsoid, serviceops, and 2 others: update mysql GRANTs for testreduce - https://phabricator.wikimedia.org/T260627 (Dzahn) Open→Resolved >>! In T260627#6392112, @Kormat wrote: > Hi, i've created the new grants. Please test and let me know if there are any issues. Cheers....
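
Editor's note on the db2125 discussion (10:41 to 10:42): the crash went unnoticed at first because systemd restarted mariadb about five seconds after the abort. The following is a minimal sketch, not anything used in the channel, of how that pattern could be picked out of journal lines. The function name `find_crash_restarts`, the regular expression, and the `year` default are assumptions made for illustration; the sample lines are the three quoted in the log, and month-name parsing assumes an English locale.

```python
import re
from datetime import datetime

# Matches systemd journal lines in the short format quoted in the log, e.g.
# "Aug 19 08:40:26 db2125 systemd[1]: mariadb.service: Main process exited, ..."
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) systemd\[1\]: (?P<msg>.*)$"
)


def find_crash_restarts(lines, year=2020):
    """Return (crash_time, restart_time, status) for each killed main
    process that is followed by a new "Starting mariadb" line.

    The journal's short format carries no year, so `year` is an assumption
    supplied by the caller.
    """
    events = []
    pending = None  # (crash_time, status) of a crash not yet matched to a start
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        msg = m["msg"]
        if "Main process exited" in msg and "code=killed" in msg:
            status = msg.rsplit("status=", 1)[-1]
            pending = (ts, status)
        elif msg.startswith("Starting mariadb") and pending:
            events.append((pending[0], ts, pending[1]))
            pending = None
    return events


sample = [
    "Aug 19 08:31:45 db2125 systemd[1]: Starting mariadb database server",
    "Aug 19 08:40:26 db2125 systemd[1]: mariadb.service: Main process exited, code=killed, status=6/ABRT",
    "Aug 19 08:40:31 db2125 systemd[1]: Starting mariadb database server...",
]

for crash, restart, status in find_crash_restarts(sample):
    gap = int((restart - crash).total_seconds())
    print(f"crash at {crash:%H:%M:%S} ({status}), restarted {gap}s later")
# prints: crash at 08:40:26 (6/ABRT), restarted 5s later
```

The first "Starting mariadb" line is deliberately ignored because no crash precedes it; only a start that follows a killed main process counts as an automatic restart.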