[00:57:03] <wikibugs>	 DBA, Goal: Expand database provisioning/backup service to accomodate for growing capacity and high availability needs - https://phabricator.wikimedia.org/T257551 (Jclark-ctr)
[02:40:27] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on db1089 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1089&var-port=9104
[02:44:23] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on db1089 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1089&var-port=9104
[08:34:59] <wikibugs>	 DBA, Operations, ops-codfw, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) a:Marostegui→Kormat
[08:35:07] <wikibugs>	 DBA, Operations, ops-codfw, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) I'll take this over from the DBA side as manuel is on vacation.
[09:07:48] <jynus>	 T260764 :-(
[09:07:48] <stashbot>	 T260764: backup2001 RAID controller failure - https://phabricator.wikimedia.org/T260764
[09:10:52] <kormat>	 :(
[10:19:52] <jynus>	 I've created the top memory usage on mysql-aggregated: https://grafana.wikimedia.org/d/000000278/mysql-aggregated
[10:20:36] <jynus>	 mmm, I need to fix the links
[10:37:05] <kormat>	 jynus: something weird is going on here: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2125&var-port=9104&from=1597824889052&to=1597832863736
[10:37:22] <kormat>	 i started mariadb (with --skip-slave-start), ran 'start slave'
[10:37:26] <kormat>	 replication started catching up
[10:37:36] <kormat>	 aaand then it stopped running?
[10:38:30] <jynus>	 db2125?
[10:38:35] <jynus>	 let me see
[10:38:35] <kormat>	 yes
[10:38:39] <kormat>	 thanks
[10:39:40] <jynus>	 the things to check are "Last_IO_Error:" and "Last_SQL_Error:"
[10:39:45] <jynus>	 but there are none
[10:39:55] <jynus>	 which means it was stopped "cleanly"
[10:40:04] <jynus>	 by whom or what is yet to be determined
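
The check jynus describes can be sketched in Python as below. This is only an illustration, not the tooling used in the channel: the `SAMPLE` text is made-up `SHOW SLAVE STATUS` vertical output, and `parse_slave_status` and `diagnose` are hypothetical helper names. The field names `Slave_IO_Running`, `Slave_SQL_Running`, `Last_IO_Error` and `Last_SQL_Error` are the ones MariaDB reports.

```python
import re

# Hypothetical sample of vertical SHOW SLAVE STATUS output (not from db2125).
SAMPLE = """
*************************** 1. row ***************************
             Slave_IO_Running: No
            Slave_SQL_Running: No
                Last_IO_Error:
               Last_SQL_Error:
"""


def parse_slave_status(text):
    """Turn 'Key: value' lines into a dict; other lines are ignored."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(\w+):\s?(.*)$", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def diagnose(fields):
    """Classify replication as running, stopped with an error, or stopped cleanly."""
    running = (
        fields.get("Slave_IO_Running") == "Yes"
        and fields.get("Slave_SQL_Running") == "Yes"
    )
    if running:
        return "running"
    # Keep only the error fields that actually contain a message.
    errors = {
        key: fields.get(key, "")
        for key in ("Last_IO_Error", "Last_SQL_Error")
        if fields.get(key, "")
    }
    if errors:
        details = "; ".join(f"{key}={value}" for key, value in errors.items())
        return "stopped with error: " + details
    return "stopped cleanly (no error recorded)"


print(diagnose(parse_slave_status(SAMPLE)))
```

An empty pair of error fields on a stopped replica only says that replication did not record a failure of its own; as the rest of the conversation shows, a server restart produces the same picture.
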
[10:40:12] <jynus>	 let me check logs
[10:41:03] <jynus>	 kormat: what time was this? I believe it could have crashed again
[10:41:18] <jynus>	 and you didn't notice it because it was quickly restarted
[10:41:46] <jynus>	 start #1: Aug 19 08:31:45 db2125 systemd[1]: Starting mariadb database server
[10:41:58] <kormat>	 that's when i started it
[10:42:06] <kormat>	 replication stopped around 08:41
[10:42:11] <jynus>	 crash: Aug 19 08:40:26 db2125 systemd[1]: mariadb.service: Main process exited, code=killed, status=6/ABRT
[10:42:23] <jynus>	 start #2: Aug 19 08:40:31 db2125 systemd[1]: Starting mariadb database server...
[10:42:38] <kormat>	 well shit
[10:42:57] <jynus>	 I am guessing it wasn't you who "crashed" it :-D
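
A rough Python sketch of the timeline reconstruction above: it scans systemd journal lines for mariadb starts and abnormal exits, then reports how long the server ran before the crash and how quickly it was restarted. The three input lines are the ones quoted in the conversation. The helper names `parse_events` and `crash_restarts`, the line regex, and the year (syslog-style timestamps omit it) are assumptions made for illustration.

```python
import re
from datetime import datetime

# Journal lines as quoted in the channel.
JOURNAL = [
    "Aug 19 08:31:45 db2125 systemd[1]: Starting mariadb database server",
    "Aug 19 08:40:26 db2125 systemd[1]: mariadb.service: Main process exited, code=killed, status=6/ABRT",
    "Aug 19 08:40:31 db2125 systemd[1]: Starting mariadb database server...",
]

# Assumed layout: "<Mon> <day> <HH:MM:SS> <host> systemd[<pid>]: <message>"
LINE_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ systemd\[\d+\]: (.*)$")


def parse_events(lines, year=2020):
    """Return (timestamp, kind) tuples, where kind is 'start' or 'crash'."""
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        # The year is an assumption; the journal excerpt does not include it.
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        message = match.group(2)
        if "Main process exited" in message:
            events.append((stamp, "crash"))
        elif message.startswith("Starting mariadb"):
            events.append((stamp, "start"))
    return events


def crash_restarts(events):
    """Pair each crash with the preceding start and the following restart."""
    results = []
    last_start = None
    pending_crash = None
    for stamp, kind in events:
        if kind == "start":
            if pending_crash is not None:
                delay = (stamp - pending_crash).total_seconds()
                results[-1]["restart_delay_s"] = delay
                pending_crash = None
            last_start = stamp
        else:  # crash
            uptime = (stamp - last_start).total_seconds() if last_start else None
            results.append(
                {"crashed_at": stamp, "uptime_s": uptime, "restart_delay_s": None}
            )
            pending_crash = stamp
    return results


for item in crash_restarts(parse_events(JOURNAL)):
    print(item["crashed_at"], item["uptime_s"], item["restart_delay_s"])
```

A restart delay of a few seconds is short enough that someone watching only the replication graph would see replication stop without noticing that the server itself had gone down and come back.
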
[10:43:14] <jynus>	 I will let you handle it
[10:43:23] <kormat>	 thanks for spotting that :)
[10:43:24] <jynus>	 my suggestion is to recover from backup
[10:43:35] <jynus>	 just to be 100% sure it is not a data issue
[10:43:46] <jynus>	 then restart, wait for crash
[10:44:00] <jynus>	 also check hw logs, I believe this crashed last time due to hw issues
[10:44:04] <jynus>	 check mgmt logs
[10:44:14] <kormat>	 nothing in the idrac logs for this
[10:46:28] <jynus>	 no new cpu reset like last time?
[10:46:54] <kormat>	 correct
[10:47:58] <jynus>	 one thing we have found out is that restarting the server sometimes uncovers hidden hw issues
[10:48:02] <jynus>	 consider also doing that
[10:48:35] <jynus>	 downtime the server for an extended period of time like I did with backup2001
[10:53:42] <kormat>	 it's downtimed until the 25th
[10:54:05] <jynus>	 cool then
[10:54:13] <jynus>	 let me know if you need help for recovery
[10:54:46] <kormat>	 cheers
[10:54:53] <kormat>	 i'll reboot it now, and start recovery after lunch
[10:55:07] <jynus>	 yeah, the server is not going anywhere :-D
[10:55:21] <jynus>	 especially in that state
[11:09:12] <jynus>	 when you come back, check also whether mysql has been recently updated, to rule out mariadb version causes
[12:00:07] <kormat>	 yes actually. 10.4.12-1 -> 10.4.13-1
[12:00:30] <kormat>	 that was yesterday
[12:04:30] <wikibugs>	 DBA, Operations, ops-codfw, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) I started mariadb, and started replication. Mariadb crashed after about 10 minutes. There's nothing in the idrac logs, so it could well be unrelated to the...
[13:09:07] <jynus>	 BTW, I didn't have a deep look, but the log looked strangely similar to the labsdb crashes
[13:22:37] <kormat>	 hmm, right. `[ERROR] InnoDB: Unable to find a record to delete-mark`
[13:30:32] <wikibugs>	 DBA, Operations, ops-codfw, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) Here's the log from the crash: {P12301}  This looks similar to {T249188}
[16:34:09] <wikibugs>	 DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (RobH)
[16:34:16] <wikibugs>	 DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (RobH)
[16:43:57] <wikibugs>	 DBA, DC-Ops, Operations, ops-codfw: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (RobH)
[16:44:21] <wikibugs>	 DBA, DC-Ops, Operations, ops-codfw: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (RobH)
[16:44:31] <wikibugs>	 DBA, DC-Ops, Operations, ops-codfw: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (RobH) a:Papaul
[17:58:45] <wikibugs>	 DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (wiki_willy) a:Cmjohnson→Jclark-ctr
[18:02:56] <wikibugs>	 DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (wiki_willy) a:Jclark-ctr
[18:40:51] <wikibugs>	 DBA, Operations, Parsoid, serviceops, and 2 others: update mysql GRANTs for testreduce - https://phabricator.wikimedia.org/T260627 (Dzahn) Open→Resolved >>! In T260627#6392112, @Kormat wrote: > Hi, i've created the new grants. Please test and let me know if there are any issues. Cheers....