[06:28:09] DBA, Operations, ops-codfw: Degraded RAID on db2039 - https://phabricator.wikimedia.org/T186533#3948243 (Marostegui) a:Papaul Hi, Rebuild failed, do you happen to have another disk that we can try? ``` logicaldrive 1 (3.3 TB, RAID 1+0, Interim Recovery Mode) physicaldrive 1I:1:1 (p...
[06:28:41] DBA, Operations, ops-codfw: Degraded RAID on db2039 - https://phabricator.wikimedia.org/T186533#3948248 (Marostegui)
[06:35:01] Blocked-on-schema-change, DBA, Data-Services, Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3948262 (Marostegui)
[06:35:39] Blocked-on-schema-change, DBA, Data-Services, Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3877412 (Marostegui) The last host, s8 master finished the alter table. I am going to start sanitizing sanitarium and labsdb hosts.
[06:41:42] DBA, Data-Services: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#3948267 (Marostegui)
[06:42:04] DBA, Data-Services: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#3948277 (Marostegui) p:Triage>Normal
[06:51:13] DBA, Data-Services: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#3948292 (Marostegui) I have fixed the missing row there. Replication is now flowing finely on s4. Time will dictate if we actually need to rebuild this host if more errors show up after the crash. We will see...
[09:35:28] DBA, Patch-For-Review: Prepare and indicate proper master db failover candidates for all database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#3948405 (jcrespo)
[09:47:12] DBA, Data-Services: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579#3948267 (jcrespo) > Time will dictate if we actually need to rebuild this host if more errors show up after the crash. No, that is not how things work- if a database crashes and we do not have gtid enabled, we have to re...
[12:32:50] DBA, Operations, ops-eqiad: dbstore1001 crashed - https://phabricator.wikimedia.org/T186596#3948771 (Marostegui)
[12:35:31] DBA, Operations, ops-eqiad: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948773 (Marostegui)
[12:48:54] DBA, Operations, ops-eqiad: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948781 (Marostegui) ``` Begin: Mounting root file system ... Begin: Running /scripts/loc[ 9.644741] device-mapper: uevent: version 1.0.3 al-top .....
[12:49:51] DBA, Operations, ops-eqiad: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948782 (Marostegui) ``` [ 171.926534] XFS (dm-0): Mounting V4 Filesystem [ 172.507461] XFS (dm-0): failed to locate log tail [ 172.507464] XFS (dm...
[12:50:33] DBA, Operations, ops-eqiad: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948783 (Marostegui) The LD can be seen finely though - but might be corrupted ``` root@dbstore1001:~# megacli -LDInfo -L0 -a0 Adapter 0 -- Virtual...
[13:09:29] DBA, Operations, ops-eqiad: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948795 (Marostegui) This is what xfs_repair (dry run) shows: ``` root@dbstore1001:~# xfs_repair -n -v /dev/mapper/tank-data Phase 1 - find and verify...
[13:22:18] DBA, Operations, ops-eqiad: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948825 (Marostegui) As read_only the FS can be mounted (only if the recovery is skipped): ``` root@dbstore1001:~# mount -o ro -o norecovery -n /dev/m...
[13:27:36] DBA, Operations, ops-eqiad: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948831 (Marostegui) This is what triggered the crash: ``` Feb 6 12:06:54 dbstore1001 kernel: [10982464.366365] megaraid_sas 0000:03:00.0: Found FW i...
[13:27:53] DBA, Operations, ops-eqiad: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948832 (Marostegui)
[13:46:27] DBA, Operations, ops-eqiad: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948856 (Marostegui) xfs_repair was run, and /srv can be mounted. Some manual writes were good. I have started MySQL to see how it goes with the recov...
[13:55:52] DBA, Operations, ops-eqiad: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948857 (Marostegui) MySQL won't start: ``` InnoDB: Database page corruption on disk or a failed InnoDB: file read of page 23926. InnoDB: You may have...
[14:23:11] Blocked-on-schema-change, DBA, Data-Services, Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3948911 (Anomie) >>! In T174569#3948262, @Marostegui wrote: > The last host, s8 master finished the alter table. \o/
[14:26:04] Blocked-on-schema-change, DBA, Data-Services, Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3948915 (Marostegui) >>! In T174569#3948911, @Anomie wrote: >>>! In T174569#3948262, @Marostegui wrote: >> The last host, s8 maste...
[14:57:32] Blocked-on-schema-change, DBA, Data-Services, Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3949012 (Anomie) >>! In T174569#3948915, @Marostegui wrote: > I am still placing all the triggers on sanitariums, I will probably...
[14:58:19] Blocked-on-schema-change, DBA, Data-Services, Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3949016 (Marostegui) >>! In T174569#3949012, @Anomie wrote: >>>! In T174569#3948915, @Marostegui wrote: >> I am still placing all...
[14:58:27] DBA, Operations, ops-eqiad: db1051 database host BBU issues - https://phabricator.wikimedia.org/T186049#3949018 (Marostegui) @Cmjohnson this server is now off. Feel free to power it on once you've done the replacement Thanks!
[15:58:10] DBA, Operations, ops-codfw: Degraded RAID on db2039 - https://phabricator.wikimedia.org/T186533#3949343 (Papaul) Trying another one.
[16:04:41] DBA, Operations, ops-codfw: Degraded RAID on db2039 - https://phabricator.wikimedia.org/T186533#3949355 (Papaul) a:Papaul>Marostegui Another disk is in place.
[17:20:09] DBA, Operations, ops-eqiad: db1051 database host BBU issues - https://phabricator.wikimedia.org/T186049#3949637 (Marostegui) BBU is now charging Thanks Chris!
[17:20:38] DBA, Operations, ops-codfw: Degraded RAID on db2039 - https://phabricator.wikimedia.org/T186533#3949642 (Marostegui) Thanks @Papaul - let's hope it goes fine this time! ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 62% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB,...
[17:32:30] DBA, MediaWiki-extensions-Linter, Patch-For-Review: Display count of remaining content space errors - https://phabricator.wikimedia.org/T173943#3949688 (Legoktm) Thanks, we'll make sure that the patch uses the `estimateRowCount` method so the queries aren't slow.
[18:57:43] DBA, Operations, ops-eqiad, Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3950016 (Cmjohnson) racked in A6 wmf7316
[20:00:25] DBA, Data-Services, Tool-Global-user-contributions, Toolforge, cloud-services-team (Kanban): Database error: Unable to connect to s7.web.db.svc.eqiad.wmflabs - https://phabricator.wikimedia.org/T182916#3950242 (Krinkle) @jcrespo I have indeed found some potential db handles not being closed....
[21:33:31] DBA, Data-Services, Tool-Global-user-contributions, Toolforge, cloud-services-team (Kanban): Database error: Unable to connect to s7.web.db.svc.eqiad.wmflabs - https://phabricator.wikimedia.org/T182916#3950514 (jcrespo) No, I do not expect you to close the connection after every query- but I...