[05:20:02] DBA: labsdb1009:s4 replication broken duplicate entry on commonswiki.wbc_entity_usage - https://phabricator.wikimedia.org/T225390 (Marostegui)
[05:32:44] DBA: labsdb1009:s4 replication broken duplicate entry on commonswiki.wbc_entity_usage - https://phabricator.wikimedia.org/T225390 (Marostegui) This is strange or it is too early for me: This is supposed to be the already existing row: ` mysql:root@localhost [commonswiki]> select * from wbc_entity_usage where...
[05:38:32] DBA, Operations, ops-eqiad: db1077 crashed - https://phabricator.wikimedia.org/T225391 (Marostegui)
[05:39:15] DBA, Operations, ops-eqiad: db1077 crashed - https://phabricator.wikimedia.org/T225391 (Marostegui) This is s3's sanitarium master, so for now s3 on labs will be lagging until we fix this host
[05:41:24] DBA, Operations, ops-eqiad: db1077 crashed - https://phabricator.wikimedia.org/T225391 (Marostegui) p:Triage→High @Cmjohnson looks like we have to first upgrade all the firmware: https://support.hpe.com/hpsc/doc/public/display?docId=mmr_kc-0134828
[05:44:31] DBA, Operations, ops-eqiad, Patch-For-Review: db1077 crashed - https://phabricator.wikimedia.org/T225391 (Marostegui) @Cmjohnson I will leave MySQL down so you can upgrade this host's firmware as soon as you can without waiting for us to stop MySQL
[07:42:14] DBA: labsdb1009:s4 replication broken duplicate entry on commonswiki.wbc_entity_usage - https://phabricator.wikimedia.org/T225390 (Marostegui) Replication is now flowing on labsdb1009:s4
[07:45:15] DBA: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 (Marostegui) labsdb1012 finished the compression on all its tables ` root@labsdb1012:~# df -hT /srv Filesystem Type Size Used Avail Use% Mounted on /dev/mapper/tank-data xfs 14T 5.8T 8.2T 42% /s...
[07:45:33] DBA: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 (Marostegui)
[08:36:49] DBA: labsdb1009:s4 replication broken duplicate entry on commonswiki.wbc_entity_usage - https://phabricator.wikimedia.org/T225390 (Marostegui) s4 still catching up
[11:18:31] any idea why https://tools.wmflabs.org/replag/ is so high today?
[15:07:56] several maintenance tasks going on there (1010), so 1009 and 1011 have a bit more load, and 1009 broke for s4 early today
[18:48:35] DBA, Operations, ops-eqiad: db1077 crashed - https://phabricator.wikimedia.org/T225391 (Cmjohnson) a:Cmjohnson→Marostegui I updated with the service pack and powered on...reassigning to @Marostegui
[18:53:37] DBA, Operations, ops-eqiad: db1062 (s7 db primary master) disk with predictive failure - https://phabricator.wikimedia.org/T224805 (Cmjohnson) Stalled→Declined declining this for now since it's out of warranty and the disk has not failed
[18:53:39] DBA, Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Cmjohnson)
[19:07:02] DBA, Operations, ops-eqiad: db1077 crashed - https://phabricator.wikimedia.org/T225391 (Marostegui) Thanks @Cmjohnson - I can see that in the logs: ` /system1/log1/record15 Targets Properties number=15 severity=Informational date=06/10/2019 time=16:34 description=Firmware fla...
[19:08:49] DBA, Operations, ops-eqiad: db1077 crashed - https://phabricator.wikimedia.org/T225391 (Marostegui) @Cmjohnson can you also check one of the power supply cables? It might be loose: ` /system1/log1/record17 Targets Properties number=17 severity=Caution date=06/10/2019 time=17:16...
[19:12:56] DBA: labsdb1009:s4 replication broken duplicate entry on commonswiki.wbc_entity_usage - https://phabricator.wikimedia.org/T225390 (Marostegui) Open→Resolved a:Marostegui Replication has been working for around 12 hours now, so I am going to close this as resolved for now.
[19:42:43] DBA, Operations, ops-eqiad: db1077 crashed - https://phabricator.wikimedia.org/T225391 (Marostegui) MySQL started correctly, I have upgraded it and started replication as everything looked fine. Once it is up to date, I will run some data checks.