[05:45:57] jynus: ok if I lead the switchover?
[05:47:18] sure
[05:47:28] good
[06:09:51] DBA, Cognate, ContentTranslation, Growth-Team, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Marostegui) This was done. Read only start: 06:00:36 UTC Read only stop: 06:01:56 UTC Total read only time: 01:20 min
[06:21:53] DBA, Operations, ops-codfw: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Marostegui)
[06:26:03] DBA, Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[06:28:49] DBA, Operations: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[06:29:16] DBA, Operations: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[07:07:48] DBA, MediaWiki-extensions-OATHAuth, Schema-change: Drop old oathauth_users columns - https://phabricator.wikimedia.org/T226826 (Marostegui)
[07:11:52] DBA, MediaWiki-extensions-OATHAuth, Schema-change: Drop old oathauth_users columns - https://phabricator.wikimedia.org/T226826 (Marostegui)
[07:14:41] DBA, Operations, ops-codfw, Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Marostegui)
[08:15:20] DBA: Failover s8 (wikidatawiki) db primary master db1071 to db1104 (read-only required) - https://phabricator.wikimedia.org/T227062 (Marostegui) All codfw is now running 10.1.39 (which is the version the new master will run) - will keep upgrading eqiad now.
[08:34:17] DBA, Operations, ops-codfw, Goal, Patch-For-Review: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (Marostegui) @RobH @Papaul I have merged: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/520379/ The only changes pending from your side to be abl...
[08:47:40] DBA: Failover s8 (wikidatawiki) db primary master db1071 to db1104 (read-only required) - https://phabricator.wikimedia.org/T227062 (Marostegui)
[08:47:45] DBA, Goal: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 (Marostegui)
[08:47:53] DBA, OTRS, Operations, Operations-Software-Development, and 2 others: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (Marostegui)
[09:56:28] marostegui, jynus: any objections to rebooting the dbmonitor hosts now?
[09:56:42] let me double check them
[09:56:51] oh, dbmonitor, yes
[09:56:56] fine by me
[09:57:04] I was thinking of cumin
[09:57:19] go ahead at any time
[09:58:23] ack, doing that in a few minutes
[10:00:46] done, tendril is back up
[10:00:52] thanks
[10:07:43] DBA, MediaWiki-extensions-OATHAuth, Schema-change: Drop old oathauth_users columns - https://phabricator.wikimedia.org/T226826 (Marostegui)
[12:51:10] DBA, Goal: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 (Marostegui)
[12:51:12] DBA, Operations: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[12:51:18] DBA, Cognate, ContentTranslation, Growth-Team, and 9 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (Marostegui) Open→Resolved
[12:58:41] DBA, MediaWiki-extensions-OATHAuth, Schema-change: Drop old oathauth_users columns - https://phabricator.wikimedia.org/T226826 (Marostegui) centralauth progress [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore1003 [x] db2120 [x] db2118 [x] db2100 [x] db2095 [x] db2087 [x] db2086 [x...
[13:06:15] DBA, MediaWiki-extensions-OATHAuth, Schema-change: Drop old oathauth_users columns - https://phabricator.wikimedia.org/T226826 (Marostegui) All done
[13:06:26] DBA, MediaWiki-extensions-OATHAuth, Schema-change: Drop old oathauth_users columns - https://phabricator.wikimedia.org/T226826 (Marostegui) Open→Resolved
[13:11:57] EDAC syslog errors on db2097
[13:13:11] Where is that? I don't see it on icinga
[13:13:23] DBA, Operations, ops-codfw: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (jcrespo) ` [6498437.928368] mce: [Hardware Error]: Machine check events logged [6498437.928393] EDAC skx MC1: HANDLING MCE MEMORY ERROR [64...
[13:13:29] it is a warning, see ^
[13:13:59] Ah yes, look at T225378#5245612
[13:14:00] T225378: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378
[13:14:05] Look at the last one
[13:22:40] I am going to at least restart it
[13:24:52] if you upgrade it, that'd be great
[13:25:02] and would unblock my s2 codfw failover
[13:25:06] I was planning to do someday
[13:25:33] Ah, no, never mind, it wasn't db2097 the one blocking my but db2098 :)
[13:25:41] ignore me!
[13:33:50] DBA, Operations, ops-codfw: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (jcrespo) ` 462 - Uncorrectable Memory Error Threshold Exceeded (Processor 1, DIMM 3). The DIMM is mapped out and is currently not availabl...
[13:44:45] DBA, Operations, ops-codfw: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (jcrespo) a:jcrespo→Papaul a memory stick of db2097 is literally broken: ` root@db2097:~$ free -m total used...