[05:02:35] Going to start with pre-failover steps
[05:15:24] DBA, Patch-For-Review: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui)
[05:29:19] DBA, Patch-For-Review: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui)
[05:36:16] DBA, Patch-For-Review: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui)
[05:50:16] o/
[05:50:23] o/
[05:50:34] isn't it nice to have the whole morning for you?
[05:50:36] you are welcome
[05:52:43] * kormat grumbles
[05:59:34] DBA, Patch-For-Review: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui)
[06:04:30] DBA, Patch-For-Review: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui)
[06:07:33] DBA, Patch-For-Review: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui)
[06:09:13] DBA, Patch-For-Review: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui)
[06:11:22] kormat: I remember that db-switchover was modified to include set session binlog_format=ROW; so the master could be updated on zarcillo? But not sure whether it was released or not
[06:11:34] It failed this time, I just manually changed it on zarcillo
[06:11:41] marostegui: that has not been released yet. i'm slowly working on a release currently
[06:11:49] ah ok, that explains it thanks
[06:11:53] so it failed successfully in your case
[06:13:08] DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui)
[06:13:59] I think the only improvement would be on avoiding read errors on topology changes: https://logstash.wikimedia.org/goto/bc911e1c5fff6a211897ba54c54bb024
[06:14:18] which is strange, given there is a pause between each server
[06:14:52] maybe the pause should be 2*timeout?
[06:15:17] jynus: yeah, but I think the only way to improve that is depooling the hosts I think
[06:15:25] fair
[06:15:30] DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui) This was done: RO starts: 06:01:05 RO stops: 06:02:17 Total: read-only time: 1:12 minutes
[06:15:49] it can be done automatically
[06:15:53] it can probably be done, but then we'd need a bit more time to do all the changes, (depool, wait a bit, pool, wait a bit...)
[06:16:16] DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui)
[06:16:34] I mean, I am thinking 2 seconds pause, it would only take like 7 extra seconds
[06:16:35] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[06:16:37] DBA, SRE: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui)
[06:16:40] DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui) Open→Resolved Thanks everyone for the support!
[06:16:59] DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui)
[06:16:59] and depools could be automatic
[06:17:25] yeah
[06:17:38] (just an idea, need work and thought)
[06:17:39] depending on the wiki though (s1, s4, s8) the repools might need to be done in small batches
[06:17:41] to avoid huge spikes
[06:17:50] yeah, just some change in that way
[06:18:06] maybe pauses every 1/3 of hosts
[06:18:09] yeah
[06:18:56] (longer ones)
[07:20:35] DBA, Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1181.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/2021032307...
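(Annotation on the 06:15-06:18 discussion above.) The idea floated there, repooling hosts in small batches with a pause after roughly every third of them, could look something like the sketch below. This is only an illustration under assumptions: `plan_repool`, its parameters, and the host names are hypothetical and not taken from db-switchover or any other tool mentioned in the log, and the function only builds a plan; it does not depool or repool anything.

```python
from math import ceil


def plan_repool(hosts, batches=3, pause_s=2.0):
    """Return an ordered plan of ("repool", [hosts]) and ("pause", seconds) steps.

    Hypothetical helper: splits `hosts` into at most `batches` groups of
    roughly equal size and places a pause between consecutive groups, so
    traffic returns in steps rather than in one spike.
    """
    if batches < 1:
        raise ValueError("batches must be >= 1")
    if not hosts:
        return []
    size = ceil(len(hosts) / batches)  # hosts per group, rounded up
    groups = [hosts[i:i + size] for i in range(0, len(hosts), size)]
    plan = []
    for idx, group in enumerate(groups):
        plan.append(("repool", group))
        if idx < len(groups) - 1:  # no pause needed after the last group
            plan.append(("pause", pause_s))
    return plan


if __name__ == "__main__":
    # Placeholder host names, purely for illustration.
    replicas = ["replica-a", "replica-b", "replica-c", "replica-d",
                "replica-e", "replica-f", "replica-g"]
    for step, arg in plan_repool(replicas, batches=3, pause_s=2.0):
        print(step, arg)
```

The pause length and batch count are left as parameters because the log leaves both open ("2 seconds", "2*timeout", "longer ones"), and larger sections such as s1, s4 and s8 were said to need more care.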
[07:41:36] DBA, Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1181.eqiad.wmnet'] ` and were **ALL** successful.
[08:04:13] s7 is now in orchestrator
[08:04:30] DBA, Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (Marostegui) s7 heartbeat cleaned
[08:04:39] DBA, Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (Marostegui)
[08:07:46] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui) With the crash that happened on labsdb1009 a couple of weeks ago, report_host is now enabled there. It doesn't really matter as we've excluded labsdb* hosts fr...
[08:08:15] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[09:06:56] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui) [] labsdb1011.eqiad.wmnet:3306 (not needed) [] labsdb1010.eqiad.wmnet:3306 (not needed) [x] labsdb1009.eqiad.wmnet:3306 (not needed) [x] dbstore1003.eqiad.wm...
[09:07:16] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[09:12:02] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui)
[09:13:09] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui)
[09:13:14] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[09:13:16] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[09:13:29] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui) p:Triage→Medium
[09:13:58] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui)
[09:14:00] Blocked-on-schema-change, DBA: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 (Marostegui)
[09:15:42] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui) Let's wait for {T276150} to be completed on s1, so we can promote a new master with the new schema.
[09:19:44] DBA: Switchover s1 from db1083 to db1163 - https://phabricator.wikimedia.org/T278214 (Marostegui)
[09:23:23] DBA: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336 (Marostegui)
[09:44:22] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1165 pooled with minimal weight for now, once it looks good, I will start the automatic pooling
[10:21:46] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[11:21:11] DBA, SRE, ops-eqiad: Degraded RAID on db1086 - https://phabricator.wikimedia.org/T278226 (Marostegui) a:wiki_willy @wiki_willy this host will be decommissioned in a few weeks, but I would like this disk to be replaced (it is out of warranty) with some used ones if we still have. This host is a st...
[11:21:27] DBA, SRE, ops-eqiad: Degraded RAID on db1086 - https://phabricator.wikimedia.org/T278226 (Marostegui) p:Triage→Medium
[12:08:47] DBA, Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (Marostegui) db1181, replacement for db1086 (candidate master for s7) is now replicating. Let's see how it goes in the next few days.
[12:17:44] DBA, decommission-hardware: decommission db1086.eqiad.wmnet - https://phabricator.wikimedia.org/T278229 (Marostegui)
[12:18:33] DBA, decommission-hardware: decommission db1086.eqiad.wmnet - https://phabricator.wikimedia.org/T278229 (Marostegui) This is s7 candidate master. Its replacement is db1181, but let's wait for a week to make sure db1181 performs well before decommissioning db1086.
[12:19:04] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[12:19:32] DBA, Patch-For-Review: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 (Marostegui)
[12:19:34] DBA, decommission-hardware: decommission db1086.eqiad.wmnet - https://phabricator.wikimedia.org/T278229 (Marostegui)
[12:19:36] DBA, decommission-hardware: decommission db1086.eqiad.wmnet - https://phabricator.wikimedia.org/T278229 (Marostegui)
[12:19:39] DBA, SRE, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[14:01:12] DBA, SRE, ops-eqiad: Degraded RAID on db1086 - https://phabricator.wikimedia.org/T278226 (wiki_willy) a:wiki_willy→Cmjohnson Hi @Cmjohnson - since this host is out of warranty, can you grab a drive from a decom'd server for this one? Thanks, Willy
[14:31:24] DBA: Add *_direct_from to imagelinks and templatelinks - https://phabricator.wikimedia.org/T278236 (BrandonXLF)
[14:32:24] DBA, Patch-For-Review: Add *_direct_link to imagelinks and templatelinks - https://phabricator.wikimedia.org/T278236 (BrandonXLF)
[15:11:05] DBA, Patch-For-Review: Add *_direct_link to imagelinks and templatelinks - https://phabricator.wikimedia.org/T278236 (DannyS712) I've left a -2 on the patch until this change is agreed to by DBAs
[16:41:57] DBA, SRE, ops-eqiad: Degraded RAID on db1086 - https://phabricator.wikimedia.org/T278226 (Cmjohnson) Open→Resolved Disk replaced with a disk from decom'd db host
[23:46:56] DBA, Performance-Team (Radar): Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 - https://phabricator.wikimedia.org/T278274 (Krinkle) Som cherry-picked entries from Server Admin Log ([Wikitech](https://wikitech.wikimedia.org/w/index.php?title=Server_Admin_Log&oldid=1904929#2021-...