[05:02:22] DBA, Operations, ops-eqiad: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui) So, the data is consistent on the main tables `archive logging page revision text user change_tag actor ipblocks comment`. Going to start repooling this host.
[05:12:15] DBA: Decommission db2042 - https://phabricator.wikimedia.org/T225090 (Marostegui)
[05:30:09] DBA, Patch-For-Review: Decommission db2042 - https://phabricator.wikimedia.org/T225090 (Marostegui)
[05:33:50] DBA: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533 (Marostegui)
[05:35:18] DBA: BBU issues on codfw - https://phabricator.wikimedia.org/T214264 (Marostegui)
[05:36:22] DBA: BBU issues on codfw - https://phabricator.wikimedia.org/T214264 (Marostegui) Open→Resolved a: Marostegui All these hosts have been handed over to DC Ops for decommissioning. Closing this.
[06:00:30] all backups and recoveries worked as expected
[06:00:38] yay!!
[06:00:51] I've paused labsdb compression due to bstorm's comments
[06:00:59] will restart it later
[06:01:05] cool
[06:01:16] I haven't paused labsdb1012
[06:01:30] as it is not used after the initial monthly run (I confirmed with luca)
[06:01:44] es backups took around 13-14 hours
[06:01:58] oh, less than expected
[06:02:25] I remembered it being 12 or 24 hours before
[06:02:30] I guess it was 12
[06:16:31] DBA, Operations, ops-eqiad, Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui) db1091 is fully repooled. I will remove db1135 from s4 after the SRE summit
[07:27:20] DBA, Operations: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (Marostegui) All hosts in codfw are now running 10.1.39, so we are ready for the failover from that front.
[08:18:22] there is lag on m1
[08:19:22] there was a spike of writes apparently
[08:21:08] https://tendril.wikimedia.org/report/slow_queries?host=%5Edb1063&user=&schema=&qmode=eq&query=&hours=1
[08:22:02] strange that it caused lag
[08:22:11] that is a relatively common operation, I would say
[10:18:10] after the compression of labs, the next on the list would be db1111
[10:19:10] we can probably delete stuff from it anyway
[10:19:13] I can ping the people that use it
[10:20:31] so apparently it is because of wikidata in addition to commons
[10:20:50] yeah
[10:20:56] we added wb_terms a few months ago
[10:21:00] for all the normalization testing
[10:21:04] let me ping the guys
[10:21:15] ok, no big deal then
[10:21:22] yep
[14:24:53] jynus and marostegui: would today be a good day for me to run some wiki replica view updates? We are trying to create sub-views of comment and actor :)
[14:25:11] bstorm_: labsdb1012 can be done anytime
[14:25:29] Fair :) They probably won't use them, but you never know.
[14:25:44] They use it at the start of the month, that's what I got from luca
[14:25:51] Is compression underway on the others?
[14:26:00] bstorm_: I paused maintenance this morning after seeing your comments
[14:26:01] I am compressing stuff on 1012, but it shouldn't be a problem for you
[14:26:07] Great! Thanks!
[14:26:13] bstorm_: try 1012 and if it is an issue I will stop it
[14:26:19] Will do
[14:26:34] Then I'll get the rest done by the end of the day around the meeting slalom.
[14:55:55] Done on 1012. There was no issue
[15:00:31] great
[15:01:19] please ping on the ticket when you are done with the others (no rush, take as much time as you need, but let me know when done)
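The labsdb table compression discussed above is typically done per table with InnoDB's compressed row format. A minimal sketch of what such a statement looks like on MariaDB 10.1 follows; the table name and KEY_BLOCK_SIZE are illustrative assumptions, not the actual maintenance commands used on labsdb1012.

```sql
-- Minimal sketch of per-table InnoDB compression on MariaDB 10.1; the table
-- name and KEY_BLOCK_SIZE are illustrative assumptions, not the production
-- maintenance script. On 10.1 this typically requires innodb_file_per_table=ON
-- and innodb_file_format=Barracuda.
ALTER TABLE wb_terms ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

-- Verify the row format afterwards:
SELECT table_name, row_format
FROM information_schema.tables
WHERE table_schema = DATABASE() AND table_name = 'wb_terms';
```

An ALTER like this rebuilds the whole table, which is presumably why the compression runs are coordinated with (and paused around) the wiki replica view updates mentioned above.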
[15:18:30] DBA, Operations, ops-eqiad: db1062 (s7 db primary master) disk with predictive failure - https://phabricator.wikimedia.org/T224805 (Cmjohnson) The server is out of warranty and we will need to order more 600GB disks.
[15:49:01] DBA, Operations, ops-eqiad: db1062 (s7 db primary master) disk with predictive failure - https://phabricator.wikimedia.org/T224805 (jcrespo) I would suggest taking one out of the less important services and replacing it here; I will see with @Marostegui where to take it from.
[15:57:18] DBA, Operations, ops-eqiad: db1062 (s7 db primary master) disk with predictive failure - https://phabricator.wikimedia.org/T224805 (jcrespo) The failure is predictive, it should hold for some time. I suggest waiting for the db1068 switch T224852, and once that is resolved use one of its good disks for...
[16:58:31] DBA, Operations, ops-eqiad: db1062 (s7 db primary master) disk with predictive failure - https://phabricator.wikimedia.org/T224805 (Marostegui) Yeah, let's use a used disk to replace this one. And we can schedule the s7 failover after s4. The new server is ready in s7 as well. I scheduled s4 first caus...
[16:59:19] DBA, Operations, ops-eqiad: db1062 (s7 db primary master) disk with predictive failure - https://phabricator.wikimedia.org/T224805 (Marostegui) I think we should wait till the disk has fully failed
[17:01:42] DBA, Operations: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (jcrespo)
[17:01:46] DBA, Operations, ops-eqiad: db1062 (s7 db primary master) disk with predictive failure - https://phabricator.wikimedia.org/T224805 (jcrespo) Open→Stalled p: High→Normal
[17:10:46] https://mariadb.com/resources/blog/innodb-quality-improvements-in-mariadb-server/
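Relating to the m1 lag noted at 08:18 above, a minimal sketch of how replica lag can be inspected by hand from a replica; the heartbeat table layout is an assumption based on a pt-heartbeat-style setup, not the exact production schema.

```sql
-- Hedged sketch of checking replication lag on a replica. SHOW SLAVE STATUS is
-- standard (see the Seconds_Behind_Master field in its output); the
-- heartbeat.heartbeat table and its ts column are assumptions based on a
-- pt-heartbeat-style setup, not the exact production definitions.
SHOW SLAVE STATUS;

-- Approximate lag from a heartbeat row written on the master (hypothetical table):
SELECT TIMESTAMPDIFF(SECOND, MAX(ts), UTC_TIMESTAMP()) AS lag_seconds
FROM heartbeat.heartbeat;
```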