[00:57:22] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Identify tools hosting databases on labsdb100[13] and notify maintainers - https://phabricator.wikimedia.org/T175096#3582626 (10TheDJ) I checked the #tool-erwin_s-tools . The affected data seems to be used by 4 of its tools and is basically all related...
[04:48:58] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: db1063 crashed - https://phabricator.wikimedia.org/T180714#3767261 (10Marostegui) reverts for logging and recentchanges on dewiki and wikidatawiki are finished for most of the hosts (still running for db1071 as its hardware isn't as powerful)....
[04:52:01] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: db1063 crashed - https://phabricator.wikimedia.org/T180714#3768990 (10Marostegui) archive done doing ipblocks now
[04:56:43] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: db1063 crashed - https://phabricator.wikimedia.org/T180714#3768994 (10Marostegui) ipblocks done filearchive done oldimage done protected_titles done servers are now catching up
[04:59:17] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: db1063 crashed - https://phabricator.wikimedia.org/T180714#3768998 (10Marostegui) Reminder: we have to enable GTID on the slaves.
[05:09:47] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: db1063 crashed - https://phabricator.wikimedia.org/T180714#3769005 (10Marostegui) All the hosts (apart from db1071) are now up to date and ready to be pooled. I have done: https://gerrit.wikimedia.org/r/#/c/391995/2 but I will wait for another...
[06:31:23] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769028 (10Marostegui)
[06:46:06] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3769046 (10Marostegui)
[06:59:09] jynus: for whenever you are around: https://gerrit.wikimedia.org/r/#/c/391995/
[07:06:10] out for outs?
[07:06:45] what?
[07:06:52] ah
[07:06:59] yeah, i meant for hours
[07:08:08] fixed
[07:08:56] the vslow is wrong?
[07:09:31] db1100 not 1110 ?
[07:10:11] yep
[07:10:23] good catch, looks like I was still half asleep at that time
[07:11:52] done
[07:13:50] looks good
[07:13:55] great
[07:14:04] thanks for the review
[07:14:39] question, did you run the alters on db1071?
[07:14:44] they are still running
[07:14:57] that host isn't as powerful as the others, so it will take longer
[07:15:05] ah, makes sense
[07:15:17] i will enable gtid later
[07:15:24] what about db1063, are we going to abandon that host?
[07:15:27] we also have to move dbstore1001
[07:15:43] yeah, i would say we can use it as vslow for s8 or something else
[07:15:48] but as a master, i don't trust it anymore
[07:20:03] I would rebuild db1063 from db1071 once db1071 is finished, for instance
[07:33:50] maybe I should start with binary log disabled?
[07:34:12] maybe gtid is so engrained that if there is no binary logs, it fails
[07:34:24] or with log_bin=labsdb1010
[07:34:49] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769053 (10Marostegui) p:05Unbreak!>03Normal Setting back priority to normal as we are back to a normal state now. Pending things: - move dbstore1001 under the new...
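(On the "we have to enable GTID on the slaves" reminder above: a minimal sketch of how a MariaDB 10.x replica is switched from file/position to GTID-based replication. This is illustrative only, not a command taken from the log.)

```sql
-- Sketch: switch an s5 replica to GTID-based replication once it is in sync.
STOP SLAVE;
CHANGE MASTER TO MASTER_USE_GTID = slave_pos;
START SLAVE;
-- Verify: Using_Gtid should now report Slave_Pos.
SHOW SLAVE STATUS\G
```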
[07:34:59] let's try the log_bin=labsdb1010
[07:35:07] as it is an easy one to try now, I guess
[07:35:24] but it is something to take into account, that multisource+10.1 isn't great when copying data
[07:35:39] because with 10.1 without multisource I had no issues when building the new multi-instance hosts
[07:35:50] now that I see, not sure it has binary logs enabled
[07:36:27] try to boot with the log_bin=labsdb1010, because those are definitely there
[07:36:47] it has
[07:40:56] it fails too
[07:41:06] I am going to try moving the master-info entries away
[07:41:14] that is what it is complaining about
[07:42:08] now it works
[07:42:17] so it tries to initialize the slaves
[07:42:27] despite the skip
[07:42:46] more than that- it fails if the slaves don't initialize, even with the skip
[07:44:15] so what did you do? remove the master-info file?
[07:44:25] fileS
[07:44:44] yes, all the related stuff
[07:44:51] there is one per connection
[07:44:56] yeah
[07:44:58] and it worked?
[07:45:01] I think it complains there are no relay logs
[07:45:13] which there are, just with a different name
[07:45:36] so that is a thing to take into account
[07:45:44] with multisource, the relays fail
[07:45:54] but I think they failed before, just not fatally
[07:46:38] I am going to start labsdb1010 just to make sure the problem is not ours for trying to start the slaves
[07:47:00] maybe it worked on the others because there is no master.info, the replication control is in a table
[07:47:18] so it fails when GTID is not enabled, not because of multisource
[07:47:40] the skip slave start works as intended
[07:49:51] ah could be
[07:49:52] yeah
[07:49:54] that makes sense
[07:50:03] "makes sense"
[07:50:21] well, yeah "makes sense"
[07:59:10] the good news is that it finally works
[08:00:20] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1009&var-port=9104
[08:01:01] nice!!!
[08:01:10] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769069 (10Marostegui)
[08:01:12] so removing the master files did the trick!
[08:01:23] *.info
[08:01:50] yes
[08:02:07] we should remove all the garbage bin and relay and all that stuff that gets added every time we clone
[08:09:47] just wanted to say thanks before I leave for my long weekend <3
[08:09:50] you guys rock
[08:10:16] I think looking at the graphs, there were not many API queries failing
[08:10:28] https://logstash.wikimedia.org/goto/ffb95c38a179fae2e68963dcbff52797
[08:11:33] and 2000 failures in the last 24 hours, given the circumstances, is pretty good
[08:17:49] ema: thanks! have a good weekend!
[08:18:12] jynus: that means we can more or less serve s5 with 2 servers, no problem for s8 HW allocation then XD
[08:18:40] ha
[08:20:29] good morning
[08:20:49] from what I gather from T180714 we are in a relatively OK
[08:20:49] T180714: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714
[08:21:00] \o/
[08:21:59] I am guessing an incident report is due. I can start it if you want
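(On "we should remove all the garbage bin and relay and all that stuff that gets added every time we clone": a hedged sketch of how that cleanup can be done from the client on a MariaDB multi-source instance once it is up, instead of deleting master.info/relay files by hand; the connection names below are examples, not taken from the log.)

```sql
-- Sketch: discard leftover per-connection replication state on a freshly
-- cloned MariaDB multi-source instance (connection names are illustrative).
SHOW ALL SLAVES STATUS\G
RESET SLAVE 's2' ALL;   -- drops master.info and relay log state for that connection
RESET SLAVE 's5' ALL;
-- Then re-point the instance explicitly, e.g.:
-- CHANGE MASTER 's5' TO MASTER_HOST='...', MASTER_LOG_FILE='...', MASTER_LOG_POS=...;
```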
[08:26:25] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769079 (10Marostegui)
[08:26:54] akosiaris: that'd be appreciated, we, dbas, can later give more details after your initial write up
[08:27:15] marostegui: ok, I am already writing the first draft
[08:27:57] akosiaris: thanks a lot
[08:31:49] morning, don't want to be in the way, but if you have anything I can help with, please let me know
[08:33:31] thanks volans
[08:33:47] we are good now, we are doing some leftovers
[08:34:02] https://phabricator.wikimedia.org/T180714
[08:37:37] jynus: i was thinking about starting mysql on db1063 and starting replication on db1071 with start slave until log pos 234464548 and log file db1063-bin.001382 and once there, move it below db1070
[08:38:35] ok, that makes sense
[08:39:43] will do it then
[08:43:46] We can say goodbye to db1063, it is totally broken and doesn't start: https://phabricator.wikimedia.org/P6337
[08:44:30] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769090 (10Marostegui) db1063 is totally broken and won't start: https://phabricator.wikimedia.org/P6337
[08:45:17] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769091 (10Marostegui)
[08:46:05] we should check for disk errors
[08:46:15] in case we want to reuse the host
[08:46:27] I mean physical errors
[08:46:30] yeah, I would do a full reimage even
[08:46:38] Let's leave it like that for now as we need to move dbstore1001
[08:46:41] io hard errors
[08:46:41] under db1070
[08:47:05] we will have to do dbstore1001 and db1071 the old way
[08:47:22] yeah pff :(
[08:47:28] or just rebuild db1071
[08:47:33] it would be nice to keep the binlog for a while
[08:47:40] yeah, i am not gonna touch db1063
[08:47:44] until we move those two
[08:47:46] I want to give a try to db1071
[08:48:04] it was down when everything happened, in a way it is the best-preserved host
[08:48:05] if you are up for some archeology today, sure :)
[08:48:53] not really
[08:49:00] I have the binlog position of the master
[08:49:03] and its binlogs
[08:49:09] it should be much easier
[08:49:24] but it was delayed when all happened, no?
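(The 08:37 plan above, sketched out; it assumes db1063's mysqld could be started so db1071 could fetch up to the quoted coordinates. As the following exchange shows, db1063 would not start, so the position was instead located on db1070's binlogs and applied directly.)

```sql
-- Sketch: replay db1071 only up to db1063's last known binlog position.
START SLAVE UNTIL
  MASTER_LOG_FILE = 'db1063-bin.001382',
  MASTER_LOG_POS  = 234464548;
-- Once Exec_Master_Log_Pos reaches 234464548, the host can be re-pointed
-- under db1070.
SHOW SLAVE STATUS\G
```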
[08:50:06] if GTID worked, actually, it would be automatic
[08:50:15] but it is not going to work
[08:50:15] that'd be a nice test
[08:50:25] so I am not going to even try
[08:50:28] haha
[08:50:35] yeah, it can make things worse
[08:51:38] you'll just look for the gtid coordinates from where db1071 stopped and look for the same in db1070 and that should "translate" to the exec_log_pos we need to start it from I guess
[08:51:55] I could try with the trick on the jira ticket
[08:52:03] but not on this host
[08:52:20] I will do that, but I will check with the actual binary log and recentchanges table to check it is correct
[08:55:02] GTID 171974884-171974884-1473015127 is the next one (26 is the executed one)
[08:55:21] that is confusing because binlogs check offsets while gtid checks events
[08:56:09] that is db1071-bin.005818:455 or db1063-bin.001382:184482382
[08:57:03] next query is UPDATE /* LinksUpdate::updateLinksTimestamp */ on dewiki executed 2017-11-16 16:51:08
[08:57:27] yep, I can confirm those positions and gtid coordinates
[08:57:53] let's check where that position translates to for db1070 then
[08:58:59] mysqlbinlog db1070-bin.001476 --start-datetime='2017-11-16 16:51:07' | less
[08:59:56] end_log_pos 311174141 and end_log_pos 311174214
[09:01:03] It would be offset 311174103
[09:01:35] those should work too
[09:01:35] yep, right before the COMMIT
[09:02:08] yeah either 311174103 or 311174141
[09:02:10] i would say
[09:02:33] 311174103
[09:02:44] if I do 311174141 it skips the BEGIN
[09:03:10] yep
[09:03:13] and the gtid stuff
[09:03:33] let's check now that the last transaction before that has been executed
[09:03:35] yeah, better not to mess with gtid if possible
[09:03:39] and the first one wasn't
[09:04:33] so dewiki.page page_id=1733759 should have updated == 20171116073940
[09:05:26] and that is correct
[09:06:22] but `wikidatawiki`.`site_stats` should be executed
[09:06:50] meaning one field should be 595371787 and not 86
[09:07:26] correct, total edits is 595371787
[09:08:18] sorry my connection froze
[09:08:21] reading backlog
[09:08:28] so CHANGE MASTER TO db1070-bin.001476:311174103
[09:09:34] double checked the inserts on page and site_stats
[09:09:37] and agreed
[09:09:50] UPDATE /* SiteStatsUpdate::tryDBUpdateInternal */ `site_stats` SET ss_total_edits=ss_total_edits+1
[09:09:57] its own binlog agrees too
[09:10:05] on db1071-bin.005816
[09:10:23] the last binlog written with real data
[09:12:30] I am running the change master now
[09:12:37] great
[09:13:45] oh, we need an alter
[09:13:51] which one is missing?
[09:13:59] i thought i ran all of them
[09:14:00] Column 8 of table 'dewiki.recentchanges' cannot be converted from type 'tinyint'
[09:14:00] let me check
[09:14:43] right
[09:14:45] let me do it
[09:14:57] it is only a 5GB table
[09:14:59] should be faster
[09:15:00] is it just missing or is there something else happening?
[09:15:15] it was missed
[09:15:17] like binlog issues because it is before
[09:15:24] ok, cool
[09:15:28] the column is there
[09:15:33] it shouldn't be
[09:16:45] just checked, it was the only one missing
[09:16:56] running it now
[09:18:15] done
[09:18:27] started replication
[09:19:25] we really need sanitarium split: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1010&var-port=9104&from=now-1h&to=now
[09:21:25] what? where is that coming from?
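(The "CHANGE MASTER TO db1070-bin.001476:311174103" shorthand above spelled out; a sketch only, assuming the replication user and password were already configured on db1071, with the file and offset taken from the conversation.)

```sql
-- Sketch: move db1071 under db1070 at the verified offset 311174103,
-- i.e. just before the GTID/BEGIN events of the first transaction the host
-- has not executed yet.
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST     = 'db1070.eqiad.wmnet',
  MASTER_LOG_FILE = 'db1070-bin.001476',
  MASTER_LOG_POS  = 311174103;
START SLAVE;
SHOW SLAVE STATUS\G
```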
[09:21:29] ah right
[09:21:32] that is labsdb1010
[09:21:38] i got scared
[09:21:47] I meant to show only https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1010&var-port=9104&from=1510906899368&to=1510910499368&panelId=6&fullscreen
[09:23:25] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769141 (10Marostegui)
[09:24:00] i was taking a look at dbstore1001
[09:24:58] that should work for a full day
[09:26:09] we should only check that the relay log is updated to the same as the other hosts were yesterday
[09:26:35] and move it to the new host when it reaches it
[09:26:50] with events disabled, replication stopped, etc.
[09:27:11] I was checking and looks like db1070-bin.001472 1468438682 is around where we'd like to be
[09:27:34] ?
[09:27:51] am I overcomplicating things maybe?
[09:28:17] not sure what you mean
[09:29:07] we have to move dbstore1001 under db1070 so I was taking a quick look at its last executed gtid coordinates, and that matches to GTID 171974884-171974884-1468438682 which is at db1070-bin.001472 1468438682
[09:29:25] sorry db1070-bin.001472 528477941
[09:29:25] ah, it stopped replicating it seems?
[09:29:29] yeah, it broke
[09:29:57] maybe we can force its start so it catches up from its own relay log and see what happens?
[09:29:59] as db1063 is dead
[09:30:16] yeah, we could; it should end up at the same position as all the other slaves hopefully
[09:30:24] and it would be an easy move that way
[09:30:27] the io thread broke, but the sql should be able to continue
[09:30:27] with events disabled, of course
[09:30:28] unless
[09:30:35] something else is wrong
[09:30:46] if it stops on the same parameter, less thinking
[09:30:53] indeed
[09:30:56] let's try?
[09:31:03] not sure how exactly
[09:31:10] can we disable delayed replication
[09:31:18] for just 1 thread
[09:31:18] ?
[09:31:20] probably not
[09:31:29] we'd need to: stop all slaves, disable it, start only s5
[09:31:34] the sql_thread i mean
[09:31:36] and see what happens
[09:31:49] let me research if we can affect only 1 thread
[09:32:03] as we are not in a hurry
[09:32:14] we can try to start the sql_thread anyways
[09:32:20] to see if it even starts
[09:32:24] or we are wasting time XD
[09:32:27] that is true, too
[09:32:32] let me do that
[09:33:37] yeah we cannot choose single threads, because it creates the control table with show slave status
[09:33:46] looks like it started fine
[09:33:48] and I do not want to change the event just for this
[09:33:56] will it stop?
[09:34:00] we'll see
[09:34:07] yeah, we'll see
[09:34:23] but that will give us more room
[09:34:29] with replication down
[09:34:35] it is advancing on the exec log pos, but Seconds behind the master remains NULL
[09:34:49] so we'll see
[09:34:59] how does the event control the delay?
[09:36:45] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769161 (10Marostegui)
[10:41:37] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769251 (10Marostegui) I have copied db1063's binlogs over to: ``` root@dbstore1001:/srv/tmp/T180714# ls -lh total 21G -rw-r--r-- 1 root root 21G Nov 17 10:39 db1063_b...
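(What "start only s5 ... the sql_thread i mean" looks like on a MariaDB multi-source replica; a sketch, assuming the connection is named 's5' as on the other multi-source hosts.)

```sql
-- Sketch: let dbstore1001 finish applying its existing s5 relay logs without
-- the IO thread (db1063, the old master, is gone, so the IO thread cannot connect).
START SLAVE 's5' SQL_THREAD;
-- Exec_Master_Log_Pos should keep advancing; Seconds_Behind_Master may stay
-- NULL because the IO thread is not running.
SHOW SLAVE 's5' STATUS\G
```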
[10:42:01] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769252 (10Marostegui)
[10:58:45] 10DBA, 10Patch-For-Review, 10cloud-services-team (Kanban): labsdb1009 crashed - OOM - https://phabricator.wikimedia.org/T179244#3769265 (10jcrespo) 05Open>03Resolved This should be now fixed.
[11:02:20] db1071 caught up
[11:09:27] what is the simplest way to get a standalone mysql in WMCS Cloud VPS nowadays? any magic puppet role I could use? :-)
[11:10:57] db1070 for s5 and db1071 for s8 ?
[11:11:06] yeah, that sounds good
[11:11:26] volans: mariadb ?
[11:11:43] but we do not have a nice "init.pp" for non-production
[11:11:54] same for me (mariadb or mysql)
[11:11:59] which means it needs tweaks
[11:12:08] no, I mean the mariadb module
[11:12:14] which can install mysql, too
[11:12:47] look for the examples of mysql for non production on analytics internal instances or labs recursor
[11:12:57] ok, thanks
[11:13:26] meaning that you need to enable extra parameters to restart automatically, etc., not the other way around
[11:14:10] * volans wonders if apt-get install is easier, I don't need its puppetization really
[11:15:24] modules/profile/manifests/openstack/base/pdns/auth/db.pp
[11:16:03] I can modify init.pp to do what you want, it is not the first person that asks about it
[11:16:15] *you are
[11:16:37] and it will be needed for T162070
[11:16:38] T162070: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070
[11:18:06] that would be nice, but don't do it for me, there is more urgent/important stuff
[11:19:35] technically include mariadb should do what you want, but with so many asterisks, that needs work
[12:25:23] 10DBA, 10Wikidata: Migrate wb_items_per_site to using prefixed entity IDs instead of numeric IDs - https://phabricator.wikimedia.org/T114904#3769427 (10Ladsgroup) The query @Multichill is running is complex on its own and even if we resolve this task, it'll be still a very slow query, what I would recommend is...
[12:27:50] I enable GTID on db1071.eqiad.wmnet?
[12:28:04] and reload the query killers on db1070
[12:28:09] ok?
[12:28:10] yep!
[12:28:20] well, will do after lunch
[12:28:27] enjoy!
[13:02:41] 10DBA, 10Operations: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385#3769492 (10Marostegui)
[13:04:02] 10DBA, 10Operations: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385#3769495 (10Marostegui) @MZMcBride I assume this table is to be dropped, right? So I can update its entry on T57385 "Removable" row saying "YES" ?
[13:07:14] 10DBA, 10Cloud-Services, 10Toolforge: Disabling general.confirmeduser from dbreports for using up too much db resources - https://phabricator.wikimedia.org/T131956#2184155 (10Marostegui) Any objection to close this ticket? It is pretty old, the problematic job was disabled more than a year ago and the old la...
[13:11:01] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3769508 (10Cmjohnson)
[13:55:34] 10DBA, 10Cloud-Services, 10Toolforge, 10Tracking: Certain tools users create multiple long running queries that take all memory from labsdb hosts, slowing it down and potentially crashing (tracking) - https://phabricator.wikimedia.org/T119601#3769576 (10chasemp)
[13:55:38] 10DBA, 10Cloud-Services, 10Toolforge: Disabling general.confirmeduser from dbreports for using up too much db resources - https://phabricator.wikimedia.org/T131956#3769573 (10chasemp) 05Open>03Resolved a:03chasemp >>! In T131956#3769500, @Marostegui wrote: > Any objection to close this ticket? > It is...
[14:10:08] 2 binlogs to reach db1063 for dbstore1001 :)
[14:12:17] yeah
[14:12:21] 4am 16
[14:12:25] just checked it
[14:12:33] I can handle it when it finishes
[14:12:42] you may want to go and rest?
[14:13:04] yeah, I was thinking about going and taking a nap
[14:13:19] I won't be touching much of s5
[14:13:20] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3769614 (10Cmjohnson)
[14:13:27] I will reimage dbproxy1004
[14:14:08] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3766852 (10Cmjohnson) a:03Marostegui @marostegui These are ready for you
[14:14:46] Yesterday I planned to build db1097 in s5 and populate db1101.s5 rc, but obviously that is not possible today, so I will do that tomorrow
[14:14:54] And with those two hosts chris just racked, we should be good
[14:22:47] 10DBA, 10Operations, 10ops-eqiad: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3769649 (10Marostegui)
[14:28:09] jynus: going to take a nap and will be back later to also revert the schema change on dbstore1001
[14:28:20] it was pretty fast on dbstore1002 (thanks toku!)
[14:28:25] like 10 minutes in total
[14:49:13] 10DBA, 10Analytics, 10Operations, 10Patch-For-Review, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3769759 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1004.eqiad.wmnet'] ``` and were **ALL** successful.
[15:20:26] 10DBA, 10Performance-Team, 10Wikimedia-log-errors: Frequent "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock" errors on WMF mediawiki logs - https://phabricator.wikimedia.org/T180793#3769879 (10jcrespo) The first occurrence found seems to start at 2017-11-03T17:57:04, but that is taken from...
[16:21:44] oh dbstore1001 still not there yet
[16:23:30] dbstore1002 broke
[16:23:33] I am checking it
[16:23:47] (replication, nothing too worrying)
[16:24:02] i see…
[16:24:08] a completely different table
[16:24:16] links*
[16:24:21] I am on it, don't worry
[16:25:12] I am not too surprised if we see more breakages, given that we are in ROW
[16:27:18] also it is a deletion of a non-existent row, so cool, I will check it more thoroughly before adding it or skipping
[16:27:40] dbstore1001 might fail too?
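(If skipping turns out to be the right call for the dbstore1002 breakage above, the per-connection skip on a MariaDB multi-source replica looks roughly like this; a sketch, assuming the broken connection is named 's5'. The other option mentioned in the log, inserting the missing row so the DELETE applies cleanly, avoids skipping an event at all.)

```sql
-- Sketch: skip one broken event on a single multi-source connection.
STOP SLAVE 's5';
SET @@default_master_connection = 's5';  -- scope the skip counter to this connection
SET GLOBAL sql_slave_skip_counter = 1;
START SLAVE 's5';
SHOW SLAVE 's5' STATUS\G
```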
[16:30:11] who knows
[16:30:21] tokudb is not reliable
[16:30:32] especially, dbstore1002 and dbstore1001 need a reload
[16:30:43] yeah, they are begging for one
[16:49:12] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#3770231 (10Marostegui)
[16:52:33] dbstore1001 now into the last binlog \o/
[16:52:34] almost done
[16:53:37] actually it is bigger than the others, so I guess it will take a couple of hours or so :-(
[16:53:53] you know I am around and you should go away, right?
[16:54:05] haha
[16:54:15] well, that is not totally fair, why should I go and you stay?
[16:54:21] I am being serious
[16:54:29] it is a waste of time for 2 people to be doing the same
[16:55:13] I will finish something and I will go, I promise
[17:08:50] wikidatawiki.recentchanges missing?
[17:09:11] on 1001?
[17:09:45] not sure why I received a notification
[17:09:51] dbstore1001 didn't get any revert yet
[17:09:59] notification?
[17:10:00] where?
[17:10:05] I think it was db1109 confusing my app
[17:10:22] everything is ok
[17:10:28] :)
[17:10:39] dbstore1001 reached the position
[17:11:01] Relay_Master_Log_File: db1063-bin.001382
[17:11:01] Exec_Master_Log_Pos: 234464548
[17:11:22] going to issue the switch
[17:11:23] to
[17:11:29] 234464548
[17:11:34] the same as the others
[17:11:41] master_host='db1070.eqiad.wmnet', master_log_pos=361022356, master_log_file='db1070-bin.001476'
[17:11:48] ^ agreed?
[17:11:53] we can do the failover to db1070-bin.001476:361022356
[17:12:04] correct
[17:12:09] ok, doing it
[17:12:18] I copied it from the log yesterday
[17:12:45] that should fix dbstore and not create any lag
[17:13:05] we have to apply the revert once done, but it shouldn't take long
[17:13:09] it was really fast on dbstore1002
[17:13:16] revert?
[17:13:18] which one
[17:13:27] the schema change
[17:13:31] ah
[17:13:39] yes, it should be immediate on tokudb?
[17:13:52] master switch done
[17:13:56] replication broken because of schema change
[17:13:59] so going to revert it
[17:14:05] (the schema change)
[17:15:10] I can see a bunch of records having been written with pt-heartbeat
[17:18:07] still altering wikidatawiki.recentchanges
[17:21:32] that one took 20 minutes in dbstore1002
[17:22:29] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3770456 (10Marostegui)
[17:36:33] 10DBA, 10Analytics, 10Operations, 10Patch-For-Review, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3107358 (10mforns) Re @elukey Seems to me that the DROP DATABASE list is correct. To the list of databases to review I would add: akh...
[17:39:43] all done
[17:40:00] And the slave starts correctly, but gets stopped by the events, so that is good
[17:40:52] 10DBA, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3770570 (10Marostegui)
[17:41:12] Now, going to log off, might get online later to check if all is good
[17:41:58] cool
[18:08:54] 10DBA, 10Analytics, 10Operations, 10Patch-For-Review, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3770727 (10DarTar) I just spoke to @JAllemandou, we can help review these legacy tables early next week.
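(On "a bunch of records having been written with pt-heartbeat" after the master switch: lag on these hosts is usually derived from the heartbeat table rather than Seconds_Behind_Master. A sketch, assuming the stock pt-heartbeat layout of a `heartbeat.heartbeat` table keyed by server_id; the actual table layout in this deployment may differ.)

```sql
-- Sketch: latest heartbeat written by each originating master.
-- Replication delay is roughly the current time minus the newest ts for the
-- master this replica is now supposed to follow (db1070 after the switch).
SELECT server_id, MAX(ts) AS last_heartbeat
FROM heartbeat.heartbeat
GROUP BY server_id;
```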