[05:16:20] 10DBA, 10Operations, 10ops-eqiad: db1064 has disk smart error - https://phabricator.wikimedia.org/T206245 (10Marostegui) 05Open>03Resolved a:03Marostegui Thanks! ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1, Secondary-0, RAID Level... [05:18:38] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Marostegui) This disk has been marked with errors, can we get a different one? ``` Span: 5 - Number of PDs: 2 PD: 0 Information Enclosure Device ID: 32 Slot Number: 10 Drive's position: DiskGroup: 0, Spa... [05:20:31] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Marostegui) The disk failed to rebuild, if this is a brand new disk, can you pull it out wait a couple of minutes and then pull it in back? Thanks! ``` Enclosure Device ID: 32 Slot Number: 3 Drive's posi... [05:32:54] 10DBA, 10Data-Services, 10Patch-For-Review, 10User-Banyek: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983 (10Marostegui) Upstream updated the ticket saying: fix: 3.0.12 which was released a few days ago. I tested it and it is indeed not fixed there (and... [07:06:49] banyek jynus https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/465120/ what do you guys think? [07:07:36] I don't know [07:08:53] 10DBA, 10User-Banyek: BBU problems dbstore2002 - https://phabricator.wikimedia.org/T205257 (10Marostegui) >>! In T205257#4641904, @Papaul wrote: > @Marostegui yes next Thursday works for me. @Papaul let's move this to some other day. Thursday 11th is right after the failover, and we might have some clean up... [07:09:04] The bad BBU is just a performance issue [07:09:13] "just" [07:10:20] So, opinions? [07:10:34] the host is not getting writes directly, so I don't think if anything would change [07:10:39] I'd say lgtm [07:10:49] banyek: But it will have the reads overhead [07:10:49] (and keep an eye on that) [07:11:44] we can revert this really quick if needed [07:12:24] Or put the raid controller to write-back enabled to mitigate any possible lag and then revert [07:14:52] let's say the macine crashes because of reasons, and b/c the enabled write-back the data get corrupted? so what, we reclone it [07:15:40] I mean it is not a master, nor a dedicated master, nor a host with n sections on it [07:16:46] ok, I will pool it, if it lags as soon as it gets reads, let's enable write-back, revert and then disable write back [07:19:33] Maybe the BBU will arrive between and tomorrow, although today is a public holiday in USA so DCOps won't be there [07:19:52] So I think we won't be able to change it [07:20:07] Before the failover, I mean [07:20:38] ok [07:21:05] I start to deploy the wmf-pt-change on labsdb1010 [07:21:47] I disable puppet on labsdb hosts [07:25:11] jynus: let me know when you want to start doing the pending steps for the wikis movement [07:25:24] Once that is done, I would say we enable replication eqiad -> codfw [07:25:56] shouldn't we wait for the rename to the same day? [07:26:10] that way we don't have to replicate a lot of days [07:26:41] we can setup the s5 filter in advance, though [07:27:00] yeah, we can do that now yep [07:27:36] then rename s3 tables the day after tomorrow [07:27:44] sounds good [07:28:02] Enabling eqiad > codfw replication shouldn't matter, do you want to do that today? 
[07:28:11] shouldn't matter for the table rename, I mean [07:29:17] yes [07:29:40] it does matter a bit, we need an s5 filter [07:29:57] for the s3 tables [07:30:19] yeah, to ignore those wikis [07:32:16] I should start enabling those? [07:35:59] let me help [07:37:00] sure, how do we organize? [07:38:01] you need to disable gtid [07:38:09] on the eqiad master [07:38:24] and then start rep everywhere except s5 [07:38:27] yep :) [07:38:30] which needs extra filters [07:38:43] I will leave s5 for the last [07:39:50] e.g. change db2048 to db1067-bin.001646:717261743 (or whatever) and I can double check before you start slave [07:39:58] great! [07:40:00] let's do that [07:40:14] I am going to disable gtid on all masters (as I said, not touching s5 at all for now) [07:43:21] s1: change 2048 to: db1067-bin.001646 723318262 [07:43:38] did you set it up, should I check now? [07:43:54] No, not yet [07:44:09] I will let you know right before doing start slave; that's better [07:44:16] so you can check host, connection etc [07:44:54] check s1: db2048 [07:45:50] ok, doing [07:46:46] I am not trusting you at nothing, that is why it is taking me so long [07:47:03] please, do so [07:47:09] this is a very critical thing [07:47:18] s1 ok [07:47:23] ideally I would also like banyek to check ;) [07:47:43] starting s1 codfw slave [07:48:48] no issues, right? [07:49:02] nope [07:49:10] going for next one [07:50:01] jynus: check s2: db2035 [07:50:45] I'm not sure what to check [07:50:55] I see replication set up but stopped [07:51:03] banyek: We are enabling replication eqiad -> codfw [07:51:18] i know that [07:51:19] So it would be cool if you can also check what I am doing, just in case :-) [07:51:25] I just don't know what to check ;) [07:51:26] s2 is ok [07:51:44] s2 started [07:51:54] we can talk later about common mistakes [07:52:03] yep, I see it's running now [07:52:31] banyek: basically that the output of show slave status looks good, is the master correct? is the binary log the same that is one running on the current eqiad master? is position "recent"? is ssl enabled?, is the username correct? [07:52:51] replication filters [07:52:52] gtid [07:53:03] wrong master [07:53:07] wrong replica [07:53:18] ok [07:53:49] I wonder if instead of a filter, we could setup a multisource replcia with a filter [07:53:59] so things are kept up to date [07:54:18] but that would need some testing [07:54:34] banyek, jynus check s3: db2043 [07:55:03] checking [07:56:00] mmm is there something wrong? 
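(Editorial note: for readers following along, here is roughly what a per-section instruction like "s1: change 2048 to: db1067-bin.001646 723318262" expands to. The coordinates are the ones quoted above, the password is a placeholder, and MASTER_SSL=1 is included because these are cross-datacentre links, a point that comes up again later in the log. This is a sketch of the procedure being described, not a copy of the exact command that was run.)
```
-- On the codfw s1 master (db2048), pointing it at the eqiad s1 master:
CHANGE MASTER TO
  MASTER_HOST     = 'db1067.eqiad.wmnet',
  MASTER_USER     = 'repl',
  MASTER_PASSWORD = '...secret...',        -- placeholder, never pasted in channel
  MASTER_LOG_FILE = 'db1067-bin.001646',
  MASTER_LOG_POS  = 723318262,
  MASTER_SSL      = 1;
-- Deliberately no START SLAVE yet: a second person reviews the setup first.
```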
[07:56:37] tendril tree looks weird [07:57:15] banyek: tendril looks funny because of all the replication we had to set up to move the wikis around between s3 and s5, I encourage you to read the document jaime wrote with the plan [07:57:54] I have lag, will come back for you in a second [07:57:59] coolio [07:58:00] no rush [07:58:22] I read it [07:58:27] *was [07:58:42] it just looks weird [07:58:48] in which sense [07:59:00] marostegui: it was ok [07:59:15] you can start replication [07:59:20] jynus: doing it [07:59:48] the tree seems more complex than before - which makes sense [08:00:00] nothing wrong [08:00:08] the problem is that it is not a tree [08:00:11] but a graph [08:00:17] check s4: db2051 [08:00:17] however, that is not shown on the tree [08:00:30] but it is show on the host details [08:01:24] s4 looks good [08:01:33] starting [08:01:49] not doing s5 now [08:02:25] s6 next [08:02:29] check s6: db2039 [08:03:37] s6 looks good [08:03:41] starting [08:04:20] lgtm [08:04:34] check s7: db2040 [08:05:32] s7 is good [08:05:38] starting [08:05:55] yes, it si [08:06:25] check s8: db2045 [08:07:06] seems good [08:07:09] s8 is good [08:07:27] x1, es2, s3 now? [08:07:28] starting [08:07:39] jynus: I thought maybe banyek want to do those? :-) [08:08:21] I can do x1 first [08:08:27] yep [08:08:32] You need to disable GTID first [08:08:45] wait [08:08:53] first I have to find the two hosts to log in [08:09:01] :) [08:09:08] 2034 / 1069 [08:09:11] don't log it, use neodymium and mysql.py, it is faster [08:09:42] Codfw masters (not s5) replication confirmed working [08:09:59] pending s5, x1, es2 and es3 [08:10:13] yes I checked those too after setting it up, but only some second after [08:10:24] Yeah, I just ran a quick check to be sure [08:10:27] Thanks for checking :) [08:12:13] banyek: are you ready? [08:12:22] plz don't rush me [08:12:26] ok ok [08:12:33] :) [08:12:51] do the thing is [08:12:52] you can paste the command here if you like (make sure not to post the password) [08:13:10] so [08:13:13] we disable gtid [08:13:17] where? [08:13:26] set up replication as-is (I mean with a current log pos) [08:13:37] and when we'll enable gtid then the missing parts will be filled? [08:13:42] no [08:13:52] we just disable gtid on the active masters [08:13:53] that's why i dont want to rush :) [08:13:53] nope, we leave gtid disabled till replication is disconnected in codfw [08:14:37] banyek: have you identified the two involved hosts? [08:14:40] because our setup doesn't like multi-masters [08:14:54] the hosts are ok [08:15:01] ? [08:15:01] db2034 and db1069 [08:15:17] banyek: good! so where do you need to disable gtid? [08:15:17] that's the thing I am sure now [08:15:22] what do you mean "the hosts are ok"? [08:15:32] db1069 [08:16:20] good! then you have the first step identified, how will you disable it? [08:16:23] "the hosts are ok" = "I am positive that we are talking about those two hosts" [08:16:34] ah, ok, I didn't undertood that [08:16:34] banyek: be verbose :-) [08:17:28] (I am searching for the command - I never did this before) [08:17:35] sure, take your time [08:17:56] marostegui: maybe we can do es2/3 to speed up the process [08:17:58] `CHANGE MASTER TO MASTER_AUTO_POSITION = 0; ` on master maybe? 
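(Editorial note: the checklist marostegui spells out above, "is the master correct? is the binary log the same... is ssl enabled? is the username correct?", maps onto two statements. A sketch of the comparison, with the SHOW SLAVE STATUS fields one would actually look at:)
```
-- On the codfw master that was just configured:
SHOW SLAVE STATUS\G
--   Master_Host / Master_User             -> right master, right replication user?
--   Master_Log_File / Exec_Master_Log_Pos -> the binlog the eqiad master is
--                                            writing right now, at a recent position?
--   Master_SSL_Allowed                    -> Yes (cross-DC traffic is encrypted)
--   Replicate_Wild_Ignore_Table etc.      -> only the filters you expect
--   Using_Gtid                            -> No (these channels are set up from
--                                            binlog coordinates, not GTID)
-- On the eqiad master, for the file/position comparison:
SHOW MASTER STATUS;
```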
[08:18:00] and leave x1 untouched [08:18:03] jynus: yep [08:18:15] banyek: let us finish the others [08:18:24] and you keep reading :-) [08:18:29] ok [08:18:33] banyek: do a research on what you'd do and then we finish es2 and es3 meanwhile [08:18:41] ok [08:18:46] we will not touch x1 [08:19:05] marostegui: you research es2, I confirm [08:19:12] jynus: yep, doing it now [08:19:15] Will ping you to check it [08:20:20] jynus: check es2: es2016 [08:20:53] es2 looks correct [08:20:59] starting [08:21:53] jynus: check es3: es2017 [08:23:19] es3 masters not confusing at all [08:23:53] marostegui: start slave @ es3 , looks good [08:24:00] great [08:24:05] done [08:24:07] let's do s5? [08:24:22] ok, I guess [08:24:45] SHOW ALL SLAVES status\G [08:24:55] or you prefer to do it closer to the failover? [08:25:03] we should do this now [08:25:08] ok! [08:25:12] Doing the research [08:25:12] the rename and filters later [08:25:44] there is no harm to do the filters now, isn't it? [08:26:02] ? [08:26:10] I mean the codfw filters on s3 [08:26:16] I meant the filters on s5 :) [08:26:18] you need filters on s5 [08:26:43] the problem is gtid may break [08:27:07] I think `SET GLOBAL GTID_DOMAIN_ID=0` [08:27:16] yeah, we need to disable it on db1070 [08:27:23] banyek: that is not disabling gtid [08:27:48] to be fair, technically gtid is always enabled on mariadb 10.0+ [08:27:49] fine [08:28:03] it is just active or inactive [08:28:05] there's is no gtid_mode variable which I used to use [08:28:18] banyek: remember this is mariadb [08:28:33] I can't forget it [08:28:35] which has its own implementation of gtid [08:28:52] I know, that's why I have no idea what I have to do [08:28:57] jynus: so the idea would be to disable gtid on both threads of db1070 [08:29:01] I never workded nor seen mariadb before here [08:29:08] banyek: just check mariadb doc to see how to disable gtid :-) [08:29:13] marostegui: in theory is is already disabled [08:29:22] as it would fail otherwise [08:29:28] jynus: true, it is multisource [08:29:53] so the filter needed for s5 is to ignore the "new" wikis [08:29:57] so we need to setup repliation as usual [08:30:09] and add a filter on codfw for the new, not existent wikis [08:30:13] correct [08:30:17] and then check gtid doesn't break [08:30:23] which shouldn't [08:30:43] because it doesn't break on the eqiad replicas with gtid enabled [08:31:00] worse case scenario, codfw master breaks [08:31:04] (replication) [08:31:09] which won't affect production [08:32:08] setup the replica and we will quadruple check it [08:32:12] yep [08:32:18] on db2052 [08:33:10] my worry is circular replication + filters shounds scary [08:33:58] jynus: check s5: db2052 [08:34:04] I just find in docs 'how to disable using gtid on slaves when replicating', [08:35:25] banyek: cool [08:35:29] jynus: Replicate_Wild_Ignore_Table: enwikivoyage.%,cebwiki.%,shwiki.%,srwiki.%,mgwiktionary.% [08:35:34] yeah [08:35:41] I am thinking about the heartbeat table [08:35:48] mmm [08:35:59] from s3, if it could be stuck in a circle [08:36:00] might break the sanitariums? [08:36:28] but we don't replicate that, right? [08:36:40] yeah, we skipped that [08:36:42] let me check [08:36:43] so that should be ok [08:37:52] so I can disable gtid on db2034 as `CHANGE MASTER TO MASTER_USE_GTID=OFF` [08:37:58] marostegui: I think to the best of our plan, that should be ok [08:38:08] banyek: we are with you in a sec [08:38:13] okok [08:38:24] jynus: the filters look good, right? 
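(Editorial note: a hedged sketch of how a filter like the Replicate_Wild_Ignore_Table list quoted above can be put in place on the codfw s5 master, db2052. It assumes the filter variable can be set dynamically with the slave stopped; otherwise it goes into my.cnf and needs a restart. The wiki list is the one from the paste above.)
```
STOP SLAVE;
SET GLOBAL replicate_wild_ignore_table =
    'enwikivoyage.%,cebwiki.%,shwiki.%,srwiki.%,mgwiktionary.%';
START SLAVE;
-- Check it is in effect:
SHOW SLAVE STATUS\G   -- Replicate_Wild_Ignore_Table should list the five wikis
```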
[08:38:25] I checked both db1070 [08:38:30] yes [08:38:39] and db2052 [08:38:47] let's do it [08:38:49] let's start replication [08:38:56] done [08:39:18] no complains so far [08:39:23] let me check network [08:39:58] no high spike so far [08:40:05] and no errors from the replicas? [08:41:19] yeah, it looks good [08:42:00] actually, if there is no writes to s3 eqiad for the old wikis, we may not need filters on codfw for s3? [08:42:24] tables will be renamed [08:42:30] so any write will fail actually [08:42:38] yes but only on the master, right? [08:42:38] so yeah, we probably don't need them [08:42:41] yeah [08:42:54] but then we don't need the filters on s3 codfw master [08:42:55] I will need to setup a full plan for the switch [08:43:12] banyek: let's go with x1? [08:43:12] but only to recover the lost edits [08:43:18] jynus: yep [08:43:20] we will think about that later [08:43:33] yep [08:44:12] banyek: let me pull your neck a bit by saying that from "there is nothing to check!" to "I may know what we may have to do" there is a huge leap :-D [08:44:39] it was 'I don't know what to check' [08:44:58] ok, fair [08:45:08] do you have some commads for us? [08:45:10] banyek: That is why I thought it was a good idea to give you one master to do, so you get in full context :-) [08:45:23] marostegui: agree [08:45:34] banyek: also don't worry, if you mess up, you only break all the servers on production for that section! [08:45:43] XDD [08:46:23] do you know the commands? [08:46:29] jynus: I was googled the hell out of 'disabling gtid on mariadb' but found nothing, so I turned to 'setting up gtid for mariadb' :D And I only found how to disable gtid on a slave [08:46:42] banyek: so, is that enough? [08:46:42] ok, that may be it [08:46:49] with `change master to master_use_gtid=off' [08:46:59] banyek: how will you check if that was successful? [08:47:02] as the master always "replicates" gtid [08:47:10] what we want is to disable it as a replica [08:47:25] which is more improtant- where to run it? [08:47:35] and how to check if it was successful [08:47:41] oh [08:47:42] yep, both [08:48:09] well, my guess is the GTID_OP_POS will be empty on the master's show slave status output [08:48:11] hm [08:48:21] guess again :-) [08:48:40] banyek: hint: you might want to compare outputs ;-) [08:50:41] what I am seeing now: db1069 is already set up as a slave of db2034 but gtid_io_pos has gtid's [08:51:00] and there's the using_gtid=slave_pos already set up [08:51:03] anything else interesting on that output? [08:51:04] right [08:51:16] maybe check another host we disabled gtid on? [08:51:19] and check the differences? [08:51:33] e.g. db2034 [08:52:02] on db2034 the show slave status output is empty [08:52:17] so look for another one :) [08:52:21] fair [08:52:48] pn db2045 I just see this two setups, the 'Using_gtid' and the 'GTID_IO_POS' [08:53:15] the other variables seems 'normal' [08:53:21] banyek: what does Using_gtid says? [08:53:47] No || Slave_Pos [08:53:53] there you go, no? [08:53:57] yep [08:54:13] so, what's the plan? [08:54:13] an easier question now [08:54:33] if you just run "change master to master_use_gtid=off" it will fail, why? 
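(Editorial note: the answer being fished for here has two parts, which the next few lines of conversation work through. Spelled out as a sketch:)
```
-- 1) OFF is not a value MariaDB accepts; the options are
--    MASTER_USE_GTID = {current_pos | slave_pos | no}
-- 2) even with a valid value, CHANGE MASTER is refused while the replication
--    threads are running, so the command has to be wrapped like this:
STOP SLAVE;
CHANGE MASTER TO MASTER_USE_GTID = no;
START SLAVE;
```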
[08:54:56] because the 'off' != 'no'; [08:55:05] with no it will fail too [08:55:10] which actually makes sense [08:55:12] :-) [08:55:14] because the 'off' != '"no"'; [08:55:22] It will fail with "no" too [08:55:42] ^that is what I really wanted to ask [08:56:52] (that is not mariadb-specific, so you should not that one) [08:56:56] *know [08:56:58] I see this 3 options `MASTER_USE_GTID = {current_pos|slave_pos|no}` [08:57:59] Yeah, those are the mariadb GTID enabling options, but actually, changing to any of those will still fail :-) [08:58:53] as a reminder, we have a meeting in one hour, we want to be done by then :-) [08:59:09] please help, I am really clueless now, so you are just torturing me guys [08:59:16] <3 [08:59:39] any cHANGE MASTER command will fail with replication running [08:59:50] we don't want to stop replication on eqiad [08:59:58] banyek: in order to do change master you need to stop slave and then start it [09:00:06] so we just enclose thecommand with STOP SLAVE; --- ; START SLAVE; [09:00:49] https://www.dropbox.com/s/y8rp0q9edr7ioxw/Screenshot%202018-10-08%2011.00.44.png?dl=0 [09:01:13] banyek: no [09:01:29] banyek: you need to do: "no" instead of "slave_pos" [09:01:59] that is the command to "enable gtid" [09:02:08] so you did have it! [09:02:10] I meant I was know what I can't change that without stopping and wanted to show, (ofc. I was thinking on 'no' now) but I didn't write [09:02:19] but agree, I didn't wrote that [09:02:33] banyek: as I said, be verbose, we don't know what you are thinking :-) [09:02:38] ok, can you run "no" on which server? [09:03:16] I was really just focusing on 'what is the problem with the command itself' which was a mistake indeed [09:03:41] ```STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID = no; START SLAVE;``` [09:03:50] hm. [09:03:51] on which host? [09:03:52] where? [09:03:52] no. [09:04:00] db1069 [09:04:15] and how will you check it was indeed disabled? [09:04:27] tht will ensure that it will just use binlogs from db2034 [09:04:45] I am not sure I understand what you mean [09:04:47] in the output of 'SHOW SLAVE STATUS\G [09:04:48] ' [09:05:01] gtids [09:05:04] I think I am getting what you are saying [09:05:16] I am a bit stressed no [09:05:18] w [09:05:33] but it is more elaborate than that, it will use the bilong just for replication control [09:05:39] which is what we want [09:05:52] so let me take a big breath, and [09:05:53] ... [09:06:05] So, you've got the host, the command and how to check it, so we are good [09:06:41] on the db1069.eqiad.wmnet host I'll issue the 'STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=no; START SLAVE COMMANDS; [09:07:26] after this, I'll able to check the outout of 'SHOW SLAVE STATUS\G' if the using_gtid: no and gtid_io_pos: shows what I need to see [09:07:46] \o/ [09:07:58] cool, then let me do this [09:08:55] the Using_Gtid: No, but the Gtid_OP_Pos: kept the old gtid's [09:09:21] that is expected [09:09:27] which is not the same as I've seen on the db2045 [09:09:36] I think I'll need to clean that up [09:10:08] do i? [09:10:24] sec. 
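(Editorial note: the long Gtid_IO_Pos that worries banyek a few lines further down is just MariaDB's per-domain GTID bookkeeping. Each element has the form domain_id-server_id-sequence, and the list stays around even after MASTER_USE_GTID=no because it mirrors what the master has in its binlogs; trimming it would need a binlog purge, as jynus notes.)
```
-- One element of the Gtid_IO_Pos shown in the paste below:
--   0-171970580-683331037
--   ^ ^^^^^^^^^ ^^^^^^^^^
--   | |         +-- sequence number within that domain
--   | +------------ server_id of the server that wrote the event
--   +-------------- gtid_domain_id (0 is the default domain)
SELECT @@gtid_domain_id, @@server_id;   -- the two values a server stamps on
                                        -- every GTID it generates
```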
I've got a phone call [09:10:24] no, you don't have to clean that up, I can explain you why that Gtid_IO_Pos has so many stuff in there later [09:10:39] it is not easy to cleaup [09:10:49] it gets those from the master [09:10:58] and that neads a binlog purge, which we are not going to do [09:11:01] those are the global_gtid_domain_id :) [09:11:05] there is abug about that [09:11:11] which comes with all the bug report we had to file about multisource [09:11:23] I think I sent you the link, but I can do it later, so you can read it and have great fun [09:12:02] but, let's leave that for later and go with the next step [09:14:09] back [09:15:30] what I am positive for now: the slave thread in connected to the master host [09:17:08] and I see the server_id of db1069 in the output of 'show slave hosts' on db2034 [09:17:24] is gtid disabled? [09:18:17] yes [09:18:21] ```MariaDB [(none)]> pager grep Gtid; show slave status\G pager; [09:18:21] PAGER set to 'grep Gtid' [09:18:21] Using_Gtid: No [09:18:21] Gtid_IO_Pos: 0-171970580-683331037,171966572-171966572-316946218,180363268-180363268-40608909,171970580-171970580-596994206,1-171970580-1,180355159-180355159-103313729,171974681-171974681-198565537 [09:18:21] 1 row in set (0.00 sec) [09:18:21] Default pager wasn't set, using stdout.``` [09:18:34] great! so what's next? [09:20:31] I don't know [09:20:49] the whole poing of this was to setup circular replication, how to do that? [09:20:50] so, what is it what we are trying to achieve? [09:21:43] to make sure when we flip over the datacenters and the writes back to eqiad, the replication keeps working [09:21:55] that's it [09:22:25] so, we set up now replication as db1069 -> db2034 [09:22:29] there you go [09:22:54] ok [09:23:00] let's get the coordinates then [09:23:15] I can stop slave on db1069 now [09:23:25] to 'freeze' the coordinates [09:23:37] right? [09:23:52] why do you need freezing? [09:24:39] also, stop slave won't freeze coordinates [09:24:41] ah [09:25:04] why not? does anything else writes the host? [09:25:15] pt-heartbeat-wikimedia [09:25:15] check :) [09:25:30] which is most of the reasons why we have to not use gtid [09:25:38] oh [09:25:38] (although not all) [09:25:55] but we don't care to skip a few heartbeats [09:26:04] it does replace on the same row [09:26:11] so plan? [09:26:53] (note we could do a better heartbeat plan, I am just stating facts about our current setup, not saying that is good or bad) [09:27:47] ```1, copy the output of `ps-ef pt-heartbeat' to a texteditor [09:27:47] 2, kill the process [09:27:47] 3, set up replication on db2034 with the output of show master status on db1069 (after the pt-heartbeat is killed) [09:27:47] 4, restart pt-heartbeat [09:27:47] ``` [09:28:22] as jynus said, we don't mind skipping a few heartbeat transactions [09:28:51] so just pick one binlog pos? [09:29:02] yep [09:29:33] Keep in mind that also killing heartbeat would generate (fake) lag on eqiad, and if you are not quick enough, it will spam with alerts (only irc) [09:29:45] banyek: that is circular replication 101 [09:31:02] then just a simple change master to ... 
start slave as [09:31:08] ` [09:31:08] CHANGE MASTER TO MASTER_HOST='db1069.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD='...secret...', MASTER_LOG_FILE='db1069-bin.000185', MASTER_LOG_POS='64589058'; START SLAVE;` [09:31:15] wrong [09:31:27] banyek: you are missing MASTER_SSL=1; [09:31:27] this is a cross dc replication [09:31:35] TRUE [09:31:41] (we do it on all setups, so please stop not using it) [09:31:57] ```CHANGE MASTER TO MASTER_HOST='db1069.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD='...secret...', MASTER_LOG_FILE='db1069-bin.000185', MASTER_LOG_POS='64589058', MASTER_SSL=1; START SLAVE;``` [09:32:01] no [09:32:07] don't start slave [09:32:09] don't start slave yet [09:32:13] ok [09:32:18] we double check, as we did with manuel [09:32:18] so we can verify [09:32:20] ```CHANGE MASTER TO MASTER_HOST='db1069.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD='...secret...', MASTER_LOG_FILE='db1069-bin.000185', MASTER_LOG_POS='64589058', MASTER_SSL=1; ``` [09:32:29] (which is what jaime did with every single change I did) [09:32:50] banyek: where will you run that? [09:32:54] ok makes sense [09:33:04] marostegui: db2034.codfw.wmnet [09:33:23] can I? [09:33:26] banyek: this is kinda critical, so we normally ask each other (whoever sets up replication) to check what the other did, no matter how many times we've done it [09:33:36] banyek: yep, run it and we can check [09:33:45] ok, noted [09:33:59] that is true for all non-trivial actions [09:34:26] you can check now [09:34:32] checking [09:36:01] looks good to me [09:36:06] same here [09:36:27] now start slave on db2034 and wait for breakage [09:36:36] XD [09:36:38] starting slave [09:36:54] I am being serious, even the most trivial changes [09:37:05] replicates [09:37:08] you should expect failure so you don't get surprised [09:37:34] So all masters in codfw have a working replication channel [09:37:57] banyek: on thursday we will do the opposite action, so get ready for it ;-) [09:38:14] I will [09:38:20] and should take much less time [09:38:41] or you can do it on friday, while both of us are on holidays \o/ [09:38:43] review your notes and ask anything you may not fully undertood about arch [09:38:51] or decisions [09:38:55] heh [09:39:07] no, this was clear, I just have to note it down [09:39:08] yeah, ask as many things as you need to understand what was done and most importantly, why [09:39:29] technically the "setup replication" is automated on the wmf replication libary [09:39:41] and it being setup is the normal state [09:39:50] but we disable it for maintenance [09:39:57] and to avoid mistakes [12:15:53] I deploy the wmf-pt-kill now rest of the labsdb hosts [12:15:59] (1009 and 1011) [12:24:20] hah, we already had the first query catched by it: # 2018-10-08T12:04:15 KILL 73042306 (Query 14408 sec) Select [12:24:26] (analytics) [12:56:02] 10DBA, 10Data-Services, 10Patch-For-Review, 10User-Banyek: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983 (10Banyek) [12:57:20] 10DBA, 10Data-Services, 10Patch-For-Review, 10User-Banyek: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983 (10Banyek) I am not sure if we wanto to close this ticket. I mean the original problem is now solved, but it would be nice to keep track when the u... [13:13:55] banyek: check it is no killing queries with 0 seconds [13:15:02] I checked the logs so far, and I only found one event so far, and that query took 14408 sec [13:15:10] Great! 
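(Editorial note: for completeness, the coordinates fed into the CHANGE MASTER above come from the eqiad side. Because pt-heartbeat keeps writing, the position moves constantly, but as discussed, skipping a few heartbeat rows is harmless, so any recent position works. A sketch; the file and position are simply whatever SHOW MASTER STATUS returned at the time.)
```
-- On db1069 (eqiad), pick up a recent binlog position:
SHOW MASTER STATUS;   -- e.g. File: db1069-bin.000185, Position: 64589058
-- Then on db2034 (codfw): the CHANGE MASTER ... MASTER_SSL=1 quoted above,
-- double-checked by a second person, and only then:
START SLAVE;
-- Sanity check on both ends:
SHOW SLAVE STATUS\G   -- on db2034: both threads running, no errors
SHOW SLAVE HOSTS;     -- on db1069: db2034's server_id shows up as a replica
```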
[13:15:44] what shall we do with that ticket? [13:15:52] 10DBA, 10Data-Services, 10Patch-For-Review, 10User-Banyek: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983 (10Marostegui) >>! In T183983#4649383, @Banyek wrote: > I am not sure if we wanto to close this ticket. > I mean the original problem is now solved... [13:15:55] hehe ^ [13:15:57] :D [13:16:22] cool [13:16:58] have you tested if you kill the process and then puppet runs and starts it again? [13:17:26] no, that's a good idea to check [13:17:29] tx [13:18:04] I don't remember if you set ensure running or stopped [13:18:16] But it would be a nice check to have [13:18:31] it was ensure=>stopped for the weekend, but the enabling part was to ensure running [13:18:44] great [13:18:59] so it is now ensure running? [13:19:05] yes [13:19:10] cool [13:19:48] one thing is missing only, as Jaime found out: probably we want to set up logrotate [13:19:56] yep [13:31:23] I am going to start checking the GTID status in all the hosts in eqiad [13:37:18] lets close all pt-kill related tasks, but let's open one for rotating /var/log/wmf-pt-kill/wmf-pt-kill.log [13:37:28] 👍 [13:37:54] actually checking the DBA table, I found this: https://phabricator.wikimedia.org/T165677 [13:38:15] I have an out-of-the-box working go solution for this, I can rewrite it in python [13:38:22] ? [13:38:38] we have checks for mysql [13:38:39] http status check for mysql [13:38:50] what it is needed is a pybal native solution [13:39:27] and honestly, we shouldn't work on that at is more than likely we will not use pybal for load balancing [13:40:03] ok [13:43:51] jynus: Are you still using db1110 for testing or can that be repooled? [13:54:55] db1110? [13:55:01] I cannot remmember [13:55:25] ah, yes, I tested the import there [13:55:32] but then deleted the dbs [13:55:40] I can repool ir [13:56:02] let's let banyek do it! [13:56:09] so he gets some more practice with db-eqiad? :) [13:56:29] oh yeah [13:56:32] oh [13:56:36] I just hit revert [13:56:40] haha [13:56:41] oh no [13:56:45] as it took me only 1 second [13:56:50] :) [13:56:56] `/o\` [13:56:58] he can merge if he wants [13:57:05] lemme [13:57:08] jynus: what other checks did you mention you wanted to do aside from gtid? [13:57:31] gtid, semisync, binlog format [13:57:39] So, gtid is done [13:57:52] banyek: do you wanna check semsync and I take binlog format? [13:58:40] I am in the same relation of semisync replication as with GTID replication [13:59:02] I can take semisync and you take binlog format then? [14:00:48] I'd go for semisync as a new thing but let me finish the repooling [14:00:52] It is not merged yer [14:00:54] yet [14:01:27] ah cool :) [14:01:35] I will start checking binlogs then :) [14:13:38] semi sync master enabled on s1 (db1067) [14:14:03] post that on the public channel better, so we are all on the same page [14:17:35] don't paste 200 lines on operations, or any channel [14:17:43] create a paste, link it from there [14:17:56] and don't touch the masters if they had it already enabled [14:17:59] jynus: any specific reason why the dbstore_multinstance uses STATEMENT and not row? [14:18:38] is statement on config? [14:18:50] or on yaml or where? 
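(Editorial note: a sketch of the per-host audit being divided up here, binlog format on one side and semisync on the other, run against each host e.g. through mysql.py. The variable and status names are the standard semisync plugin and server variables, not anything WMF-specific.)
```
SELECT @@global.binlog_format;                         -- ROW vs STATEMENT
SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_%_enabled';  -- master/slave side enabled?
SHOW GLOBAL STATUS    LIKE 'Rpl_semi_sync_%_status';   -- and actually active?
```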
[14:18:54] haven't checked yet, I was still running the live check :) [14:19:05] note dbstores don't have a binlog [14:19:10] so not taht it matters much [14:19:25] so probably we didn't bother to change the default [14:19:41] beacuse, well, it wouldn't have any effect [14:21:05] yeah, that is probably it, as there is no binlog format on their config [14:21:23] mystery solved [14:24:33] jynus: I am going to revert: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465173/ [14:24:49] this: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/452649/ [14:25:04] db1122 is no longer the candidate master so it should have ROW [14:25:21] looks like it was done during an emergency https://phabricator.wikimedia.org/T201694 [14:25:46] why row? [14:25:54] because it is a slave [14:25:57] we don't have any core host in row [14:26:07] explicitly [14:26:14] except the sanitarium hosts [14:26:30] we shoudl keep on yaml only the ones that have a reason [14:26:51] We have many of them on ROW on the yaml [14:27:00] Not saying it is good, but it is the fact [14:27:06] Either way, this one should not be STATEMENT [14:27:24] I don't disagree [14:27:24] I will remove it (with a different commit) [14:27:33] but I think you should just remove the line [14:27:39] yeah, I will do that [14:27:45] we should only have as row the sanitarium's masters [14:27:54] or for other reasons [14:28:05] if not, we should keep the default [14:28:36] we need to review the yamls then (at some other time) [14:29:04] sure [14:29:14] we may want to remove all binlog format ones [14:29:17] yep [14:29:35] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465174/ [14:29:36] as the masters or sanitarium master's could be its own role or propery [14:29:46] but as you said, not now [14:30:10] or "candidate master" [14:30:35] that way it is clear the reasons for the format rather than harcoding it [14:30:49] what? [14:31:21] RE: we need to review the yamls then (at some other time) [14:31:32] ah yeah [14:31:34] work for another time [14:31:41] I thought you were talking about the commit message [14:31:41] +1 [14:31:44] as there is a very similar sentence [14:31:45] haha [14:33:42] it was a good call to delay the table rename + filter [14:34:15] there is a chance of delaying the switchback due to network issues [14:34:31] what? were is that being discussed? [14:34:33] *where [14:34:52] ah [14:34:57] the previous conversation on the other channel [14:35:01] you mean that? [14:36:07] informally, nothing serious [14:36:14] but I guess it is a possiblity [14:36:41] you mean that? -> yes [14:42:08] I will upgrade, restart and reboot db1122 for the binlog change [14:42:11] tomorrow [14:44:29] binlog format have been checked [14:44:35] banyek how's the semisync going? [14:44:46] need any help? [14:45:13] checked all the masters [14:45:19] I put it on operations [14:45:27] https://www.irccloud.com/pastebin/a8pk7Qr3/ [14:45:44] yeah, what about the slaves? [14:46:17] and please don't tell me you manually logged to each one and run commands... [14:46:33] ACTUALLY I did [14:46:45] but I know I need to use cumin for that [14:47:03] banyek: you can use zarcillo for that even [14:47:15] banyek: you have more than 150 slaves to check! 
:p [14:47:50] or even: "section" command [14:48:05] * marostegui <3 section [14:48:19] I will vanish today at 17:00 because I have errands to run [14:48:37] try this: ./software/dbtools/section x1 | while read db; do mysql.py -BN -h $db -e "SELECT @@hostname"; done [14:49:18] you may need the operations/software repo [14:50:46] sure, just keep this in mind for tomorrowP [14:50:47] ^ [14:52:03] I'll do it later today [14:52:08] BUT THIS IS THE VERY MOMENT [14:52:16] for solving this: [14:52:25] https://www.irccloud.com/pastebin/aeYZZMBX/ [14:52:42] we were talking about this [14:53:22] ah [14:53:23] sudo -i [14:53:43] yep [14:54:40] or just ls -l /home/jynus/.my.cnf and do the same [14:56:04] so you still need sudo but without -i, up to you [14:56:25] https://www.irccloud.com/pastebin/NJD0m8HA/ [14:56:44] I'll make this working/and better looking [14:56:54] Maybe today later, maybe tomorrow morning [14:56:56] TIL [14:56:57] thx [14:57:33] now I run [14:57:44] I'll put the paste here and to the operations as well [14:57:51] bye [14:57:53] thanks [14:58:40] np