[05:16:20] 10DBA, 10Operations, 10ops-eqiad: db1064 has disk smart error - https://phabricator.wikimedia.org/T206245 (10Marostegui) 05Open>03Resolved a:03Marostegui Thanks! ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1, Secondary-0, RAID Level... [05:18:38] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Marostegui) This disk has been marked with errors, can we get a different one? ``` Span: 5 - Number of PDs: 2 PD: 0 Information Enclosure Device ID: 32 Slot Number: 10 Drive's position: DiskGroup: 0, Spa... [05:20:31] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Marostegui) The disk failed to rebuild, if this is a brand new disk, can you pull it out wait a couple of minutes and then pull it in back? Thanks! ``` Enclosure Device ID: 32 Slot Number: 3 Drive's posi... [05:32:54] 10DBA, 10Data-Services, 10Patch-For-Review, 10User-Banyek: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983 (10Marostegui) Upstream updated the ticket saying: fix: 3.0.12 which was released a few days ago. I tested it and it is indeed not fixed there (and... [07:06:49] banyek jynus https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/465120/ what do you guys think? [07:07:36] I don't know [07:08:53] 10DBA, 10User-Banyek: BBU problems dbstore2002 - https://phabricator.wikimedia.org/T205257 (10Marostegui) >>! In T205257#4641904, @Papaul wrote: > @Marostegui yes next Thursday works for me. @Papaul let's move this to some other day. Thursday 11th is right after the failover, and we might have some clean up... [07:09:04] The bad BBU is just a performance issue [07:09:13] "just" [07:10:20] So, opinions? [07:10:34] the host is not getting writes directly, so I don't think if anything would change [07:10:39] I'd say lgtm [07:10:49] banyek: But it will have the reads overhead [07:10:49] (and keep an eye on that) [07:11:44] we can revert this really quick if needed [07:12:24] Or put the raid controller to write-back enabled to mitigate any possible lag and then revert [07:14:52] let's say the macine crashes because of reasons, and b/c the enabled write-back the data get corrupted? so what, we reclone it [07:15:40] I mean it is not a master, nor a dedicated master, nor a host with n sections on it [07:16:46] ok, I will pool it, if it lags as soon as it gets reads, let's enable write-back, revert and then disable write back [07:19:33] Maybe the BBU will arrive between and tomorrow, although today is a public holiday in USA so DCOps won't be there [07:19:52] So I think we won't be able to change it [07:20:07] Before the failover, I mean [07:20:38] ok [07:21:05] I start to deploy the wmf-pt-change on labsdb1010 [07:21:47] I disable puppet on labsdb hosts [07:25:11] jynus: let me know when you want to start doing the pending steps for the wikis movement [07:25:24] Once that is done, I would say we enable replication eqiad -> codfw [07:25:56] shouldn't we wait for the rename to the same day? [07:26:10] that way we don't have to replicate a lot of days [07:26:41] we can setup the s5 filter in advance, though [07:27:00] yeah, we can do that now yep [07:27:36] then rename s3 tables the day after tomorrow [07:27:44] sounds good [07:28:02] Enabling eqiad > codfw replication shouldn't matter, do you want to do that today? 
[07:28:11] shouldn't matter for the table rename, I mean [07:29:17] yes [07:29:40] it does matter a bit, we need an s5 filter [07:29:57] for the s3 tables [07:30:19] yeah, to ignore those wikis [07:32:16] I should start enabling those? [07:35:59] let me help [07:37:00] sure, how do we organize? [07:38:01] you need to disable gtid [07:38:09] on the eqiad master [07:38:24] and then start rep everywhere except s5 [07:38:27] yep :) [07:38:30] which needs extra filters [07:38:43] I will leave s5 for the last [07:39:50] e.g. change db2048 to db1067-bin.001646:717261743 (or whatever) and I can double check before you start slave [07:39:58] great! [07:40:00] let's do that [07:40:14] I am going to disable gtid on all masters (as I said, not touching s5 at all for now) [07:43:21] s1: change 2048 to: db1067-bin.001646 723318262 [07:43:38] did you set it up, should I check now? [07:43:54] No, not yet [07:44:09] I will let you know right before doing start slave; that's better [07:44:16] so you can check host, connection etc [07:44:54] check s1: db2048 [07:45:50] ok, doing [07:46:46] I am not trusting you at nothing, that is why it is taking me so long [07:47:03] please, do so [07:47:09] this is a very critical thing [07:47:18] s1 ok [07:47:23] ideally I would also like banyek to check ;) [07:47:43] starting s1 codfw slave [07:48:48] no issues, right? [07:49:02] nope [07:49:10] going for next one [07:50:01] jynus: check s2: db2035 [07:50:45] I'm not sure what to check [07:50:55] I see replication set up but stopped [07:51:03] banyek: We are enabling replication eqiad -> codfw [07:51:18] i know that [07:51:19] So it would be cool if you can also check what I am doing, just in case :-) [07:51:25] I just don't know what to check ;) [07:51:26] s2 is ok [07:51:44] s2 started [07:51:54] we can talk later about common mistakes [07:52:03] yep, I see it's running now [07:52:31] banyek: basically that the output of show slave status looks good, is the master correct? is the binary log the same that is one running on the current eqiad master? is position "recent"? is ssl enabled?, is the username correct? [07:52:51] replication filters [07:52:52] gtid [07:53:03] wrong master [07:53:07] wrong replica [07:53:18] ok [07:53:49] I wonder if instead of a filter, we could setup a multisource replcia with a filter [07:53:59] so things are kept up to date [07:54:18] but that would need some testing [07:54:34] banyek, jynus check s3: db2043 [07:55:03] checking [07:56:00] mmm is there something wrong? 
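(Editorial note: for readers following along, here is roughly what a per-section instruction like "s1: change 2048 to: db1067-bin.001646 723318262" expands to. The coordinates are the ones quoted above, the password is a placeholder, and MASTER_SSL=1 is included because these are cross-datacentre links, a point that comes up again later in the log. This is a sketch of the procedure being described, not a copy of the exact command that was run.)
```
-- On the codfw s1 master (db2048), pointing it at the eqiad s1 master:
CHANGE MASTER TO
  MASTER_HOST     = 'db1067.eqiad.wmnet',
  MASTER_USER     = 'repl',
  MASTER_PASSWORD = '...secret...',        -- placeholder, never pasted in channel
  MASTER_LOG_FILE = 'db1067-bin.001646',
  MASTER_LOG_POS  = 723318262,
  MASTER_SSL      = 1;
-- Deliberately no START SLAVE yet: a second person reviews the setup first.
```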
[07:56:37] tendril tree looks weird [07:57:15] banyek: tendril looks funny because of all the replication we had to set up to move the wikis around between s3 and s5, I encourage you to read the document jaime wrote with the plan [07:57:54] I have lag, will come back for you in a second [07:57:59] coolio [07:58:00] no rush [07:58:22] I read it [07:58:27] *was [07:58:42] it just looks weird [07:58:48] in which sense [07:59:00] marostegui: it was ok [07:59:15] you can start replication [07:59:20] jynus: doing it [07:59:48] the tree seems more complex than before - which makes sense [08:00:00] nothing wrong [08:00:08] the problem is that it is not a tree [08:00:11] but a graph [08:00:17] check s4: db2051 [08:00:17] however, that is not shown on the tree [08:00:30] but it is show on the host details [08:01:24] s4 looks good [08:01:33] starting [08:01:49] not doing s5 now [08:02:25] s6 next [08:02:29] check s6: db2039 [08:03:37] s6 looks good [08:03:41] starting [08:04:20] lgtm [08:04:34] check s7: db2040 [08:05:32] s7 is good [08:05:38] starting [08:05:55] yes, it si [08:06:25] check s8: db2045 [08:07:06] seems good [08:07:09] s8 is good [08:07:27] x1, es2, s3 now? [08:07:28] starting [08:07:39] jynus: I thought maybe banyek want to do those? :-) [08:08:21] I can do x1 first [08:08:27] yep [08:08:32] You need to disable GTID first [08:08:45] wait [08:08:53] first I have to find the two hosts to log in [08:09:01] :) [08:09:08] 2034 / 1069 [08:09:11] don't log it, use neodymium and mysql.py, it is faster [08:09:42] Codfw masters (not s5) replication confirmed working [08:09:59] pending s5, x1, es2 and es3 [08:10:13] yes I checked those too after setting it up, but only some second after [08:10:24] Yeah, I just ran a quick check to be sure [08:10:27] Thanks for checking :) [08:12:13] banyek: are you ready? [08:12:22] plz don't rush me [08:12:26] ok ok [08:12:33] :) [08:12:51] do the thing is [08:12:52] you can paste the command here if you like (make sure not to post the password) [08:13:10] so [08:13:13] we disable gtid [08:13:17] where? [08:13:26] set up replication as-is (I mean with a current log pos) [08:13:37] and when we'll enable gtid then the missing parts will be filled? [08:13:42] no [08:13:52] we just disable gtid on the active masters [08:13:53] that's why i dont want to rush :) [08:13:53] nope, we leave gtid disabled till replication is disconnected in codfw [08:14:37] banyek: have you identified the two involved hosts? [08:14:40] because our setup doesn't like multi-masters [08:14:54] the hosts are ok [08:15:01] ? [08:15:01] db2034 and db1069 [08:15:17] banyek: good! so where do you need to disable gtid? [08:15:17] that's the thing I am sure now [08:15:22] what do you mean "the hosts are ok"? [08:15:32] db1069 [08:16:20] good! then you have the first step identified, how will you disable it? [08:16:23] "the hosts are ok" = "I am positive that we are talking about those two hosts" [08:16:34] ah, ok, I didn't undertood that [08:16:34] banyek: be verbose :-) [08:17:28] (I am searching for the command - I never did this before) [08:17:35] sure, take your time [08:17:56] marostegui: maybe we can do es2/3 to speed up the process [08:17:58] `CHANGE MASTER TO MASTER_AUTO_POSITION = 0; ` on master maybe? 
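(Editorial note: the checklist marostegui spells out above, "is the master correct? is the binary log the same... is ssl enabled? is the username correct?", maps onto two statements. A sketch of the comparison, with the SHOW SLAVE STATUS fields one would actually look at:)
```
-- On the codfw master that was just configured:
SHOW SLAVE STATUS\G
--   Master_Host / Master_User             -> right master, right replication user?
--   Master_Log_File / Exec_Master_Log_Pos -> the binlog the eqiad master is
--                                            writing right now, at a recent position?
--   Master_SSL_Allowed                    -> Yes (cross-DC traffic is encrypted)
--   Replicate_Wild_Ignore_Table etc.      -> only the filters you expect
--   Using_Gtid                            -> No (these channels are set up from
--                                            binlog coordinates, not GTID)
-- On the eqiad master, for the file/position comparison:
SHOW MASTER STATUS;
```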
[08:18:00] and leave x1 untouched [08:18:03] jynus: yep [08:18:15] banyek: let us finish the others [08:18:24] and you keep reading :-) [08:18:29] ok [08:18:33] banyek: do a research on what you'd do and then we finish es2 and es3 meanwhile [08:18:41] ok [08:18:46] we will not touch x1 [08:19:05] marostegui: you research es2, I confirm [08:19:12] jynus: yep, doing it now [08:19:15] Will ping you to check it [08:20:20] jynus: check es2: es2016 [08:20:53] es2 looks correct [08:20:59] starting [08:21:53] jynus: check es3: es2017 [08:23:19] es3 masters not confusing at all [08:23:53] marostegui: start slave @ es3 , looks good [08:24:00] great [08:24:05] done [08:24:07] let's do s5? [08:24:22] ok, I guess [08:24:45] SHOW ALL SLAVES status\G [08:24:55] or you prefer to do it closer to the failover? [08:25:03] we should do this now [08:25:08] ok! [08:25:12] Doing the research [08:25:12] the rename and filters later [08:25:44] there is no harm to do the filters now, isn't it? [08:26:02] ? [08:26:10] I mean the codfw filters on s3 [08:26:16] I meant the filters on s5 :) [08:26:18] you need filters on s5 [08:26:43] the problem is gtid may break [08:27:07] I think `SET GLOBAL GTID_DOMAIN_ID=0` [08:27:16] yeah, we need to disable it on db1070 [08:27:23] banyek: that is not disabling gtid [08:27:48] to be fair, technically gtid is always enabled on mariadb 10.0+ [08:27:49] fine [08:28:03] it is just active or inactive [08:28:05] there's is no gtid_mode variable which I used to use [08:28:18] banyek: remember this is mariadb [08:28:33] I can't forget it [08:28:35] which has its own implementation of gtid [08:28:52] I know, that's why I have no idea what I have to do [08:28:57] jynus: so the idea would be to disable gtid on both threads of db1070 [08:29:01] I never workded nor seen mariadb before here [08:29:08] banyek: just check mariadb doc to see how to disable gtid :-) [08:29:13] marostegui: in theory is is already disabled [08:29:22] as it would fail otherwise [08:29:28] jynus: true, it is multisource [08:29:53] so the filter needed for s5 is to ignore the "new" wikis [08:29:57] so we need to setup repliation as usual [08:30:09] and add a filter on codfw for the new, not existent wikis [08:30:13] correct [08:30:17] and then check gtid doesn't break [08:30:23] which shouldn't [08:30:43] because it doesn't break on the eqiad replicas with gtid enabled [08:31:00] worse case scenario, codfw master breaks [08:31:04] (replication) [08:31:09] which won't affect production [08:32:08] setup the replica and we will quadruple check it [08:32:12] yep [08:32:18] on db2052 [08:33:10] my worry is circular replication + filters shounds scary [08:33:58] jynus: check s5: db2052 [08:34:04] I just find in docs 'how to disable using gtid on slaves when replicating', [08:35:25] banyek: cool [08:35:29] jynus: Replicate_Wild_Ignore_Table: enwikivoyage.%,cebwiki.%,shwiki.%,srwiki.%,mgwiktionary.% [08:35:34] yeah [08:35:41] I am thinking about the heartbeat table [08:35:48] mmm [08:35:59] from s3, if it could be stuck in a circle [08:36:00] might break the sanitariums? [08:36:28] but we don't replicate that, right? [08:36:40] yeah, we skipped that [08:36:42] let me check [08:36:43] so that should be ok [08:37:52] so I can disable gtid on db2034 as `CHANGE MASTER TO MASTER_USE_GTID=OFF` [08:37:58] marostegui: I think to the best of our plan, that should be ok [08:38:08] banyek: we are with you in a sec [08:38:13] okok [08:38:24] jynus: the filters look good, right? 
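(Editorial note: a hedged sketch of how a filter like the Replicate_Wild_Ignore_Table list quoted above can be put in place on the codfw s5 master, db2052. It assumes the filter variable can be set dynamically with the slave stopped; otherwise it goes into my.cnf and needs a restart. The wiki list is the one from the paste above.)
```
STOP SLAVE;
SET GLOBAL replicate_wild_ignore_table =
    'enwikivoyage.%,cebwiki.%,shwiki.%,srwiki.%,mgwiktionary.%';
START SLAVE;
-- Check it is in effect:
SHOW SLAVE STATUS\G   -- Replicate_Wild_Ignore_Table should list the five wikis
```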
[08:38:25] I checked both db1070 [08:38:30] yes [08:38:39] and db2052 [08:38:47] let's do it [08:38:49] let's start replication [08:38:56] done [08:39:18] no complains so far [08:39:23] let me check network [08:39:58] no high spike so far [08:40:05] and no errors from the replicas? [08:41:19] yeah, it looks good [08:42:00] actually, if there is no writes to s3 eqiad for the old wikis, we may not need filters on codfw for s3? [08:42:24] tables will be renamed [08:42:30] so any write will fail actually [08:42:38] yes but only on the master, right? [08:42:38] so yeah, we probably don't need them [08:42:41] yeah [08:42:54] but then we don't need the filters on s3 codfw master [08:42:55] I will need to setup a full plan for the switch [08:43:12] banyek: let's go with x1? [08:43:12] but only to recover the lost edits [08:43:18] jynus: yep [08:43:20] we will think about that later [08:43:33] yep [08:44:12] banyek: let me pull your neck a bit by saying that from "there is nothing to check!" to "I may know what we may have to do" there is a huge leap :-D [08:44:39] it was 'I don't know what to check' [08:44:58] ok, fair [08:45:08] do you have some commads for us? [08:45:10] banyek: That is why I thought it was a good idea to give you one master to do, so you get in full context :-) [08:45:23] marostegui: agree [08:45:34] banyek: also don't worry, if you mess up, you only break all the servers on production for that section! [08:45:43] XDD [08:46:23] do you know the commands? [08:46:29] jynus: I was googled the hell out of 'disabling gtid on mariadb' but found nothing, so I turned to 'setting up gtid for mariadb' :D And I only found how to disable gtid on a slave [08:46:42] banyek: so, is that enough? [08:46:42] ok, that may be it [08:46:49] with `change master to master_use_gtid=off' [08:46:59] banyek: how will you check if that was successful? [08:47:02] as the master always "replicates" gtid [08:47:10] what we want is to disable it as a replica [08:47:25] which is more improtant- where to run it? [08:47:35] and how to check if it was successful [08:47:41] oh [08:47:42] yep, both [08:48:09] well, my guess is the GTID_OP_POS will be empty on the master's show slave status output [08:48:11] hm [08:48:21] guess again :-) [08:48:40] banyek: hint: you might want to compare outputs ;-) [08:50:41] what I am seeing now: db1069 is already set up as a slave of db2034 but gtid_io_pos has gtid's [08:51:00] and there's the using_gtid=slave_pos already set up [08:51:03] anything else interesting on that output? [08:51:04] right [08:51:16] maybe check another host we disabled gtid on? [08:51:19] and check the differences? [08:51:33] e.g. db2034 [08:52:02] on db2034 the show slave status output is empty [08:52:17] so look for another one :) [08:52:21] fair [08:52:48] pn db2045 I just see this two setups, the 'Using_gtid' and the 'GTID_IO_POS' [08:53:15] the other variables seems 'normal' [08:53:21] banyek: what does Using_gtid says? [08:53:47] No || Slave_Pos [08:53:53] there you go, no? [08:53:57] yep [08:54:13] so, what's the plan? [08:54:13] an easier question now [08:54:33] if you just run "change master to master_use_gtid=off" it will fail, why? 
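(Editorial note: the answer being fished for here has two parts, which the next few lines of conversation work through. Spelled out as a sketch:)
```
-- 1) OFF is not a value MariaDB accepts; the options are
--    MASTER_USE_GTID = {current_pos | slave_pos | no}
-- 2) even with a valid value, CHANGE MASTER is refused while the replication
--    threads are running, so the command has to be wrapped like this:
STOP SLAVE;
CHANGE MASTER TO MASTER_USE_GTID = no;
START SLAVE;
```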
[08:54:56] because the 'off' != 'no'; [08:55:05] with no it will fail too [08:55:10] which actually makes sense [08:55:12] :-) [08:55:14] because the 'off' != '"no"'; [08:55:22] It will fail with "no" too [08:55:42] ^that is what I really wanted to ask [08:56:52] (that is not mariadb-specific, so you should not that one) [08:56:56] *know [08:56:58] I see this 3 options `MASTER_USE_GTID = {current_pos|slave_pos|no}` [08:57:59] Yeah, those are the mariadb GTID enabling options, but actually, changing to any of those will still fail :-) [08:58:53] as a reminder, we have a meeting in one hour, we want to be done by then :-) [08:59:09] please help, I am really clueless now, so you are just torturing me guys [08:59:16] <3 [08:59:39] any cHANGE MASTER command will fail with replication running [08:59:50] we don't want to stop replication on eqiad [08:59:58] banyek: in order to do change master you need to stop slave and then start it [09:00:06] so we just enclose thecommand with STOP SLAVE; --- ; START SLAVE; [09:00:49] https://www.dropbox.com/s/y8rp0q9edr7ioxw/Screenshot%202018-10-08%2011.00.44.png?dl=0 [09:01:13] banyek: no [09:01:29] banyek: you need to do: "no" instead of "slave_pos" [09:01:59] that is the command to "enable gtid" [09:02:08] so you did have it! [09:02:10] I meant I was know what I can't change that without stopping and wanted to show, (ofc. I was thinking on 'no' now) but I didn't write [09:02:19] but agree, I didn't wrote that [09:02:33] banyek: as I said, be verbose, we don't know what you are thinking :-) [09:02:38] ok, can you run "no" on which server? [09:03:16] I was really just focusing on 'what is the problem with the command itself' which was a mistake indeed [09:03:41] ```STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID = no; START SLAVE;``` [09:03:50] hm. [09:03:51] on which host? [09:03:52] where? [09:03:52] no. [09:04:00] db1069 [09:04:15] and how will you check it was indeed disabled? [09:04:27] tht will ensure that it will just use binlogs from db2034 [09:04:45] I am not sure I understand what you mean [09:04:47] in the output of 'SHOW SLAVE STATUS\G [09:04:48] ' [09:05:01] gtids [09:05:04] I think I am getting what you are saying [09:05:16] I am a bit stressed no [09:05:18] w [09:05:33] but it is more elaborate than that, it will use the bilong just for replication control [09:05:39] which is what we want [09:05:52] so let me take a big breath, and [09:05:53] ... [09:06:05] So, you've got the host, the command and how to check it, so we are good [09:06:41] on the db1069.eqiad.wmnet host I'll issue the 'STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=no; START SLAVE COMMANDS; [09:07:26] after this, I'll able to check the outout of 'SHOW SLAVE STATUS\G' if the using_gtid: no and gtid_io_pos: shows what I need to see [09:07:46] \o/ [09:07:58] cool, then let me do this [09:08:55] the Using_Gtid: No, but the Gtid_OP_Pos: kept the old gtid's [09:09:21] that is expected [09:09:27] which is not the same as I've seen on the db2045 [09:09:36] I think I'll need to clean that up [09:10:08] do i? [09:10:24] sec. 
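(Editorial note: the long Gtid_IO_Pos that worries banyek a few lines further down is just MariaDB's per-domain GTID bookkeeping. Each element has the form domain_id-server_id-sequence, and the list stays around even after MASTER_USE_GTID=no because it mirrors what the master has in its binlogs; trimming it would need a binlog purge, as jynus notes.)
```
-- One element of the Gtid_IO_Pos shown in the paste below:
--   0-171970580-683331037
--   ^ ^^^^^^^^^ ^^^^^^^^^
--   | |         +-- sequence number within that domain
--   | +------------ server_id of the server that wrote the event
--   +-------------- gtid_domain_id (0 is the default domain)
SELECT @@gtid_domain_id, @@server_id;   -- the two values a server stamps on
                                        -- every GTID it generates
```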
I've got a phone call [09:10:24] no, you don't have to clean that up, I can explain you why that Gtid_IO_Pos has so many stuff in there later [09:10:39] it is not easy to cleaup [09:10:49] it gets those from the master [09:10:58] and that neads a binlog purge, which we are not going to do [09:11:01] those are the global_gtid_domain_id :) [09:11:05] there is abug about that [09:11:11] which comes with all the bug report we had to file about multisource [09:11:23] I think I sent you the link, but I can do it later, so you can read it and have great fun [09:12:02] but, let's leave that for later and go with the next step [09:14:09] back [09:15:30] what I am positive for now: the slave thread in connected to the master host [09:17:08] and I see the server_id of db1069 in the output of 'show slave hosts' on db2034 [09:17:24] is gtid disabled? [09:18:17] yes [09:18:21] ```MariaDB [(none)]> pager grep Gtid; show slave status\G pager; [09:18:21] PAGER set to 'grep Gtid' [09:18:21] Using_Gtid: No [09:18:21] Gtid_IO_Pos: 0-171970580-683331037,171966572-171966572-316946218,180363268-180363268-40608909,171970580-171970580-596994206,1-171970580-1,180355159-180355159-103313729,171974681-171974681-198565537 [09:18:21] 1 row in set (0.00 sec) [09:18:21] Default pager wasn't set, using stdout.``` [09:18:34] great! so what's next? [09:20:31] I don't know [09:20:49] the whole poing of this was to setup circular replication, how to do that? [09:20:50] so, what is it what we are trying to achieve? [09:21:43] to make sure when we flip over the datacenters and the writes back to eqiad, the replication keeps working [09:21:55] that's it [09:22:25] so, we set up now replication as db1069 -> db2034 [09:22:29] there you go [09:22:54] ok [09:23:00] let's get the coordinates then [09:23:15] I can stop slave on db1069 now [09:23:25] to 'freeze' the coordinates [09:23:37] right? [09:23:52] why do you need freezing? [09:24:39] also, stop slave won't freeze coordinates [09:24:41] ah [09:25:04] why not? does anything else writes the host? [09:25:15] pt-heartbeat-wikimedia [09:25:15] check :) [09:25:30] which is most of the reasons why we have to not use gtid [09:25:38] oh [09:25:38] (although not all) [09:25:55] but we don't care to skip a few heartbeats [09:26:04] it does replace on the same row [09:26:11] so plan? [09:26:53] (note we could do a better heartbeat plan, I am just stating facts about our current setup, not saying that is good or bad) [09:27:47] ```1, copy the output of `ps-ef pt-heartbeat' to a texteditor [09:27:47] 2, kill the process [09:27:47] 3, set up replication on db2034 with the output of show master status on db1069 (after the pt-heartbeat is killed) [09:27:47] 4, restart pt-heartbeat [09:27:47] ``` [09:28:22] as jynus said, we don't mind skipping a few heartbeat transactions [09:28:51] so just pick one binlog pos? [09:29:02] yep [09:29:33] Keep in mind that also killing heartbeat would generate (fake) lag on eqiad, and if you are not quick enough, it will spam with alerts (only irc) [09:29:45] banyek: that is circular replication 101 [09:31:02] then just a simple change master to ... 
start slave as [09:31:08] ` [09:31:08] CHANGE MASTER TO MASTER_HOST='db1069.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD='...secret...', MASTER_LOG_FILE='db1069-bin.000185', MASTER_LOG_POS='64589058'; START SLAVE;` [09:31:15] wrong [09:31:27] banyek: you are missing MASTER_SSL=1; [09:31:27] this is a cross dc replication [09:31:35] TRUE [09:31:41] (we do it on all setups, so please stop not using it) [09:31:57] ```CHANGE MASTER TO MASTER_HOST='db1069.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD='...secret...', MASTER_LOG_FILE='db1069-bin.000185', MASTER_LOG_POS='64589058', MASTER_SSL=1; START SLAVE;``` [09:32:01] no [09:32:07] don't start slave [09:32:09] don't start slave yet [09:32:13] ok [09:32:18] we double check, as we did with manuel [09:32:18] so we can verify [09:32:20] ```CHANGE MASTER TO MASTER_HOST='db1069.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD='...secret...', MASTER_LOG_FILE='db1069-bin.000185', MASTER_LOG_POS='64589058', MASTER_SSL=1; ``` [09:32:29] (which is what jaime did with every single change I did) [09:32:50] banyek: where will you run that? [09:32:54] ok makes sense [09:33:04] marostegui: db2034.codfw.wmnet [09:33:23] can I? [09:33:26] banyek: this is kinda critical, so we normally ask each other (whoever sets up replication) to check what the other did, no matter how many times we've done it [09:33:36] banyek: yep, run it and we can check [09:33:45] ok, noted [09:33:59] that is true for all non-trivial actions [09:34:26] you can check now [09:34:32] checking [09:36:01] looks good to me [09:36:06] same here [09:36:27] now start slave on db2034 and wait for breakage [09:36:36] XD [09:36:38] starting slave [09:36:54] I am being serious, even the most trivial changes [09:37:05] replicates [09:37:08] you should expect failure so you don't get surprised [09:37:34] So all masters in codfw have a working replication channel [09:37:57] banyek: on thursday we will do the opposite action, so get ready for it ;-) [09:38:14] I will [09:38:20] and should take much less time [09:38:41] or you can do it on friday, while both of us are on holidays \o/ [09:38:43] review your notes and ask anything you may not fully undertood about arch [09:38:51] or decisions [09:38:55] heh [09:39:07] no, this was clear, I just have to note it down [09:39:08] yeah, ask as many things as you need to understand what was done and most importantly, why [09:39:29] technically the "setup replication" is automated on the wmf replication libary [09:39:41] and it being setup is the normal state [09:39:50] but we disable it for maintenance [09:39:57] and to avoid mistakes [12:15:53] I deploy the wmf-pt-kill now rest of the labsdb hosts [12:15:59] (1009 and 1011) [12:24:20] hah, we already had the first query catched by it: # 2018-10-08T12:04:15 KILL 73042306 (Query 14408 sec) Select [12:24:26] (analytics) [12:56:02] 10DBA, 10Data-Services, 10Patch-For-Review, 10User-Banyek: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983 (10Banyek) [12:57:20] 10DBA, 10Data-Services, 10Patch-For-Review, 10User-Banyek: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983 (10Banyek) I am not sure if we wanto to close this ticket. I mean the original problem is now solved, but it would be nice to keep track when the u... [13:13:55] banyek: check it is no killing queries with 0 seconds [13:15:02] I checked the logs so far, and I only found one event so far, and that query took 14408 sec [13:15:10] Great! 
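(Editorial note: for completeness, the coordinates fed into the CHANGE MASTER above come from the eqiad side. Because pt-heartbeat keeps writing, the position moves constantly, but as discussed, skipping a few heartbeat rows is harmless, so any recent position works. A sketch; the file and position are simply whatever SHOW MASTER STATUS returned at the time.)
```
-- On db1069 (eqiad), pick up a recent binlog position:
SHOW MASTER STATUS;   -- e.g. File: db1069-bin.000185, Position: 64589058
-- Then on db2034 (codfw): the CHANGE MASTER ... MASTER_SSL=1 quoted above,
-- double-checked by a second person, and only then:
START SLAVE;
-- Sanity check on both ends:
SHOW SLAVE STATUS\G   -- on db2034: both threads running, no errors
SHOW SLAVE HOSTS;     -- on db1069: db2034's server_id shows up as a replica
```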
[13:15:44] what shall we do with that ticket? [13:15:52] 10DBA, 10Data-Services, 10Patch-For-Review, 10User-Banyek: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983 (10Marostegui) >>! In T183983#4649383, @Banyek wrote: > I am not sure if we wanto to close this ticket. > I mean the original problem is now solved... [13:15:55] hehe ^ [13:15:57] :D [13:16:22] cool [13:16:58] have you tested if you kill the process and then puppet runs and starts it again? [13:17:26] no, that's a good idea to check [13:17:29] tx [13:18:04] I don't remember if you set ensure running or stopped [13:18:16] But it would be a nice check to have [13:18:31] it was ensure=>stopped for the weekend, but the enabling part was to ensure running [13:18:44] great [13:18:59] so it is now ensure running? [13:19:05] yes [13:19:10] cool [13:19:48] one thing is missing only, as Jaime found out: probably we want to set up logrotate [13:19:56] yep [13:31:23] I am going to start checking the GTID status in all the hosts in eqiad [13:37:18] lets close all pt-kill related tasks, but let's open one for rotating /var/log/wmf-pt-kill/wmf-pt-kill.log [13:37:28] 👍 [13:37:54] actually checking the DBA table, I found this: https://phabricator.wikimedia.org/T165677 [13:38:15] I have an out-of-the-box working go solution for this, I can rewrite it in python [13:38:22] ? [13:38:38] we have checks for mysql [13:38:39] http status check for mysql [13:38:50] what it is needed is a pybal native solution [13:39:27] and honestly, we shouldn't work on that at is more than likely we will not use pybal for load balancing [13:40:03] ok [13:43:51] jynus: Are you still using db1110 for testing or can that be repooled? [13:54:55] db1110? [13:55:01] I cannot remmember [13:55:25] ah, yes, I tested the import there [13:55:32] but then deleted the dbs [13:55:40] I can repool ir [13:56:02] let's let banyek do it! [13:56:09] so he gets some more practice with db-eqiad? :) [13:56:29] oh yeah [13:56:32] oh [13:56:36] I just hit revert [13:56:40] haha [13:56:41] oh no [13:56:45] as it took me only 1 second [13:56:50] :) [13:56:56] `/o\` [13:56:58] he can merge if he wants [13:57:05] lemme [13:57:08] jynus: what other checks did you mention you wanted to do aside from gtid? [13:57:31] gtid, semisync, binlog format [13:57:39] So, gtid is done [13:57:52] banyek: do you wanna check semsync and I take binlog format? [13:58:40] I am in the same relation of semisync replication as with GTID replication [13:59:02] I can take semisync and you take binlog format then? [14:00:48] I'd go for semisync as a new thing but let me finish the repooling [14:00:52] It is not merged yer [14:00:54] yet [14:01:27] ah cool :) [14:01:35] I will start checking binlogs then :) [14:13:38] semi sync master enabled on s1 (db1067) [14:14:03] post that on the public channel better, so we are all on the same page [14:17:35] don't paste 200 lines on operations, or any channel [14:17:43] create a paste, link it from there [14:17:56] and don't touch the masters if they had it already enabled [14:17:59] jynus: any specific reason why the dbstore_multinstance uses STATEMENT and not row? [14:18:38] is statement on config? [14:18:50] or on yaml or where? 
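(Editorial note: a sketch of the per-host audit being divided up here, binlog format on one side and semisync on the other, run against each host e.g. through mysql.py. The variable and status names are the standard semisync plugin and server variables, not anything WMF-specific.)
```
SELECT @@global.binlog_format;                         -- ROW vs STATEMENT
SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_%_enabled';  -- master/slave side enabled?
SHOW GLOBAL STATUS    LIKE 'Rpl_semi_sync_%_status';   -- and actually active?
```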
[14:18:54] haven't checked yet, I was still running the live check :) [14:19:05] note dbstores don't have a binlog [14:19:10] so not taht it matters much [14:19:25] so probably we didn't bother to change the default [14:19:41] beacuse, well, it wouldn't have any effect [14:21:05] yeah, that is probably it, as there is no binlog format on their config [14:21:23] mystery solved [14:24:33] jynus: I am going to revert: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465173/ [14:24:49] this: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/452649/ [14:25:04] db1122 is no longer the candidate master so it should have ROW [14:25:21] looks like it was done during an emergency https://phabricator.wikimedia.org/T201694 [14:25:46] why row? [14:25:54] because it is a slave [14:25:57] we don't have any core host in row [14:26:07] explicitly [14:26:14] except the sanitarium hosts [14:26:30] we shoudl keep on yaml only the ones that have a reason [14:26:51] We have many of them on ROW on the yaml [14:27:00] Not saying it is good, but it is the fact [14:27:06] Either way, this one should not be STATEMENT [14:27:24] I don't disagree [14:27:24] I will remove it (with a different commit) [14:27:33] but I think you should just remove the line [14:27:39] yeah, I will do that [14:27:45] we should only have as row the sanitarium's masters [14:27:54] or for other reasons [14:28:05] if not, we should keep the default [14:28:36] we need to review the yamls then (at some other time) [14:29:04] sure [14:29:14] we may want to remove all binlog format ones [14:29:17] yep [14:29:35] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465174/ [14:29:36] as the masters or sanitarium master's could be its own role or propery [14:29:46] but as you said, not now [14:30:10] or "candidate master" [14:30:35] that way it is clear the reasons for the format rather than harcoding it [14:30:49] what? [14:31:21] RE: we need to review the yamls then (at some other time) [14:31:32] ah yeah [14:31:34] work for another time [14:31:41] I thought you were talking about the commit message [14:31:41] +1 [14:31:44] as there is a very similar sentence [14:31:45] haha [14:33:42] it was a good call to delay the table rename + filter [14:34:15] there is a chance of delaying the switchback due to network issues [14:34:31] what? were is that being discussed? [14:34:33] *where [14:34:52] ah [14:34:57] the previous conversation on the other channel [14:35:01] you mean that? [14:36:07] informally, nothing serious [14:36:14] but I guess it is a possiblity [14:36:41] you mean that? -> yes [14:42:08] I will upgrade, restart and reboot db1122 for the binlog change [14:42:11] tomorrow [14:44:29] binlog format have been checked [14:44:35] banyek how's the semisync going? [14:44:46] need any help? [14:45:13] checked all the masters [14:45:19] I put it on operations [14:45:27] https://www.irccloud.com/pastebin/a8pk7Qr3/ [14:45:44] yeah, what about the slaves? [14:46:17] and please don't tell me you manually logged to each one and run commands... [14:46:33] ACTUALLY I did [14:46:45] but I know I need to use cumin for that [14:47:03] banyek: you can use zarcillo for that even [14:47:15] banyek: you have more than 150 slaves to check! 
:p [14:47:50] or even: "section" command [14:48:05] * marostegui <3 section [14:48:19] I will vanish today at 17:00 because I have errands to run [14:48:37] try this: ./software/dbtools/section x1 | while read db; do mysql.py -BN -h $db -e "SELECT @@hostname"; done [14:49:18] you may need the operations/software repo [14:50:46] sure, just keep this in mind for tomorrowP [14:50:47] ^ [14:52:03] I'll do it later today [14:52:08] BUT THIS IS THE VERY MOMENT [14:52:16] for solving this: [14:52:25] https://www.irccloud.com/pastebin/aeYZZMBX/ [14:52:42] we were talking about this [14:53:22] ah [14:53:23] sudo -i [14:53:43] yep [14:54:40] or just ls -l /home/jynus/.my.cnf and do the same [14:56:04] so you still need sudo but without -i, up to you [14:56:25] https://www.irccloud.com/pastebin/NJD0m8HA/ [14:56:44] I'll make this working/and better looking [14:56:54] Maybe today later, maybe tomorrow morning [14:56:56] TIL [14:56:57] thx [14:57:33] now I run [14:57:44] I'll put the paste here and to the operations as well [14:57:51] bye [14:57:53] thanks [14:58:40] np