[04:38:19] 10DBA, 10CheckUser, 03Community-Tech-Sprint: Investigation: Add old and new length columns to cu_changes - https://phabricator.wikimedia.org/T155734#2963906 (10MusikAnimal) [07:56:08] 10DBA, 06Operations, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#2964262 (10Marostegui) >>! In T155769#2962307, @matmarex wrote: >>>! In T155769#2960504, @Marostegui wrote: >> If you guys consider it is safe to delete,... [08:13:16] 10DBA, 13Patch-For-Review: Fix dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T130128#2964286 (10Marostegui) I have deployed the change that split the classes into dbstore (tokuDB) and dbstore2 (InnoDB) [08:14:38] 10DBA, 07Epic, 13Patch-For-Review: Decouple roles from mariadb.pp into their own file - https://phabricator.wikimedia.org/T150850#2964287 (10Marostegui) I have pushed the change that decouples dbstore from mariadb.pp: https://gerrit.wikimedia.org/r/332228 It split it into dbstore (runs tokuDB) and dbstore2 (... [08:15:24] 10DBA, 13Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#2964288 (10Marostegui) gtid_domain_id pushed to dbstore2 (that is dbstore2001 and dbstore2002). [08:47:07] 10DBA, 10Wikidata, 07Performance, 15User-Daniel, and 2 others: Build an environment to test change dispatching using Redis-based locking - https://phabricator.wikimedia.org/T155190#2964337 (10WMDE-leszek) [09:06:37] 10DBA, 06Labs, 10Tool-Labs: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2964387 (10Marostegui) We granted you all on: `p50380g50491_common` the other databases didn't exist on either labs or tool boxes. [09:25:46] 10DBA, 06Operations, 10ops-eqiad: Move db1051 to row B3 - https://phabricator.wikimedia.org/T156004#2964458 (10Marostegui) Alerts silenced for 24 hours - I will re-enable them once the move is done. [09:27:04] 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006#2964460 (10Marostegui) Alerts silenced for 24 hours - I will re-enable them once the move is done. [11:45:51] jynus: yep, we need to decide which one can be db1095 new master, I thought about db1057 once it is no longer a master? [11:47:08] ok [11:48:20] so when we do the switchover, while still in read only and once db1057 isn't the master, we change its binlog to ROW and repoint db1095 [11:49:31] that doesn't work because you need to change db1052 to statement [11:49:58] and db1095 will receive statement events even if stopped [11:50:10] ah, the events… [11:50:18] no, edits [11:50:33] the failover will not be instantaneous [11:50:34] i thought about doing all that while in read only [11:51:15] if not, we can try to select another host to be its master, but there are not many options here…maybe one of the big ones? 
[11:51:31] it doesn't matter, we just need a buffer [11:51:43] temporarelly [11:51:55] we can use db1072 [11:52:02] it doesn't matter if it lags [11:52:18] that's true [11:54:08] let's do that [11:54:21] we need to check if the script to stop the slaves in the same position works [11:54:24] and restart and upgrade db1052 [11:56:31] we can do that manually if needed [11:57:28] stop slave db1052 - wait 5 seconds - stop slave db1073 - start slave on db1052 until db1073's position [11:58:30] yeah, that will work too [11:58:41] dumps seem to have finished [11:58:49] so we can depool it easily [11:58:49] I was checking the script and apparently this would be it: ./repl.pl --stop-siblings-in-sync --host1=db1072.eqiad.wmnet:3306 --host2=db1052.eqiad.wmnet:3306 [11:58:55] yes [11:59:14] let me depool it [11:59:17] which basically does the same [12:00:06] but before doing that we need to change the binlog format [12:00:09] yep [12:00:23] I am depooling db1072 [12:00:25] and make sure it is in use [12:01:29] come on git…be fast [12:02:26] note we should be focusing on depooling things [12:02:35] this can be done after the server movement [12:02:42] i have the patch uploaded to debool db1051 [12:02:51] i was just waiting a bit closer to the date [12:03:02] sure [12:03:17] do you want me to reimage the other api host? [12:03:32] I repooled the one you did yesterday and recovered its normal weight [12:03:55] I do not want to do more depools than necesary now [12:04:03] I can do it after the movement [12:04:16] yes, totally agree :) [12:04:34] btw, I added the racks to each server in s1 (so far) [12:05:17] do we switchover tomorrow at 7am ? [12:06:05] I would like to have db1052 running for at least 24 hours in its new rack, I might be too paranoid [12:06:28] maybe thursday 7am? [12:08:53] ok [12:10:03] let's send an email to ops once we have the movements done to let them know that we are aiming for thursday 7am [12:10:06] ? [12:10:17] ok [12:12:50] should we move away one of the api D1 servers? [12:12:51] D1: Initial commit - https://phabricator.wikimedia.org/D1 [12:13:07] or just stop using it in the future? [12:13:54] I would move one of them out. not nice to have all of them on the same rack [12:14:27] but not using them in the future would be a better long term solution indeed [12:18:01] lets move the racks at the beginning of the comment [12:18:14] sure [12:18:25] let me create the task to move one of the API servers [12:18:26] because of the other comments [12:18:33] any preference? [12:18:40] it is hard to read [12:19:10] maybe we can just swap them for one of the ones we are moving? [12:20:08] sorry, I meant any preference on which server to move from api? [12:20:41] "which server to move from api"? [12:21:21] yes, you suggested to move one of the API servers away as they are all on the same rack (D1), which I agree with, although as D1 isn't under trouble now, there is no need to do it today or tomorrow [12:21:21] D1: Initial commit - https://phabricator.wikimedia.org/D1 [12:21:28] I was just creating the task so we do not forget [12:22:02] I do not know, they are all the same except 66, which is not upgraded [12:22:18] ok, I will just get db1073 then [12:24:21] so what things are needed in the next 2 hours? 
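A minimal sketch of the manual "stop both siblings in sync" procedure described at 11:57 (stop one sibling, stop the other a few seconds later, then run the first up to the second's position). Hostnames and coordinates below are placeholders, and both hosts are assumed to replicate from the same master:

```
-- On db1052 (first sibling):
STOP SLAVE;

-- On db1073 (second sibling), a few seconds later, so it is guaranteed to be
-- at or ahead of db1052's position:
STOP SLAVE;
SHOW SLAVE STATUS\G   -- note Relay_Master_Log_File / Exec_Master_Log_Pos

-- Back on db1052: run up to exactly that point, then stop for good.
START SLAVE UNTIL
  MASTER_LOG_FILE = 'db1057-bin.001234',  -- placeholder: db1073's executed file
  MASTER_LOG_POS  = 232441149;            -- placeholder: db1073's executed position
-- The SQL thread stops itself once the UNTIL position is reached; check that
-- Exec_Master_Log_Pos now matches on both hosts, then:
STOP SLAVE;
```

This is essentially what the repl.pl --stop-siblings-in-sync invocation quoted just above automates.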
[12:24:38] push the patch to depool db1051 [12:24:49] stop replication on db1095 before we move db1052 [12:25:16] silence alerts for db1051 and db1052 (already done) [12:25:24] stop 1051 and 52 [12:25:28] and once the moves are done, we can proceed and reimage the pending api host [12:25:40] ok [12:26:42] then we can point db1095 to the new 52 ip, depool 73, change the format, change the master [12:26:52] yep [12:27:06] 10DBA, 06Operations: Move db1073 to B3 - https://phabricator.wikimedia.org/T156126#2964856 (10Marostegui) [12:30:55] ok, I will have lunch now so I am around later [12:31:10] enjoy! [12:31:26] 10DBA, 06Labs, 06Operations, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2964878 (10Marostegui) [12:31:50] 10DBA, 06Labs, 06Operations, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2961118 (10Marostegui) [12:51:18] 10DBA, 06Labs, 06Operations, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2964931 (10Marostegui) [13:28:43] jynus: tendril needs to be updated manually with the new ip? I saw the dns table but I am unsure whether it is populated automatically or not [13:29:37] I do not think so [13:29:46] I think it does everthing based on dns [13:30:13] ok - I will wait a couple of hours [13:30:16] thanks! [13:30:40] wait for what? [13:30:49] for it to see if it gets updated [13:30:55] (tendril) [13:31:02] I do not understand [13:31:16] what has been changed already? [13:31:16] sorry, i didn't explain myself correctly [13:31:20] db1051 :) [13:31:25] db1051 is back up with its new ip [13:31:30] oh [13:31:35] I didn't notice that [13:31:41] yeah, cmjohnson1 is super fast! [13:31:51] I was updating the ticket [13:31:55] yes, it's there [13:31:58] check tendril [13:32:16] oh [13:32:18] already changed :) [13:33:01] I thought it was going to happen at 14UTC [13:33:10] yeah, but chris was onsite earlier and pinged me [13:33:12] so we went ahead :) [13:33:33] 10DBA, 06Labs, 06Operations, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2965022 (10Marostegui) [13:33:36] 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: Move db1051 to row B3 - https://phabricator.wikimedia.org/T156004#2965019 (10Marostegui) 05Open>03Resolved a:03Cmjohnson db1051 has been moved. DNS updated db-eqiad,codfw files updated mysql and replication started finely. tendril updated Thanks... [13:53:41] marostegui: here? [13:53:45] yep [13:53:47] ok [13:53:55] i was just talking to jynus about the master switches etc for C2 [13:54:32] i understand you guys also discussed it? [13:54:47] yep a bit [13:54:48] should I read your discussion? [13:54:56] you mean about announce it or not, right? [13:55:09] i think if we are relatively sure we can do it in < 2 minutes, we shouldn't need to [13:55:13] we used to do master switches all the time [13:55:22] and I would like us to get back to a place where we do them more often and they aren't a big deal [13:55:26] (ideally also faster) [13:55:44] agreed - my only worry, which I told jynus about, was: what if it goes wrong or takes longer? [13:55:52] things can always go wrong [13:55:55] even now, without us doing anything [13:56:16] are there any specific reasons why we think things may go wrong with a high likelihood? 
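For the "stop replication on db1095 before we move db1052" step in the checklist above, a sketch of pausing and resuming only the s1 connection on the multi-source sanitarium host. The connection name 's1' is only confirmed later in the log, so treat it as an assumption here:

```
-- On db1095, before db1052 is powered off and moved:
STOP SLAVE 's1';

-- After db1052 is back up in its new rack (and DNS resolves to the new IP):
START SLAVE 's1';
SHOW SLAVE 's1' STATUS\G   -- confirm Slave_IO_Running / Slave_SQL_Running = Yes
```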
[13:56:26] the thing is, we have a way to revert quickly [13:56:32] exactly [13:56:37] then let's not make a big deal out of this [13:56:44] we're doing this -exactly- because something else might go wrong anytime [13:56:47] No, I don't think it will go wrong [13:57:06] we can't always announce things a week in advance, that makes us less agile [13:57:17] true [13:57:26] instead, lets make sure we do these more often again, make them quicker, more automated, more routine [13:57:42] so jaime was asking me if we should test the new changes in mediawiki along the way [13:57:47] and I'd say let's -not- do this now [13:57:50] let's instead also do this, later [13:58:04] with or without etcd integration [13:58:14] I agree, the less variables the better (and the faster probably) [13:58:18] exactly [13:58:50] so if we can revert quickly, that seems fine [13:59:47] I think we can yes [14:00:35] cool [14:00:43] then I'd say you guys can do this at your own schedule, outside deployment windows [14:01:02] we were thinking about thursday 7am (planning to send an email a bit later) [14:01:38] ok [14:01:53] 7 am which timezone? [14:02:10] cet [14:02:14] use utc only :) [14:02:18] ah ok :) [14:02:22] 6utc then [14:02:23] I meant 7 UTC [14:02:26] aaah [14:02:29] when I proposed that [14:02:31] haha [14:02:34] haha [14:02:34] XD [14:02:39] that works [14:02:44] i won't be around, i have a dr's appt [14:04:18] no worries [14:10:33] 10DBA, 06Labs, 06Operations, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2965171 (10Marostegui) [14:10:36] 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006#2965168 (10Marostegui) 05Open>03Resolved a:03Cmjohnson db1051 has been moved. DNS updated db-eqiad,codfw files updated mysql and replication started finely. tendril updated thanks... [14:11:36] jynus: db1051 and db1052 have been moved [14:11:49] db1052 was obviously powered off and booted up with no issues, do you still want to reboot it? [14:21:41] also, planning to warm up db1051 like this: https://gerrit.wikimedia.org/r/#/c/333911/ what do you think? [14:27:21] I would like to upgrade it (specially the kernel), holding the mariadb package [14:27:31] and that probably requires reboot [14:27:34] sure [14:28:02] specially important for linked libraries like tls [14:28:05] openssl [14:28:46] sure, sounds good [14:28:54] following mark's advice, we could setup a day for small maintenance windows [14:29:10] every week (not necesarilly to do it every week) [14:29:16] yeah that's not a bad idea [14:29:21] that wasn't my advice though ;) [14:29:25] I know [14:30:43] it is a good idea yeah [14:31:09] specially to research when is the low load for every shard [14:31:40] I think enwiki and commons is at 7UTC [14:34:06] shall I reimage db1066, anything more important? 
[14:34:18] no, go ahead [14:34:29] once it is done, we can work out db1095->db1073 [14:34:44] but let's get that done as it is more important :) [14:34:53] not sure [14:35:00] I can do that later in my day [14:35:12] 95 -> 73 can be done faster [14:35:35] ok, let me repool db1051 (https://gerrit.wikimedia.org/r/#/c/333911/) with lower weight and then we can depool db1073 [14:35:49] ok [14:46:07] 10DBA, 10Wikidata, 13Patch-For-Review, 07Performance, and 3 others: Implement ChangeDispatchCoordinator based on RedisLockManager - https://phabricator.wikimedia.org/T151993#2965254 (10Andrew) [14:46:15] 10DBA, 06Labs, 10Wikidata, 07Performance, and 3 others: Increase quota for wikidata-dev project - https://phabricator.wikimedia.org/T155042#2965251 (10Andrew) 05Open>03Resolved a:03Andrew It looks to me like there's quite a bit of quota room in wikidata-dev already. I increased the max instance coun... [14:46:45] marostegui, did you change anything on dbstore2001 yesterday around 9? [14:47:01] no, yesterday I started to compress s2 on dbstore [14:47:10] ah, ok [14:47:11] but that was around noon if I recall correctly [14:47:17] ok, ok [14:47:19] (it is still on going) [14:47:23] IOPS was over the place [14:47:27] and I was worried [14:47:39] bad buffer pool stats, etc [14:47:43] no problem, then [14:47:44] no, this time it was me :) [14:56:48] 10DBA, 06Labs, 06Operations, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#2965291 (10jcrespo) [14:56:51] 10DBA, 06Labs, 10Tool-Labs: enwiki_p replica on s1 is corrupted - https://phabricator.wikimedia.org/T134203#2965290 (10jcrespo) 05Open>03Resolved [15:00:05] jynus: I need another pair of eyes here, can you check if db1052 and db1073 are stopped on the same position: 232441149 [15:00:39] did you use the script? [15:00:42] yes [15:01:16] 10DBA, 10Analytics, 10Analytics-Cluster: Purge MobileWebWikiGrok_* and MobileWebWikiGrokError_* rows older than 90 days - https://phabricator.wikimedia.org/T77918#2965310 (10Ottomata) 05Open>03declined [15:02:12] but one is using mixed binlog_format [15:02:24] which makes no sense for what you want to do [15:02:27] yes, I haven't touched those yet [15:02:43] then why stop them? [15:02:46] just as a test? [15:02:51] oh - I was testing the script sorry [15:02:52] yes [15:02:56] ok ok [15:03:27] sorry to be asking some many checks, but it is a delicate operation as we can mess up db1095 and I prefer to have another pair of eyes here :) [15:04:42] yes, it worked [15:05:03] ok, going to start replication again and change the binlog to row on db1073 [15:05:05] the only place it doesn't work is on delayed slaves [15:06:16] or very lagging slaves [15:06:25] yeah I can imagine it goes a bit crazy [15:06:53] once we have gtid everywhere, it will be easier [15:07:57] ok, db1073 is now running ROW [15:08:20] check that with mysqlbinlog, reset the file, etc. 
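A sketch of the binlog-format switch and the verification step mentioned at the end here ("check that with mysqlbinlog, reset the file"), using SHOW BINLOG EVENTS instead of the mysqlbinlog CLI; the binlog file name is a placeholder. Note that SET GLOBAL only affects new connections, which is exactly the "cached old format" problem discussed just below:

```
-- On db1073:
SET GLOBAL binlog_format = 'ROW';

-- Rotate so the new file only contains post-change events.
FLUSH BINARY LOGS;
SHOW MASTER STATUS;   -- note the new binlog file name

-- ROW-format writes show up as Table_map + Write_rows/Update_rows events;
-- writes from sessions opened before the change still appear as plain
-- Query (statement) events until those sessions go away or the server restarts.
SHOW BINLOG EVENTS IN 'db1073-bin.003210' LIMIT 20;   -- placeholder file name
```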
[15:08:27] yep, i have flushed the log [15:08:48] I stressed that because doing it hot sometimes has issues [15:09:01] I think running transactions long in the old format [15:09:08] because they have cached the old format [15:09:15] ah, interesting [15:09:27] which is why I didn't want to do everything in a single operation [15:09:34] indeed [15:09:43] *log [15:12:20] still running statement based :) [15:12:27] yes [15:12:49] so I was right to worry :-) [15:12:53] totally right [15:12:57] we might need to restart [15:13:18] btw: https://gerrit.wikimedia.org/r/#/c/333850/ (i included db1073 on the same change) [15:13:50] maybe let's separate the 73 [15:14:04] so we run without hacks [15:14:19] from the 52 change? [15:14:28] you mean split the two commits? [15:14:40] or just applying that, not restart 52 [15:14:42] ? [15:15:01] Ah, I see your point [15:15:03] let's split [15:15:08] let's not mess things yes [15:15:33] let's commit .73 first [15:15:36] and restart 73 [15:15:46] I do not mind [15:15:53] but I like to restart with final configs [15:16:08] so I am 100% I do not get an error in half a year [15:16:14] yep [15:16:14] *sure [15:16:27] let me split db1073 first [15:16:36] so we can restart it, make sure it works and then repoint db1095 [15:16:42] and then restart db1052 with no rush [15:17:05] note those are just my own way of working [15:17:23] because for me it helps me keep track of 10 things at the same time [15:17:30] no, but I agree, I am scared of pushing that and if for whatever reason db1052 gets restarted now…db1095 bye [15:17:46] keep on puppet the real state [15:17:54] rather than the desired state [15:18:19] with very few exceptions like the tls certs, etc. [15:20:09] you abandoned the commit but it was easier to repurpose it [15:20:22] or just create another and rebase the old one [15:20:52] i had a bit of a mess with my branches already, so I didn't want to mess it even more, better a fresh start (at least for my mind) [15:38:20] 10DBA, 06Labs, 10Tool-Labs: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2965491 (10Tb) Apologies, double-underscores seem to get eaten by Phabricator's markup parser. On s1.labsdb: ``` p50380g50491_common p50380g50491__rlrl_enwiki_p p50380g50491__rlrl_ptwiki_p... [15:45:39] 10DBA, 06Labs, 10Tool-Labs: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2965532 (10Marostegui) Thanks for the clarification. I have granted access to those databases. Please check them and let us know if that works! [16:04:15] 10DBA, 06Labs, 10Tool-Labs: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2958263 (10jcrespo) @Marostegui and others ops, grants are wildcards, **never** use _ without escaping (\_) on a grant. It is not a big deal here, but it can lead to security problems. [16:16:33] 10DBA, 06Labs, 10Tool-Labs: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2965610 (10jcrespo) @Tb your grants have been added- you should be able to access old data- however, you should consider those grants temporary, until you rename the databases to start with `...
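On the grants point raised in T155902 above (in a GRANT, the database name is a pattern, so an unescaped underscore matches any character), a minimal illustration; the account and host are placeholders, not the real grantee:

```
-- '_' in the database part of a GRANT is a single-character wildcard:
GRANT ALL PRIVILEGES ON `p50380g50491_common`.*  TO 's51111'@'%';  -- also matches e.g. p50380g50491Xcommon
GRANT ALL PRIVILEGES ON `p50380g50491\_common`.* TO 's51111'@'%';  -- escaped: matches only that database
```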
[16:17:31] jynus: let me know when you are around [16:17:46] I am here [16:17:51] so, https://gerrit.wikimedia.org/r/#/c/333926/2 [16:17:59] yes, merge [16:18:01] I can deploy that, and restart db1072 [16:18:52] I can do that if you want, I am blocked on the reimage for that [16:18:54] note that the host is db1072 (the one that belongs to vslow, dump) not db1073 as I said previously (which is API) [16:19:10] yes, that makes more sense to me [16:19:13] :) [16:19:41] although maybe we should depool it? [16:19:52] i did [16:20:17] looks pooled on: https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php [16:20:36] mmm [16:22:04] check now :) (forgot the fetch+rebase) so many things at the same time [16:22:26] where may I heard that before? [16:22:35] haha [16:26:07] going to restart mysql on db1072 [16:27:26] let's do it fast, I do not like having that role on 80 for long [16:27:32] yep [16:27:51] it is restarting now [16:29:30] would you have chosen an API host to help with vslow? [16:30:00] no, any is a bad idea, I just want it pooled back fastly [16:30:11] bad idea for long time [16:30:13] ok [16:30:28] it tends to create a large ibdata1 [16:30:35] because 24-hour selects [16:30:40] :| [16:30:43] right [16:30:49] (undo section) [16:31:01] the less times, the less likely we have such issues [16:31:10] *time it is there [16:31:29] sure [16:31:35] still restarting [16:32:08] will prepare the revert meanwhile [16:34:05] starting mysql now [16:34:48] which version was that running? [16:35:02] .23 [16:35:08] ok [16:37:28] the binlog looks good now [16:38:11] nice [16:38:17] sorry for so much pain [16:38:21] let's stop both hosts [16:38:24] it will be easier in the future [16:38:34] as we are aiming for row (maybe) [16:38:34] it is not your fault! :) [16:38:38] 10DBA, 06Labs, 10Tool-Labs: enwiki_p replica on s1 is corrupted - https://phabricator.wikimedia.org/T134203#2965716 (10Superyetkin) Could you please share the configuration details of new servers? Most of my tools [[http://tools.wmflabs.org/superyetkin/kategorisizsayfalar.php | like this]] (running on trwiki... [16:43:55] jynus: both servers stopped at 477438361 - please double check [16:44:23] 10DBA, 06Labs, 10Tool-Labs: enwiki_p replica on s1 is corrupted - https://phabricator.wikimedia.org/T134203#2965719 (10jcrespo) @Superyetkin I cannot guarantee it will not change in the future, but you can connect, in the case of **enwiki** to the `labsdb-web.eqiad.wmnet` host for short-lived, web-like reque... [16:44:52] if that is ok, this is what I would run on db1095: https://phabricator.wikimedia.org/P4795 [16:45:23] no [16:45:24] that is wrong [16:45:36] yes, it is wrong [16:45:42] you want the master coordinates [16:45:42] just changed [16:45:44] yep [16:45:48] just changed it [16:46:31] it should use ssl, not sure if it already done, so it is not needed [16:47:24] db1095 isn't using ssl at the moment [16:48:00] ok, adding MASTER_SSL=1 should fix that [16:48:07] at the end [16:48:28] just added, refresh :) [16:48:30] it doesn't hurt, if there is any problem, it will complain [16:48:44] just start the io_thread first [16:48:52] (shouln't be needed anyway) [16:49:13] looks good [16:49:45] ok [16:49:47] let's go then [16:51:45] io thread running [16:52:00] let's go for sql [16:52:43] 10DBA, 06Labs, 10Wikidata, 07Performance, and 3 others: Increase quota for wikidata-dev project - https://phabricator.wikimedia.org/T155042#2965728 (10Ladsgroup) We cleaned some instances and it's okay now. Probably we will make more soon. 
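The paste discussed above (P4795) is not reproduced in the log, but from the conversation (use the new master's own coordinates, add MASTER_SSL=1, start the IO thread first) it would look roughly like this; user, password and coordinates are placeholders, and on a multi-source host each statement would also carry the connection name:

```
-- On db1095, pointing it at db1072 (all values below are placeholders).
CHANGE MASTER TO
  MASTER_HOST = 'db1072.eqiad.wmnet',
  MASTER_PORT = 3306,
  MASTER_USER = 'repl',                    -- placeholder replication account
  MASTER_PASSWORD = '********',
  MASTER_LOG_FILE = 'db1072-bin.003214',   -- db1072's SHOW MASTER STATUS file...
  MASTER_LOG_POS  = 38678449,              -- ...and position
  MASTER_SSL = 1;

-- Start the IO thread alone first: a TLS or credential problem fails here
-- before any event is applied.
START SLAVE IO_THREAD;
SHOW SLAVE STATUS\G
-- If Slave_IO_Running = Yes, start applying events:
START SLAVE SQL_THREAD;
```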
[16:52:56] both running [16:53:04] let's start replication on db1072 then? [16:54:47] yes [16:54:54] let's go then [16:55:15] done [16:55:29] looking good [16:55:40] crashed [16:55:53] what? [16:55:58] dup entry [16:56:09] are you for real? [16:56:22] yes :( [16:56:45] well, you know what that means [16:56:50] i know yes :( [16:57:55] i was kind of expecting it to be honest [16:58:04] why? [16:58:21] i wasn't completely convinced that db1072 and db1052 would have exactly the same data [16:58:49] you think it is not a coordinate problem? [16:59:03] let's try to get rid of this row [16:59:06] and see what happens [17:00:49] wait [17:00:53] sure [17:00:56] which entry did you move to? [17:01:07] i haven't deleted anything [17:02:19] I mean you changed to log pos 42375201 [17:02:24] no [17:02:28] 38678449 [17:02:34] yes [17:03:00] yeah, that is 4 MBs of changes [17:05:44] it is also the great change_tag table, no PK there [17:09:25] it could be just a "simple" schema change drift [17:09:52] I see the same insert twice [17:10:03] on the binlog [17:10:05] interesting... [17:10:43] and probably it is that [17:10:46] is it exactly the same? [17:10:50] not at all [17:10:54] one has unique keys [17:10:58] the other doesn't [17:11:26] have you altered data or anything yet? [17:11:32] no [17:11:39] nothing [17:11:49] this is a mess [17:11:58] and that is why master changes are a problem [17:12:08] not because operational problems [17:12:11] content is a mess [17:12:39] yes :( [17:12:56] so s1-master contains the unique keys [17:13:08] I do not get it [17:13:16] db1052 also has unique keys [17:13:19] maybe it was an insert ignore? [17:13:51] run: /opt/wmf-mariadb10/bin/mysqlbinlog --no-defaults /srv/sqldata/db1072-bin.003214 --start-position=42374334 -vv --base64-output=DECODE-ROWS | less [17:13:55] on db1072 [17:14:06] s1-master unique keys, db1052 - unique keys, db1072 - no unique keys - db1095 - unique keys [17:14:06] so I am not imagining things [17:14:23] let me check [17:14:47] search for key 905199194 [17:15:26] 905199194-mobile edit is there twice, isn't it? [17:15:28] it is the same [17:15:42] how can that be on the binary log? [17:17:01] i was checking db1072 slave_exec_mode just in case [17:17:04] but it is strict [17:17:41] I am not imagining things [17:17:46] no [17:17:48] it is the same insert [17:17:57] SELECT * FROM change_tag WHERE ct_rc_id=905199194 and ct_tag='mobile edit'; [17:18:00] run that^ [17:18:06] on all servers involved [17:18:17] there are 2, on db1072 [17:18:27] and none on db1052 [17:18:28] which is by schema impossible [17:18:29] and db1095 [17:18:45] well, it can have been deleted later [17:19:02] but there are 2, and it is impossible for 2 to be there on other servers [17:19:11] by schema [17:19:26] it is super weird [17:19:29] I don't get it [17:20:21] well, it is what happens when you do not have primary keys [17:20:40] and different schemas on every server [17:20:48] :( [17:21:08] what if we add the unique keys now to db1072? [17:21:14] well, we can have loads of dup rows [17:21:19] well, we should [17:21:39] but since replica broke till now we do not know what will happen [17:22:30] those two rows aren't present on db1052 [17:22:45] is 52 slave stopped? [17:22:56] yep [17:23:22] ok, then drop rows, let's add the unique keys and pray [17:23:27] xdddd [17:24:26] we do not want to replicate this, right? [17:24:27] it is only 25 million tags [17:24:35] *rows on tag [17:24:37] yes, it is 5G table [17:24:39] should be fast?
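A sketch of the checks and of the "drop rows, add the unique keys" idea being weighed above. The DELETE is purely illustrative (change_tag has no primary key, so LIMIT is used to remove just one of two identical rows); in the end db1072 was set aside rather than fixed this way:

```
-- Duplicate (ct_rc_id, ct_tag) pairs — the query run above:
SELECT ct_rc_id, ct_tag, COUNT(*) AS n
FROM change_tag
GROUP BY ct_rc_id, ct_tag
HAVING COUNT(*) > 1;

-- Any local cleanup must not be written to the binlog, since db1095
-- replicates from this host:
SET SESSION sql_log_bin = 0;

-- Illustrative: remove one of the two identical rows for a single pair.
DELETE FROM change_tag
WHERE ct_rc_id = 905199194 AND ct_tag = 'mobile edit'
LIMIT 1;

-- Then bring the schema back in line with the other s1 hosts:
ALTER TABLE change_tag ADD UNIQUE KEY `ct_rc_id` (`ct_rc_id`, `ct_tag`);

SET SESSION sql_log_bin = 1;
```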
[17:24:39] not too big [17:24:42] i think so [17:24:48] yeah, the usual script [17:24:57] it is depooled anyway [17:25:04] no, I mean to delete the rows [17:25:09] we don't want to replicate that delete [17:25:44] well, let's try the alter first [17:25:48] see where it complains [17:25:52] ok [17:25:59] let's try one by one key then [17:26:01] but no binlog, too [17:27:29] I'm running "SELECT ct_rc_id, ct_tag FROM change_tag GROUP BY ct_rc_id, ct_tag HAVING count(*) > 1" [17:28:01] i am going to run (with the script no replicate): alter table change_tag add UNIQUE KEY `ct_rc_id` (`ct_rc_id`,`ct_tag`); [17:28:14] 66653 rows in set (36.30 sec) [17:28:18] so don't bother [17:28:24] pf [17:29:25] let's move db1095 to another host [17:29:49] let's see what we have [17:30:09] the above query on 52 gives 5 results [17:30:21] but id null [17:30:28] which may or may not make sense [17:30:43] and then you ask me why I want to move to row based replication? [17:30:47] :-) [17:30:49] haha [17:31:03] that query on db1095 gives 5 results too [17:31:27] or why I want to make regular runs of pt-table-checksum [17:31:53] ˜/jynus 18:29> let's move db1095 to another host -> what did you mean with that? [17:31:59] its master [17:32:30] sorry, i am lost now, how would you start its replication from another random host? [17:32:46] what do you mean how? [17:32:59] we have its master corrdinates [17:33:05] not only we have the gtid coords [17:33:17] but we can match them to another slave even without that [17:33:41] oh, gtid yes [17:33:47] even without that [17:34:30] ok [17:34:30] so [17:34:49] it is stopped at db1072-bin.003214:42374334 [17:34:59] we need to decide if we want an api host or a main traffic one, there are no more options [17:35:10] whatever works [17:35:17] which has a sane change_Tag structure [17:35:28] db1066 does [17:37:31] db1066 gives 5 rows for: SELECT ct_rc_id, ct_tag FROM change_tag GROUP BY ct_rc_id, ct_tag HAVING count(*) > 1 [17:37:38] so looks like db1052 in that regard [17:37:59] shall we go for db1066? [17:39:03] 10.016 [17:39:14] let's use the other, 65 [17:39:23] i am running the query there [17:39:34] let me prepare the puppet patch for it (I am reverting db1072's one) [17:40:26] db1065 gives 5 results too [17:40:35] the same as db1095 and db1052 [17:42:01] let me confirm the rows are not new [17:42:19] and that we didn't break db1072 today [17:42:57] they shouldn't keep in mind db1052 is stopped [17:43:19] no, what I mean is [17:43:49] there is a change we broke it by activating row binlog_format [17:43:53] I want to discard that [17:44:00] mmm [17:44:04] before we break more host, if we did [17:44:11] ok ok [17:45:37] also, this table makes no sense [17:45:43] it stores rc_ids [17:45:51] but those disappear [17:46:33] nah, I can see this happening since 20161226064653 [17:46:54] and much before, but rows on recentchanges have already been deleted [17:47:38] we should file a bug, independently of the schema [17:47:47] this is probably executing unsage statements [17:48:27] like insert ignore or replaces [17:48:34] nah, I can see this happening since 20161226064653 -> :o [17:49:24] so, ok, let's move the slave to 65 [17:49:35] let's consider 72 broken [17:51:38] ok [17:51:51] let me push the patch so we can forget about .72 [17:52:00] (the puppet patch) [17:52:01] which one? [17:52:08] to get it back to MIXED binlog [17:52:19] 10DBA, 06Labs, 10Tool-Labs: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2965899 (10Tb) Great thanks. 
although I missed one in the list above; can you grant all to s51111 on p50380g50491_inconsistent_redirects on s1.labsdb also please. [17:52:21] ok, but forget about that [17:52:47] if it is broken it is not a priority [17:53:17] so we need to chose a host for vslow [17:53:34] I would actually start by restarting replication on db1052 [17:53:41] and moving the slave back there [17:53:53] so we have extra available slaves [17:54:39] or should we be ok? [17:54:58] no, i think that is a good idea [17:55:08] maybe we are ok [17:55:13] let's depool 65 [17:55:19] ok, let me do it [17:55:21] we still have 2 apis [17:55:28] restart it [17:55:49] in row [17:56:04] ok,. will depool and restart it in row [17:56:13] you take care of db1052? [17:56:24] well, it is ok like that [17:56:30] for now [17:56:45] you can do mediawiki [17:56:47] I will do puppet [17:56:51] ok [17:57:05] let's sync here [17:57:11] sure [18:03:03] db1065 depooled [18:03:29] so I am not sure about the order [18:03:44] I think we can do https://gerrit.wikimedia.org/r/333952 now? [18:03:49] checking [18:04:07] https://gerrit.wikimedia.org/r/333953 too? [18:04:09] yes, i think it is safe to do that [18:04:18] yep [18:04:24] i think we can go for both of them [18:04:27] will merge both, then [18:04:31] I have silenced alerts for db1065 [18:04:40] thanks, I had forgotten [18:04:49] db1072 is also silenced [18:08:31] ok, done [18:08:51] so next thing is to restart db1065 [18:08:57] I will take care of that [18:09:01] do you restart db1072? [18:09:12] no [18:09:15] not yet [18:09:19] ok [18:09:26] I do not like to have the master not on row [18:09:37] even if it is stopped or literally broken :-) [18:09:39] yeah [18:09:41] haha [18:09:44] step by step [18:09:47] running puppet on db1065 and then will restart mysql [18:09:54] again, that is only to help my workflow [18:10:01] not a general advice [18:10:08] prevents me from making mistakes [18:10:24] I already run puppet [18:10:33] when I meant merged I mean also locally applied [18:10:42] and checked the diff was done [18:10:46] yeah, i saw no changes [18:10:48] I am stopping mysql now [18:11:07] I double checked the successful depool [18:11:16] thanks :) [18:13:15] starting mysql [18:13:53] ok, so we have gtids, but I want to double check the coordinates [18:14:21] db1065's looks good, row based [18:15:00] I think after this, we should not wonder why labs was broken [18:15:13] but how production was not more broken than now [18:15:16] yeah, it is a good practical execercise [18:15:20] excercise [18:15:31] I had to disagree [18:15:34] *have [18:15:41] haha [18:15:50] you know what I meant!! :) [18:16:53] let me share an etherpad with you so you can see my calculations [18:16:59] ok [18:17:19] not for anything in particular, I just do not want you to be there 5 minutes [18:17:25] waiting [18:55:18] ok [18:55:25] so let's move db1095 [18:55:31] ok [18:55:41] let's go for it then after all the archeology [18:55:58] hopefully db1052 will catch up soon [18:56:09] and we can move the slave again, too [18:56:22] it is having a good pace [18:56:38] or we can do archeology again :-) [18:56:44] haha [18:56:54] i rather have dinner :p [18:57:04] it is catching up quite quickly [18:57:15] the script, btw, allos to change a master to sibling and viceversa [18:57:28] the repl.pl? 
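For the "double check the coordinates" step above, a sketch of the standard MariaDB status views one would compare (the etherpad with the actual calculations is not part of the log):

```
-- On db1095 (the replica being repointed):
SHOW ALL SLAVES STATUS\G                 -- per-connection Exec_Master_Log_Pos, Gtid_IO_Pos
SELECT @@GLOBAL.gtid_slave_pos;          -- what it has executed, as GTIDs

-- On db1065 (the candidate new master):
SHOW MASTER STATUS;                      -- current binlog file / position
SELECT @@GLOBAL.gtid_binlog_pos,         -- what it has written to its own binlog
       @@GLOBAL.gtid_slave_pos;          -- what it has itself executed from s1
```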
[18:57:29] nice [18:57:32] which can be used to move a slave by doing it twice [18:57:36] but [18:57:44] I have not tested since we applied gtid and tls [18:57:47] may need update [18:58:09] we cannot use it here because we only have 2 servers with tow [18:58:11] *row [18:58:20] so we cannot move it in 2 steps [18:58:32] but we can use the "stop in sync" [18:58:37] once it catches up [18:59:03] we can move db1095 now, though [18:59:09] we do not need to wait [18:59:23] i will check that alert [19:03:03] 10DBA: db2060 not accessible - https://phabricator.wikimedia.org/T156161#2966076 (10Marostegui) [19:03:59] 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: db2060 crashed (RAID controller) - https://phabricator.wikimedia.org/T154031#2966089 (10jcrespo) [19:04:01] 10DBA: db2060 not accessible - https://phabricator.wikimedia.org/T156161#2966088 (10jcrespo) [19:04:58] jynus: db1052 caught up [19:05:56] it doesnt matter, let's run the change master [19:06:53] * marostegui crosses his fingers [19:07:00] should I? [19:07:03] sure [19:08:53] i see it going thru [19:09:30] looks good [19:09:31] yeah [19:09:42] yes :) [19:09:58] look at the bright side [19:10:06] we didn't chose db1052 [19:10:09] I meant [19:10:12] db1072 [19:10:22] as the next master [19:10:28] shit [19:10:31] now you really scaredme [19:10:34] scared me [19:10:36] :| [19:10:40] why? [19:10:47] imagine if we had chosen it! [19:10:58] well, all replicas broken on production at the same time [19:11:09] it wouldn't have been the first time :-) [19:11:12] come on…only s1 :p [19:11:27] well, "I" broke commons [19:11:32] so no big deal [19:11:37] haha [19:11:45] what happened? [19:11:53] commons is more broken than enwiki [19:12:23] I think we will be ok, becase this is only 1 host, and it should only affect duplicates [19:12:48] commons archive is, or was broken when copying rows to image [19:12:52] and viceversa [19:13:05] sounds like a great way of breaking things [19:15:14] so if 52 and 65 are up to date [19:15:25] we can run the same repl command that this morning [19:15:36] but for 52 and 65 [19:15:41] do you have that handy? [19:16:00] if not, I can set it up [19:16:31] yes [19:16:43] we need to make sure however [19:16:57] that db1095 is up to date [19:17:04] ./repl.pl --stop-siblings-in-sync --host1=db1065.eqiad.wmnet:3306 --host2=db1052.eqiad.wmnet:3306 [19:17:07] yes [19:17:08] it will complain anyways [19:17:09] if it is delayed [19:17:15] no no [19:17:17] 95 [19:17:44] sanitarium [19:17:57] ˜/jynus 20:15> but for 52 and 65 ->? [19:18:09] we need to stop db1052 and db1065 at the same time, right? 
[19:18:13] the repl run is ok [19:18:14] yes [19:18:17] aaah [19:18:18] yes yes [19:18:18] and you can run it now [19:18:20] i know what you mean [19:18:23] I mean for the next step [19:18:27] yeah yeah [19:18:34] 95 is now 1 hour delayed [19:18:39] we need to wait for db1095 to catch up before run the master change [19:18:45] when you did it this morning [19:18:53] it was almost instant [19:19:25] just finished it [19:19:33] db1052 and db1065 stopped [19:19:43] so now wait [19:19:48] I can take it from here [19:20:12] I was going to reimage 66 after that [19:20:56] haha yeah, we were talking about db1066 3 hours ago XD [19:21:03] so, both are stopped at: 825236309 [19:21:07] db1065 and db1052 [19:21:22] SHOW MASTER STATUS is what we want [19:21:26] we can get it now [19:21:45] i am going to grab some dinner and rest a bit, the day is almost over :_( [19:21:53] please call me if needed [19:21:58] db1065-bin.003539 | 123581041 [19:22:04] ^just confirm [19:22:08] and bye [19:22:09] confirmed [19:22:18] it looks good [19:22:40] ok, see you tomorrow [19:22:47] remember the change master 's1' on db1095 (I am saying it because while setting up dbstore2001 i tended to forget it :) ) [19:22:57] it complains [19:23:03] so I am not too worried [19:23:06] :) [19:23:21] will see you tomorrow, please, call me if you need anything [19:23:28] thanks for all the help [19:40:32] 10DBA, 10MediaWiki-Change-tagging, 06Operations: db1072 change_tag schema and dataset is not consistent - https://phabricator.wikimedia.org/T156166#2966240 (10jcrespo) [19:44:01] 10DBA, 10MediaWiki-Change-tagging, 06Operations: db1072 change_tag schema and dataset is not consistent - https://phabricator.wikimedia.org/T156166#2966280 (10jcrespo) Adding @TTO and @Cenarium because they may know the actual right people to add to this ticket (probably not them) for the mediawiki bug side...
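A sketch of the final repoint mentioned at the end ("remember the change master 's1' on db1095"), applicable once db1052 and db1065 were stopped in sync and db1095 had caught up to that same point. The file and position are the ones quoted above; the replication account and password are placeholders:

```
-- On db1095, move the s1 connection from db1072 to db1065.
STOP SLAVE 's1';
CHANGE MASTER 's1' TO
  MASTER_HOST = 'db1065.eqiad.wmnet',
  MASTER_PORT = 3306,
  MASTER_USER = 'repl',                    -- placeholder replication account
  MASTER_PASSWORD = '********',
  MASTER_LOG_FILE = 'db1065-bin.003539',   -- db1065's SHOW MASTER STATUS, as confirmed above
  MASTER_LOG_POS  = 123581041,
  MASTER_SSL = 1;
START SLAVE 's1' IO_THREAD;
SHOW SLAVE 's1' STATUS\G
START SLAVE 's1' SQL_THREAD;
```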