[04:38:19] 10DBA, 10CheckUser, 03Community-Tech-Sprint: Investigation: Add old and new length columns to cu_changes - https://phabricator.wikimedia.org/T155734#2963906 (10MusikAnimal) [07:56:08] 10DBA, 06Operations, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#2964262 (10Marostegui) >>! In T155769#2962307, @matmarex wrote: >>>! In T155769#2960504, @Marostegui wrote: >> If you guys consider it is safe to delete,... [08:13:16] 10DBA, 13Patch-For-Review: Fix dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T130128#2964286 (10Marostegui) I have deployed the change that split the classes into dbstore (tokuDB) and dbstore2 (InnoDB) [08:14:38] 10DBA, 07Epic, 13Patch-For-Review: Decouple roles from mariadb.pp into their own file - https://phabricator.wikimedia.org/T150850#2964287 (10Marostegui) I have pushed the change that decouples dbstore from mariadb.pp: https://gerrit.wikimedia.org/r/332228 It split it into dbstore (runs tokuDB) and dbstore2 (... [08:15:24] 10DBA, 13Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#2964288 (10Marostegui) gtid_domain_id pushed to dbstore2 (that is dbstore2001 and dbstore2002). [08:47:07] 10DBA, 10Wikidata, 07Performance, 15User-Daniel, and 2 others: Build an environment to test change dispatching using Redis-based locking - https://phabricator.wikimedia.org/T155190#2964337 (10WMDE-leszek) [09:06:37] 10DBA, 06Labs, 10Tool-Labs: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2964387 (10Marostegui) We granted you all on: `p50380g50491_common` the other databases didn't exist on either labs or tool boxes. [09:25:46] 10DBA, 06Operations, 10ops-eqiad: Move db1051 to row B3 - https://phabricator.wikimedia.org/T156004#2964458 (10Marostegui) Alerts silenced for 24 hours - I will re-enable them once the move is done. [09:27:04] 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006#2964460 (10Marostegui) Alerts silenced for 24 hours - I will re-enable them once the move is done. [11:45:51] jynus: yep, we need to decide which one can be db1095 new master, I thought about db1057 once it is no longer a master? [11:47:08] ok [11:48:20] so when we do the switchover, while still in read only and once db1057 isn't the master, we change its binlog to ROW and repoint db1095 [11:49:31] that doesn't work because you need to change db1052 to statement [11:49:58] and db1095 will receive statement events even if stopped [11:50:10] ah, the events… [11:50:18] no, edits [11:50:33] the failover will not be instantaneous [11:50:34] i thought about doing all that while in read only [11:51:15] if not, we can try to select another host to be its master, but there are not many options here…maybe one of the big ones? 
[11:51:31] it doesn't matter, we just need a buffer [11:51:43] temporarelly [11:51:55] we can use db1072 [11:52:02] it doesn't matter if it lags [11:52:18] that's true [11:54:08] let's do that [11:54:21] we need to check if the script to stop the slaves in the same position works [11:54:24] and restart and upgrade db1052 [11:56:31] we can do that manually if needed [11:57:28] stop slave db1052 - wait 5 seconds - stop slave db1073 - start slave on db1052 until db1073's position [11:58:30] yeah, that will work too [11:58:41] dumps seem to have finished [11:58:49] so we can depool it easily [11:58:49] I was checking the script and apparently this would be it: ./repl.pl --stop-siblings-in-sync --host1=db1072.eqiad.wmnet:3306 --host2=db1052.eqiad.wmnet:3306 [11:58:55] yes [11:59:14] let me depool it [11:59:17] which basically does the same [12:00:06] but before doing that we need to change the binlog format [12:00:09] yep [12:00:23] I am depooling db1072 [12:00:25] and make sure it is in use [12:01:29] come on git…be fast [12:02:26] note we should be focusing on depooling things [12:02:35] this can be done after the server movement [12:02:42] i have the patch uploaded to debool db1051 [12:02:51] i was just waiting a bit closer to the date [12:03:02] sure [12:03:17] do you want me to reimage the other api host? [12:03:32] I repooled the one you did yesterday and recovered its normal weight [12:03:55] I do not want to do more depools than necesary now [12:04:03] I can do it after the movement [12:04:16] yes, totally agree :) [12:04:34] btw, I added the racks to each server in s1 (so far) [12:05:17] do we switchover tomorrow at 7am ? [12:06:05] I would like to have db1052 running for at least 24 hours in its new rack, I might be too paranoid [12:06:28] maybe thursday 7am? [12:08:53] ok [12:10:03] let's send an email to ops once we have the movements done to let them know that we are aiming for thursday 7am [12:10:06] ? [12:10:17] ok [12:12:50] should we move away one of the api D1 servers? [12:12:51] D1: Initial commit - https://phabricator.wikimedia.org/D1 [12:13:07] or just stop using it in the future? [12:13:54] I would move one of them out. not nice to have all of them on the same rack [12:14:27] but not using them in the future would be a better long term solution indeed [12:18:01] lets move the racks at the beginning of the comment [12:18:14] sure [12:18:25] let me create the task to move one of the API servers [12:18:26] because of the other comments [12:18:33] any preference? [12:18:40] it is hard to read [12:19:10] maybe we can just swap them for one of the ones we are moving? [12:20:08] sorry, I meant any preference on which server to move from api? [12:20:41] "which server to move from api"? [12:21:21] yes, you suggested to move one of the API servers away as they are all on the same rack (D1), which I agree with, although as D1 isn't under trouble now, there is no need to do it today or tomorrow [12:21:21] D1: Initial commit - https://phabricator.wikimedia.org/D1 [12:21:28] I was just creating the task so we do not forget [12:22:02] I do not know, they are all the same except 66, which is not upgraded [12:22:18] ok, I will just get db1073 then [12:24:21] so what things are needed in the next 2 hours? 
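A minimal sketch of the manual "stop both siblings in sync" procedure described at 11:57 (stop one sibling, stop the other a few seconds later, then run the first up to the second's position). Hostnames and coordinates below are placeholders, and both hosts are assumed to replicate from the same master:

```
-- On db1052 (first sibling):
STOP SLAVE;

-- On db1073 (second sibling), a few seconds later, so it is guaranteed to be
-- at or ahead of db1052's position:
STOP SLAVE;
SHOW SLAVE STATUS\G   -- note Relay_Master_Log_File / Exec_Master_Log_Pos

-- Back on db1052: run up to exactly that point, then stop for good.
START SLAVE UNTIL
  MASTER_LOG_FILE = 'db1057-bin.001234',  -- placeholder: db1073's executed file
  MASTER_LOG_POS  = 232441149;            -- placeholder: db1073's executed position
-- The SQL thread stops itself once the UNTIL position is reached; check that
-- Exec_Master_Log_Pos now matches on both hosts, then:
STOP SLAVE;
```

This is essentially what the repl.pl --stop-siblings-in-sync invocation quoted just above automates.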
[12:24:38] push the patch to depool db1051 [12:24:49] stop replication on db1095 before we move db1052 [12:25:16] silence alerts for db1051 and db1052 (already done) [12:25:24] stop 1051 and 52 [12:25:28] and once the moves are done, we can proceed and reimage the pending api host [12:25:40] ok [12:26:42] then we can point db1095 to the new 52 ip, depool 73, change the format, change the master [12:26:52] yep [12:27:06] 10DBA, 06Operations: Move db1073 to B3 - https://phabricator.wikimedia.org/T156126#2964856 (10Marostegui) [12:30:55] ok, I will have lunch now so I am around later [12:31:10] enjoy! [12:31:26] 10DBA, 06Labs, 06Operations, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2964878 (10Marostegui) [12:31:50] 10DBA, 06Labs, 06Operations, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2961118 (10Marostegui) [12:51:18] 10DBA, 06Labs, 06Operations, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2964931 (10Marostegui) [13:28:43] jynus: tendril needs to be updated manually with the new ip? I saw the dns table but I am unsure whether it is populated automatically or not [13:29:37] I do not think so [13:29:46] I think it does everthing based on dns [13:30:13] ok - I will wait a couple of hours [13:30:16] thanks! [13:30:40] wait for what? [13:30:49] for it to see if it gets updated [13:30:55] (tendril) [13:31:02] I do not understand [13:31:16] what has been changed already? [13:31:16] sorry, i didn't explain myself correctly [13:31:20] db1051 :) [13:31:25] db1051 is back up with its new ip [13:31:30] oh [13:31:35] I didn't notice that [13:31:41] yeah, cmjohnson1 is super fast! [13:31:51] I was updating the ticket [13:31:55] yes, it's there [13:31:58] check tendril [13:32:16] oh [13:32:18] already changed :) [13:33:01] I thought it was going to happen at 14UTC [13:33:10] yeah, but chris was onsite earlier and pinged me [13:33:12] so we went ahead :) [13:33:33] 10DBA, 06Labs, 06Operations, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2965022 (10Marostegui) [13:33:36] 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: Move db1051 to row B3 - https://phabricator.wikimedia.org/T156004#2965019 (10Marostegui) 05Open>03Resolved a:03Cmjohnson db1051 has been moved. DNS updated db-eqiad,codfw files updated mysql and replication started finely. tendril updated Thanks... [13:53:41] marostegui: here? [13:53:45] yep [13:53:47] ok [13:53:55] i was just talking to jynus about the master switches etc for C2 [13:54:32] i understand you guys also discussed it? [13:54:47] yep a bit [13:54:48] should I read your discussion? [13:54:56] you mean about announce it or not, right? [13:55:09] i think if we are relatively sure we can do it in < 2 minutes, we shouldn't need to [13:55:13] we used to do master switches all the time [13:55:22] and I would like us to get back to a place where we do them more often and they aren't a big deal [13:55:26] (ideally also faster) [13:55:44] agreed - my only worry, which I told jynus about, was: what if it goes wrong or takes longer? [13:55:52] things can always go wrong [13:55:55] even now, without us doing anything [13:56:16] are there any specific reasons why we think things may go wrong with a high likelihood? 
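For the "stop replication on db1095 before we move db1052" step in the checklist above, a sketch of pausing and resuming only the s1 connection on the multi-source sanitarium host. The connection name 's1' is only confirmed later in the log, so treat it as an assumption here:

```
-- On db1095, before db1052 is powered off and moved:
STOP SLAVE 's1';

-- After db1052 is back up in its new rack (and DNS resolves to the new IP):
START SLAVE 's1';
SHOW SLAVE 's1' STATUS\G   -- confirm Slave_IO_Running / Slave_SQL_Running = Yes
```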
[13:56:26] the thing is, we have a way to revert quickly [13:56:32] exactly [13:56:37] then let's not make a big deal out of this [13:56:44] we're doing this -exactly- because something else might go wrong anytime [13:56:47] No, I don't think it will go wrong [13:57:06] we can't always announce things a week in advance, that makes us less agile [13:57:17] true [13:57:26] instead, lets make sure we do these more often again, make them quicker, more automated, more routine [13:57:42] so jaime was asking me if we should test the new changes in mediawiki along the way [13:57:47] and I'd say let's -not- do this now [13:57:50] let's instead also do this, later [13:58:04] with or without etcd integration [13:58:14] I agree, the less variables the better (and the faster probably) [13:58:18] exactly [13:58:50] so if we can revert quickly, that seems fine [13:59:47] I think we can yes [14:00:35] cool [14:00:43] then I'd say you guys can do this at your own schedule, outside deployment windows [14:01:02] we were thinking about thursday 7am (planning to send an email a bit later) [14:01:38] ok [14:01:53] 7 am which timezone? [14:02:10] cet [14:02:14] use utc only :) [14:02:18] ah ok :) [14:02:22] 6utc then [14:02:23] I meant 7 UTC [14:02:26] aaah [14:02:29] when I proposed that [14:02:31] haha [14:02:34] haha [14:02:34] XD [14:02:39] that works [14:02:44] i won't be around, i have a dr's appt [14:04:18] no worries [14:10:33] 10DBA, 06Labs, 06Operations, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2965171 (10Marostegui) [14:10:36] 10DBA, 06Operations, 10ops-eqiad, 13Patch-For-Review: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006#2965168 (10Marostegui) 05Open>03Resolved a:03Cmjohnson db1051 has been moved. DNS updated db-eqiad,codfw files updated mysql and replication started finely. tendril updated thanks... [14:11:36] jynus: db1051 and db1052 have been moved [14:11:49] db1052 was obviously powered off and booted up with no issues, do you still want to reboot it? [14:21:41] also, planning to warm up db1051 like this: https://gerrit.wikimedia.org/r/#/c/333911/ what do you think? [14:27:21] I would like to upgrade it (specially the kernel), holding the mariadb package [14:27:31] and that probably requires reboot [14:27:34] sure [14:28:02] specially important for linked libraries like tls [14:28:05] openssl [14:28:46] sure, sounds good [14:28:54] following mark's advice, we could setup a day for small maintenance windows [14:29:10] every week (not necesarilly to do it every week) [14:29:16] yeah that's not a bad idea [14:29:21] that wasn't my advice though ;) [14:29:25] I know [14:30:43] it is a good idea yeah [14:31:09] specially to research when is the low load for every shard [14:31:40] I think enwiki and commons is at 7UTC [14:34:06] shall I reimage db1066, anything more important? 
[14:34:18] no, go ahead [14:34:29] once it is done, we can work out db1095->db1073 [14:34:44] but let's get that done as it is more important :) [14:34:53] not sure [14:35:00] I can do that later in my day [14:35:12] 95 -> 73 can be done faster [14:35:35] ok, let me repool db1051 (https://gerrit.wikimedia.org/r/#/c/333911/) with lower weight and then we can depool db1073 [14:35:49] ok [14:46:07] 10DBA, 10Wikidata, 13Patch-For-Review, 07Performance, and 3 others: Implement ChangeDispatchCoordinator based on RedisLockManager - https://phabricator.wikimedia.org/T151993#2965254 (10Andrew) [14:46:15] 10DBA, 06Labs, 10Wikidata, 07Performance, and 3 others: Increase quota for wikidata-dev project - https://phabricator.wikimedia.org/T155042#2965251 (10Andrew) 05Open>03Resolved a:03Andrew It looks to me like there's quite a bit of quota room in wikidata-dev already. I increased the max instance coun... [14:46:45] marostegui, did you change anything on dbstore2001 yesterday around 9? [14:47:01] no, yesterday I started to compress s2 on dbstore [14:47:10] ah, ok [14:47:11] but that was around noon if I recall correctly [14:47:17] ok, ok [14:47:19] (it is still on going) [14:47:23] IOPS was over the place [14:47:27] and I was worried [14:47:39] bad buffer pool stats, etc [14:47:43] no problem, then [14:47:44] no, this time it was me :) [14:56:48] 10DBA, 06Labs, 06Operations, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#2965291 (10jcrespo) [14:56:51] 10DBA, 06Labs, 10Tool-Labs: enwiki_p replica on s1 is corrupted - https://phabricator.wikimedia.org/T134203#2965290 (10jcrespo) 05Open>03Resolved [15:00:05] jynus: I need another pair of eyes here, can you check if db1052 and db1073 are stopped on the same position: 232441149 [15:00:39] did you use the script? [15:00:42] yes [15:01:16] 10DBA, 10Analytics, 10Analytics-Cluster: Purge MobileWebWikiGrok_* and MobileWebWikiGrokError_* rows older than 90 days - https://phabricator.wikimedia.org/T77918#2965310 (10Ottomata) 05Open>03declined [15:02:12] but one is using mixed binlog_format [15:02:24] which makes no sense for what you want to do [15:02:27] yes, I haven't touched those yet [15:02:43] then why stop them? [15:02:46] just as a test? [15:02:51] oh - I was testing the script sorry [15:02:52] yes [15:02:56] ok ok [15:03:27] sorry to be asking some many checks, but it is a delicate operation as we can mess up db1095 and I prefer to have another pair of eyes here :) [15:04:42] yes, it worked [15:05:03] ok, going to start replication again and change the binlog to row on db1073 [15:05:05] the only place it doesn't work is on delayed slaves [15:06:16] or very lagging slaves [15:06:25] yeah I can imagine it goes a bit crazy [15:06:53] once we have gtid everywhere, it will be easier [15:07:57] ok, db1073 is now running ROW [15:08:20] check that with mysqlbinlog, reset the file, etc. 
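A sketch of the binlog-format switch and the verification step mentioned at the end here ("check that with mysqlbinlog, reset the file"), using SHOW BINLOG EVENTS instead of the mysqlbinlog CLI; the binlog file name is a placeholder. Note that SET GLOBAL only affects new connections, which is exactly the "cached old format" problem discussed just below:

```
-- On db1073:
SET GLOBAL binlog_format = 'ROW';

-- Rotate so the new file only contains post-change events.
FLUSH BINARY LOGS;
SHOW MASTER STATUS;   -- note the new binlog file name

-- ROW-format writes show up as Table_map + Write_rows/Update_rows events;
-- writes from sessions opened before the change still appear as plain
-- Query (statement) events until those sessions go away or the server restarts.
SHOW BINLOG EVENTS IN 'db1073-bin.003210' LIMIT 20;   -- placeholder file name
```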
[15:08:27] yep, i have flushed the log [15:08:48] I stressed that because doing it hot sometimes has issues [15:09:01] I think running transactions long in the old format [15:09:08] because they have cached the old format [15:09:15] ah, interesting [15:09:27] which is why I didn't want to do everything in a single operation [15:09:34] indeed [15:09:43] *log [15:12:20] still running statement based :) [15:12:27] yes [15:12:49] so I was right to worry :-) [15:12:53] totally right [15:12:57] we might need to restart [15:13:18] btw: https://gerrit.wikimedia.org/r/#/c/333850/ (i included db1073 on the same change) [15:13:50] maybe let's separate the 73 [15:14:04] so we run without hacks [15:14:19] from the 52 change? [15:14:28] you mean split the two commits? [15:14:40] or just applying that, not restart 52 [15:14:42] ? [15:15:01] Ah, I see your point [15:15:03] let's split [15:15:08] let's not mess things yes [15:15:33] let's commit .73 first [15:15:36] and restart 73 [15:15:46] I do not mind [15:15:53] but I like to restart with final configs [15:16:08] so I am 100% I do not get an error in half a year [15:16:14] yep [15:16:14] *sure [15:16:27] let me split db1073 first [15:16:36] so we can restart it, make sure it works and then repoint db1095 [15:16:42] and then restart db1052 with no rush [15:17:05] note those are just my own way of working [15:17:23] because for me it helps me keep track of 10 things at the same time [15:17:30] no, but I agree, I am scared of pushing that and if for whatever reason db1052 gets restarted now…db1095 bye [15:17:46] keep on puppet the real state [15:17:54] rather than the desired state [15:18:19] with very few exceptions like the tls certs, etc. [15:20:09] you abandoned the commit but it was easier to repurpose it [15:20:22] or just create another and rebase the old one [15:20:52] i had a bit of a mess with my branches already, so I didn't want to mess it even more, better a fresh start (at least for my mind) [15:38:20] 10DBA, 06Labs, 10Tool-Labs: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2965491 (10Tb) Apologies, double-underscores seem to get eaten by Phabricator's markup parser. On s1.labsdb: ``` p50380g50491_common p50380g50491__rlrl_enwiki_p p50380g50491__rlrl_ptwiki_p... [15:45:39] 10DBA, 06Labs, 10Tool-Labs: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2965532 (10Marostegui) Thanks for the clarification. I have granted access to those databases. Please check them and let us know if that works! [16:04:15] 10DBA, 06Labs, 10Tool-Labs: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2958263 (10jcrespo) @Marostegui and others ops, grants are wildcards, **never** use _ without escaping (\_) on a grant. It is not a big deal here, but it can lead to security problems. [16:16:33] 10DBA, 06Labs, 10Tool-Labs: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2965610 (10jcrespo) @Tb your grants have been added- you should be able to access old data- however, you should consider those grants temporary, until you rename the databases to start with `...
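On the grants point raised in T155902 above (in a GRANT, the database name is a pattern, so an unescaped underscore matches any character), a minimal illustration; the account and host are placeholders, not the real grantee:

```
-- '_' in the database part of a GRANT is a single-character wildcard:
GRANT ALL PRIVILEGES ON `p50380g50491_common`.*  TO 's51111'@'%';  -- also matches e.g. p50380g50491Xcommon
GRANT ALL PRIVILEGES ON `p50380g50491\_common`.* TO 's51111'@'%';  -- escaped: matches only that database
```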
[16:17:31] jynus: let me know when you are around [16:17:46] I am here [16:17:51] so, https://gerrit.wikimedia.org/r/#/c/333926/2 [16:17:59] yes, merge [16:18:01] I can deploy that, and restart db1072 [16:18:52] I can do that if you want, I am blocked on the reimage for that [16:18:54] note that the host is db1072 (the one that belongs to vslow, dump) not db1073 as I said previously (which is API) [16:19:10] yes, that makes more sense to me [16:19:13] :) [16:19:41] although maybe we should depool it? [16:19:52] i did [16:20:17] looks pooled on: https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php [16:20:36] mmm [16:22:04] check now :) (forgot the fetch+rebase) so many things at the same time [16:22:26] where may I heard that before? [16:22:35] haha [16:26:07] going to restart mysql on db1072 [16:27:26] let's do it fast, I do not like having that role on 80 for long [16:27:32] yep [16:27:51] it is restarting now [16:29:30] would you have chosen an API host to help with vslow? [16:30:00] no, any is a bad idea, I just want it pooled back fastly [16:30:11] bad idea for long time [16:30:13] ok [16:30:28] it tends to create a large ibdata1 [16:30:35] because 24-hour selects [16:30:40] :| [16:30:43] right [16:30:49] (undo section) [16:31:01] the less times, the less likely we have such issues [16:31:10] *time it is there [16:31:29] sure [16:31:35] still restarting [16:32:08] will prepare the revert meanwhile [16:34:05] starting mysql now [16:34:48] which version was that running? [16:35:02] .23 [16:35:08] ok [16:37:28] the binlog looks good now [16:38:11] nice [16:38:17] sorry for so much pain [16:38:21] let's stop both hosts [16:38:24] it will be easier in the future [16:38:34] as we are aiming for row (maybe) [16:38:34] it is not your fault! :) [16:38:38] 10DBA, 06Labs, 10Tool-Labs: enwiki_p replica on s1 is corrupted - https://phabricator.wikimedia.org/T134203#2965716 (10Superyetkin) Could you please share the configuration details of new servers? Most of my tools [[http://tools.wmflabs.org/superyetkin/kategorisizsayfalar.php | like this]] (running on trwiki... [16:43:55] jynus: both servers stopped at 477438361 - please double check [16:44:23] 10DBA, 06Labs, 10Tool-Labs: enwiki_p replica on s1 is corrupted - https://phabricator.wikimedia.org/T134203#2965719 (10jcrespo) @Superyetkin I cannot guarantee it will not change in the future, but you can connect, in the case of **enwiki** to the `labsdb-web.eqiad.wmnet` host for short-lived, web-like reque... [16:44:52] if that is ok, this is what I would run on db1095: https://phabricator.wikimedia.org/P4795 [16:45:23] no [16:45:24] that is wrong [16:45:36] yes, it is wrong [16:45:42] you want the master coordinates [16:45:42] just changed [16:45:44] yep [16:45:48] just changed it [16:46:31] it should use ssl, not sure if it already done, so it is not needed [16:47:24] db1095 isn't using ssl at the moment [16:48:00] ok, adding MASTER_SSL=1 should fix that [16:48:07] at the end [16:48:28] just added, refresh :) [16:48:30] it doesn't hurt, if there is any problem, it will complain [16:48:44] just start the io_thread first [16:48:52] (shouln't be needed anyway) [16:49:13] looks good [16:49:45] ok [16:49:47] let's go then [16:51:45] io thread running [16:52:00] let's go for sql [16:52:43] 10DBA, 06Labs, 10Wikidata, 07Performance, and 3 others: Increase quota for wikidata-dev project - https://phabricator.wikimedia.org/T155042#2965728 (10Ladsgroup) We cleaned some instances and it's okay now. Probably we will make more soon. 
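The paste discussed above (P4795) is not reproduced in the log, but from the conversation (use the new master's own coordinates, add MASTER_SSL=1, start the IO thread first) it would look roughly like this; user, password and coordinates are placeholders, and on a multi-source host each statement would also carry the connection name:

```
-- On db1095, pointing it at db1072 (all values below are placeholders).
CHANGE MASTER TO
  MASTER_HOST = 'db1072.eqiad.wmnet',
  MASTER_PORT = 3306,
  MASTER_USER = 'repl',                    -- placeholder replication account
  MASTER_PASSWORD = '********',
  MASTER_LOG_FILE = 'db1072-bin.003214',   -- db1072's SHOW MASTER STATUS file...
  MASTER_LOG_POS  = 38678449,              -- ...and position
  MASTER_SSL = 1;

-- Start the IO thread alone first: a TLS or credential problem fails here
-- before any event is applied.
START SLAVE IO_THREAD;
SHOW SLAVE STATUS\G
-- If Slave_IO_Running = Yes, start applying events:
START SLAVE SQL_THREAD;
```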
[16:52:56] both running [16:53:04] let's start replication on db1072 then? [16:54:47] yes [16:54:54] let's go then [16:55:15] done [16:55:29] looking good [16:55:40] crashed [16:55:53] what? [16:55:58] dup entry [16:56:09] are you for real? [16:56:22] yes :( [16:56:45] well, you know what that means [16:56:50] i know yes :( [16:57:55] i was kind of expecting it to be honest [16:58:04] why? [16:58:21] i wasn't completely convinced that db1072 and db1052 would have exactly the same data [16:58:49] you think it is not a coordinate problem? [16:59:03] let's try to get rid of this row [16:59:06] and see what happens [17:00:49] wait [17:00:53] sure [17:00:56] which entry did you move to? [17:01:07] i haven't deleted anything [17:02:19] I mean you changed to log pos 42375201 [17:02:24] no [17:02:28] 38678449 [17:02:34] yes [17:03:00] yeah, that is 4 MBs of changes [17:05:44] it is also the great change_tag table, no PK there [17:09:25] it could be just a "simple" schema change drift [17:09:52] I see the same insert twice [17:10:03] on the binlog [17:10:05] interesting... [17:10:43] and probably it is that [17:10:46] is it exactly the same? [17:10:50] not at all [17:10:54] one has unique keys [17:10:58] the other doesn't [17:11:26] have you altered data or anything yet? [17:11:32] no [17:11:39] nothing [17:11:49] this is a mess [17:11:58] and that is why master changes are a problem [17:12:08] not because operational problems [17:12:11] content is a mess [17:12:39] yes :( [17:12:56] so s1-master contains the unique keys [17:13:08] I do not get it [17:13:16] db1052 also has unique keys [17:13:19] maybe it was an insert ignore? [17:13:51] run: /opt/wmf-mariadb10/bin/mysqlbinlog --no-defaults /srv/sqldata/db1072-bin.003214 --start-position=42374334 -vv --base64-output=DECODE-ROWS | less [17:13:55] on db1072 [17:14:06] s1-master unique keys, db1052 - unique keys, db1072 - no unique keys - db1095 - unique keys [17:14:06] so I am not imagining things [17:14:23] let me check [17:14:47] search for key 905199194 [17:15:26] 905199194-mobile edit is there twice, isn't it? [17:15:28] it is the same [17:15:42] how can that be on the binary log? [17:17:01] i was checking db1072 slave_exec_mode just in case [17:17:04] but it is strict [17:17:41] I am not imagining things [17:17:46] no [17:17:48] it is the same insert [17:17:57] SELECT * FROM change_tag WHERE ct_rc_id=905199194 and ct_tag='mobile edit'; [17:18:00] run that^ [17:18:06] on all servers involved [17:18:17] there are 2, on db1072 [17:18:27] and none on db1052 [17:18:28] which is by schema impossible [17:18:29] and db1095 [17:18:45] well, it can have been deleted later [17:19:02] but there are 2, and it is impossible for 2 to be there on other servers [17:19:11] by schema [17:19:26] it is super weird [17:19:29] I don't get it [17:20:21] well, it is what happens when you do not have primary keys [17:20:40] and different schemas on every server [17:20:48] :( [17:21:08] what if we add the unique keys now to db1072? [17:21:14] well, we can have loads of dup rows [17:21:19] well, we should [17:21:39] but since replica broke till now we do not know what will happen [17:22:30] those two rows aren't present on db1052 [17:22:45] is 52 slave stopped? [17:22:56] yep [17:23:22] ok, then drop rows, let's add the unique keys and pray [17:23:27] xdddd [17:24:26] we do not want to replicate this, right? [17:24:27] it is only 25 million tags [17:24:35] *rows on tag [17:24:37] yes, it is 5G table [17:24:39] should be fast?
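A sketch of the checks and of the "drop rows, add the unique keys" idea being weighed above. The DELETE is purely illustrative (change_tag has no primary key, so LIMIT is used to remove just one of two identical rows); in the end db1072 was set aside rather than fixed this way:

```
-- Duplicate (ct_rc_id, ct_tag) pairs — the query run above:
SELECT ct_rc_id, ct_tag, COUNT(*) AS n
FROM change_tag
GROUP BY ct_rc_id, ct_tag
HAVING COUNT(*) > 1;

-- Any local cleanup must not be written to the binlog, since db1095
-- replicates from this host:
SET SESSION sql_log_bin = 0;

-- Illustrative: remove one of the two identical rows for a single pair.
DELETE FROM change_tag
WHERE ct_rc_id = 905199194 AND ct_tag = 'mobile edit'
LIMIT 1;

-- Then bring the schema back in line with the other s1 hosts:
ALTER TABLE change_tag ADD UNIQUE KEY `ct_rc_id` (`ct_rc_id`, `ct_tag`);

SET SESSION sql_log_bin = 1;
```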
[17:24:39] not too big [17:24:42] i think so [17:24:48] yeah, the usual script [17:24:57] it is depooled anyway [17:25:04] no, I mean to delete the rows [17:25:09] we don't want to replicate that delete [17:25:44] well, let's try the alter first [17:25:48] see where it complains [17:25:52] ok [17:25:59] let's try one by one key then [17:26:01] but no binlog, too [17:27:29] I'm running "SELECT ct_rc_id, ct_tag FROM change_tag GROUP BY ct_rc_id, ct_tag HAVING count(*) > 1" [17:28:01] i am going to run (with the script no replicate): alter table change_tag add UNIQUE KEY `ct_rc_id` (`ct_rc_id`,`ct_tag`); [17:28:14] 66653 rows in set (36.30 sec) [17:28:18] so don't bother [17:28:24] pf [17:29:25] let's move db1095 to another host [17:29:49] let's see what we have [17:30:09] the above query on 52 gives 5 results [17:30:21] but id null [17:30:28] which may or may not make sense [17:30:43] and then you ask me why I want to move to row based replication? [17:30:47] :-) [17:30:49] haha [17:31:03] that query on db1095 gives 5 results too [17:31:27] or why I want to make regular runs of pt-table-checksum [17:31:53] ˜/jynus 18:29> let's move db1095 to another host -> what did you mean with that? [17:31:59] its master [17:32:30] sorry, i am lost now, how would you start its replication from another random host? [17:32:46] what do you mean how? [17:32:59] we have its master corrdinates [17:33:05] not only we have the gtid coords [17:33:17] but we can match them to another slave even without that [17:33:41] oh, gtid yes [17:33:47] even without that [17:34:30] ok [17:34:30] so [17:34:49] it is stopped at db1072-bin.003214:42374334 [17:34:59] we need to decide if we want an api host or a main traffic one, there are no more options [17:35:10] whatever works [17:35:17] which has a sane change_Tag structure [17:35:28] db1066 does [17:37:31] db1066 gives 5 rows for: SELECT ct_rc_id, ct_tag FROM change_tag GROUP BY ct_rc_id, ct_tag HAVING count(*) > 1 [17:37:38] so looks like db1052 in that regard [17:37:59] shall we go for db1066? [17:39:03] 10.016 [17:39:14] let's use the other, 65 [17:39:23] i am running the query there [17:39:34] let me prepare the puppet patch for it (I am reverting db1072's one) [17:40:26] db1065 gives 5 results too [17:40:35] the same as db1095 and db1052 [17:42:01] let me confirm the rows are not new [17:42:19] and that we didn't break db1072 today [17:42:57] they shouldn't keep in mind db1052 is stopped [17:43:19] no, what I mean is [17:43:49] there is a change we broke it by activating row binlog_format [17:43:53] I want to discard that [17:44:00] mmm [17:44:04] before we break more host, if we did [17:44:11] ok ok [17:45:37] also, this table makes no sense [17:45:43] it stores rc_ids [17:45:51] but those disappear [17:46:33] nah, I can see this happening since 20161226064653 [17:46:54] and much before, but rows on recentchanges have already been deleted [17:47:38] we should file a bug, independently of the schema [17:47:47] this is probably executing unsage statements [17:48:27] like insert ignore or replaces [17:48:34] nah, I can see this happening since 20161226064653 -> :o [17:49:24] so, ok, let's move the slave to 65 [17:49:35] let's consider 72 broken [17:51:38] ok [17:51:51] let me push the patch so we can forget about .72 [17:52:00] (the puppet patch) [17:52:01] which one? [17:52:08] to get it back to MIXED binlog [17:52:19] 10DBA, 06Labs, 10Tool-Labs: Reset password for database user p50380g50491 - https://phabricator.wikimedia.org/T155902#2965899 (10Tb) Great thanks. 
although I missed one in the list above; can you grant all to s51111 on p50380g50491_inconsistent_redirects on s1.labsdb also please. [17:52:21] ok, but forget about that [17:52:47] if it is broken it is not a priority [17:53:17] so we need to chose a host for vslow [17:53:34] I would actually start by restarting replication on db1052 [17:53:41] and moving the slave back there [17:53:53] so we have extra available slaves [17:54:39] or should we be ok? [17:54:58] no, i think that is a good idea [17:55:08] maybe we are ok [17:55:13] let's depool 65 [17:55:19] ok, let me do it [17:55:21] we still have 2 apis [17:55:28] restart it [17:55:49] in row [17:56:04] ok,. will depool and restart it in row [17:56:13] you take care of db1052? [17:56:24] well, it is ok like that [17:56:30] for now [17:56:45] you can do mediawiki [17:56:47] I will do puppet [17:56:51] ok [17:57:05] let's sync here [17:57:11] sure [18:03:03] db1065 depooled [18:03:29] so I am not sure about the order [18:03:44] I think we can do https://gerrit.wikimedia.org/r/333952 now? [18:03:49] checking [18:04:07] https://gerrit.wikimedia.org/r/333953 too? [18:04:09] yes, i think it is safe to do that [18:04:18] yep [18:04:24] i think we can go for both of them [18:04:27] will merge both, then [18:04:31] I have silenced alerts for db1065 [18:04:40] thanks, I had forgotten [18:04:49] db1072 is also silenced [18:08:31] ok, done [18:08:51] so next thing is to restart db1065 [18:08:57] I will take care of that [18:09:01] do you restart db1072? [18:09:12] no [18:09:15] not yet [18:09:19] ok [18:09:26] I do not like to have the master not on row [18:09:37] even if it is stopped or literally broken :-) [18:09:39] yeah [18:09:41] haha [18:09:44] step by step [18:09:47] running puppet on db1065 and then will restart mysql [18:09:54] again, that is only to help my workflow [18:10:01] not a general advice [18:10:08] prevents me from making mistakes [18:10:24] I already run puppet [18:10:33] when I meant merged I mean also locally applied [18:10:42] and checked the diff was done [18:10:46] yeah, i saw no changes [18:10:48] I am stopping mysql now [18:11:07] I double checked the successful depool [18:11:16] thanks :) [18:13:15] starting mysql [18:13:53] ok, so we have gtids, but I want to double check the coordinates [18:14:21] db1065's looks good, row based [18:15:00] I think after this, we should not wonder why labs was broken [18:15:13] but how production was not more broken than now [18:15:16] yeah, it is a good practical execercise [18:15:20] excercise [18:15:31] I had to disagree [18:15:34] *have [18:15:41] haha [18:15:50] you know what I meant!! :) [18:16:53] let me share an etherpad with you so you can see my calculations [18:16:59] ok [18:17:19] not for anything in particular, I just do not want you to be there 5 minutes [18:17:25] waiting [18:55:18] ok [18:55:25] so let's move db1095 [18:55:31] ok [18:55:41] let's go for it then after all the archeology [18:55:58] hopefully db1052 will catch up soon [18:56:09] and we can move the slave again, too [18:56:22] it is having a good pace [18:56:38] or we can do archeology again :-) [18:56:44] haha [18:56:54] i rather have dinner :p [18:57:04] it is catching up quite quickly [18:57:15] the script, btw, allos to change a master to sibling and viceversa [18:57:28] the repl.pl? 
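For the "double check the coordinates" step above, a sketch of the standard MariaDB status views one would compare (the etherpad with the actual calculations is not part of the log):

```
-- On db1095 (the replica being repointed):
SHOW ALL SLAVES STATUS\G                 -- per-connection Exec_Master_Log_Pos, Gtid_IO_Pos
SELECT @@GLOBAL.gtid_slave_pos;          -- what it has executed, as GTIDs

-- On db1065 (the candidate new master):
SHOW MASTER STATUS;                      -- current binlog file / position
SELECT @@GLOBAL.gtid_binlog_pos,         -- what it has written to its own binlog
       @@GLOBAL.gtid_slave_pos;          -- what it has itself executed from s1
```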
[18:57:29] nice [18:57:32] which can be used to move a slave by doing it twice [18:57:36] but [18:57:44] I have not tested since we applied gtid and tls [18:57:47] may need update [18:58:09] we cannot use it here because we only have 2 servers with tow [18:58:11] *row [18:58:20] so we cannot move it in 2 steps [18:58:32] but we can use the "stop in sync" [18:58:37] once it catches up [18:59:03] we can move db1095 now, though [18:59:09] we do not need to wait [18:59:23] i will check that alert [19:03:03] 10DBA: db2060 not accessible - https://phabricator.wikimedia.org/T156161#2966076 (10Marostegui) [19:03:59] 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: db2060 crashed (RAID controller) - https://phabricator.wikimedia.org/T154031#2966089 (10jcrespo) [19:04:01] 10DBA: db2060 not accessible - https://phabricator.wikimedia.org/T156161#2966088 (10jcrespo) [19:04:58] jynus: db1052 caught up [19:05:56] it doesnt matter, let's run the change master [19:06:53] * marostegui crosses his fingers [19:07:00] should I? [19:07:03] sure [19:08:53] i see it going thru [19:09:30] looks good [19:09:31] yeah [19:09:42] yes :) [19:09:58] look at the bright side [19:10:06] we didn't chose db1052 [19:10:09] I meant [19:10:12] db1072 [19:10:22] as the next master [19:10:28] shit [19:10:31] now you really scaredme [19:10:34] scared me [19:10:36] :| [19:10:40] why? [19:10:47] imagine if we had chosen it! [19:10:58] well, all replicas broken on production at the same time [19:11:09] it wouldn't have been the first time :-) [19:11:12] come on…only s1 :p [19:11:27] well, "I" broke commons [19:11:32] so no big deal [19:11:37] haha [19:11:45] what happened? [19:11:53] commons is more broken than enwiki [19:12:23] I think we will be ok, becase this is only 1 host, and it should only affect duplicates [19:12:48] commons archive is, or was broken when copying rows to image [19:12:52] and viceversa [19:13:05] sounds like a great way of breaking things [19:15:14] so if 52 and 65 are up to date [19:15:25] we can run the same repl command that this morning [19:15:36] but for 52 and 65 [19:15:41] do you have that handy? [19:16:00] if not, I can set it up [19:16:31] yes [19:16:43] we need to make sure however [19:16:57] that db1095 is up to date [19:17:04] ./repl.pl --stop-siblings-in-sync --host1=db1065.eqiad.wmnet:3306 --host2=db1052.eqiad.wmnet:3306 [19:17:07] yes [19:17:08] it will complain anyways [19:17:09] if it is delayed [19:17:15] no no [19:17:17] 95 [19:17:44] sanitarium [19:17:57] ˜/jynus 20:15> but for 52 and 65 ->? [19:18:09] we need to stop db1052 and db1065 at the same time, right? 
[19:18:13] the repl run is ok [19:18:14] yes [19:18:17] aaah [19:18:18] yes yes [19:18:18] and you can run it now [19:18:20] i know what you mean [19:18:23] I mean for the next step [19:18:27] yeah yeah [19:18:34] 95 is now 1 hour delayed [19:18:39] we need to wait for db1095 to catch up before run the master change [19:18:45] when you did it this morning [19:18:53] it was almost instant [19:19:25] just finished it [19:19:33] db1052 and db1065 stopped [19:19:43] so now wait [19:19:48] I can take it from here [19:20:12] I was going to reimage 66 after that [19:20:56] haha yeah, we were talking about db1066 3 hours ago XD [19:21:03] so, both are stopped at: 825236309 [19:21:07] db1065 and db1052 [19:21:22] SHOW MASTER STATUS is what we want [19:21:26] we can get it now [19:21:45] i am going to grab some dinner and rest a bit, the day is almost over :_( [19:21:53] please call me if needed [19:21:58] db1065-bin.003539 | 123581041 [19:22:04] ^just confirm [19:22:08] and bye [19:22:09] confirmed [19:22:18] it looks good [19:22:40] ok, see you tomorrow [19:22:47] remember the change master 's1' on db1095 (I am saying it because while setting up dbstore2001 i tended to forget it :) ) [19:22:57] it complains [19:23:03] so I am not too worried [19:23:06] :) [19:23:21] will see you tomorrow, please, call me if you need anything [19:23:28] thanks for all the help [19:40:32] 10DBA, 10MediaWiki-Change-tagging, 06Operations: db1072 change_tag schema and dataset is not consistent - https://phabricator.wikimedia.org/T156166#2966240 (10jcrespo) [19:44:01] 10DBA, 10MediaWiki-Change-tagging, 06Operations: db1072 change_tag schema and dataset is not consistent - https://phabricator.wikimedia.org/T156166#2966280 (10jcrespo) Adding @TTO and @Cenarium because they may know the actual right people to add to this ticket (probably not them) for the mediawiki bug side...
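A sketch of the final repoint mentioned at the end ("remember the change master 's1' on db1095"), applicable once db1052 and db1065 were stopped in sync and db1095 had caught up to that same point. The file and position are the ones quoted above; the replication account and password are placeholders:

```
-- On db1095, move the s1 connection from db1072 to db1065.
STOP SLAVE 's1';
CHANGE MASTER 's1' TO
  MASTER_HOST = 'db1065.eqiad.wmnet',
  MASTER_PORT = 3306,
  MASTER_USER = 'repl',                    -- placeholder replication account
  MASTER_PASSWORD = '********',
  MASTER_LOG_FILE = 'db1065-bin.003539',   -- db1065's SHOW MASTER STATUS, as confirmed above
  MASTER_LOG_POS  = 123581041,
  MASTER_SSL = 1;
START SLAVE 's1' IO_THREAD;
SHOW SLAVE 's1' STATUS\G
START SLAVE 's1' SQL_THREAD;
```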