[09:07:14] let's put those codfw servers in as soon as you do some smoke tests (even earlier than usual). We are running out of time.
[09:08:27] ok, I'll run the counts now
[09:09:34] we do not need to be 100% certain, just some fast-running ones to check nothing is lost
[09:12:08] I set up neodymium and sarin as the centralized query place
[09:12:48] I deleted access from iron, that was what caused the issue yesterday
[09:13:29] because of the missing events?
[09:13:30] a chain of problems - events were created with the default definer (root@iron)
[09:13:49] when those were deleted, they stopped working
[09:14:04] when they stopped working, long-running queries were not killed
[09:14:22] usually there are no long-running queries, but there are a couple of bugs creating those
[09:14:37] so, 12 hours later, one got stuck and created issues
[09:15:06] as you can see, the stupidest change can have large consequences
[09:15:42] happily, no users were affected
[09:15:54] yep, complex chain of events
[09:16:05] I have uploaded the code, which was nowhere, to the repo
[09:16:17] saw the commit
[09:16:33] and applied the right one to all servers, with some changes
[09:16:47] mainly, disabling the binlog (otherwise replication breaks)
[09:17:02] and some masters still had the slave's code, so I changed it
[09:17:20] I left the old user around on sanitarium (db1069)
[09:17:33] because it still owns all the triggers
[09:17:55] and there would be like 5000 triggers to be recreated
[09:19:08] not worth it, we will do it when those are recreated for MariaDB 10.1
[09:19:40] ok, makes sense
[09:20:24] in the end it all comes back to technical debt
[09:21:47] do not worry too much about yesterday, if no one had responded, mediawiki would have depooled that slave and everything would have gone back to normal (as I have a 2nd watchdog on the other slave)
[09:26:39] is it ok if I go to this failover meeting? there are a couple of things I need to ask
[09:27:50] sure, no prob
[09:28:04] if needed I can join too, they moved from hangout so there should not be a limit
[09:28:26] no problem with that, only if there is a limit
[09:32:03] do you think this is ok for the count? https://phabricator.wikimedia.org/T127330 (last comment)
[09:34:34] actually, no, because it will change on es2/3 with replication, put some limit somewhere based on the id
[09:35:29] oh yes, I didn't specify... this is for es1, working on es2/3
[09:35:30] :)
[09:35:42] also, do not try to count all the tables or you may die waiting
[09:36:08] parallelize some, that is ok
[09:36:36] but do some heuristics counting fewer rows or it will take forever
[09:37:32] BTW, you have a pastebin equivalent at https://phabricator.wikimedia.org/paste/edit/form/14/
[09:37:59] oh cool, didn't know
[09:38:22] even counts on the primary key will still take that long? Ok, I'll do some heuristics
[09:38:33] (remember we are talking about 12-15TB of tables)
[10:24:49] volans: https://phabricator.wikimedia.org/P2727
[10:25:25] check the time and see if that gives you an idea of how to repeat it / make it faster/slower, etc.
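For illustration, a minimal sketch of the kind of id-bounded, heuristic count being discussed; the table name and the bound are hypothetical, not taken from the paste:

```sql
-- Hypothetical example: cap the scan by primary key so the count stays fast
-- and gives the same answer on es2/es3 even while replication keeps adding rows.
SELECT COUNT(*)
FROM blobs_cluster24          -- hypothetical table name
WHERE blob_id <= 10000000;    -- fixed upper bound chosen in advance

-- MAX() on the primary key is instant, so it is a cheap first comparison
SELECT MAX(blob_id) FROM blobs_cluster24;
```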
[10:26:03] max(blob_id) is instant due to the primary key
[10:26:45] ok, thanks
[10:26:45] and reading 10 million cold records (without returning them) takes 5 minutes
[10:27:18] count(1) does not do much for innodb vs count(*), due to the clustered index
[10:28:02] it uses the PK/row (which is the same on innodb) for jumping between records
[10:28:27] yes, old habit to do count(1) :)
[10:30:03] it is ok
[10:30:28] I've seen many workmates do it, not sure if it is for oracle or another db
[10:30:50] I was just trying to be helpful :-)
[10:31:56] and you are!
[11:12:03] so on es2005 we have ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=16 (replicated to es2014) but we don't have it on es2006 and es2007
[11:14:23] yeah... we should revert that, but no time before the failover
[11:14:47] we will just give es2014 a lower weight
[11:15:00] ok
[11:15:37] it was a test when codfw was inactive to see if it would give us one extra month
[11:17:11] it only saved 41 GB
[11:22:43] practically nothing :)
[11:25:30] the other ops tend to not take me seriously when I say "there is only 100GB left on this machine", but I think filipo, alex and otto are the only ones that handle large amounts of data, too
[12:18:09] I am thinking of creating the link db2016 -> db1057 instead of to the master, to avoid running without TLS, but that would mean it should be the new master after the failover
[12:18:34] I need to think about it, taking a lunch break
[12:20:45] ok, I'll think about it too
[14:06:42] updated https://phabricator.wikimedia.org/T127330 with results and a link to the paste. All seems reasonably good and times are promising on the new servers
[14:10:49] do you mean that you are happy and ready to failover or that the tests are still running?
[14:11:04] if the first, feel free to do it
[14:12:12] I would like to do some proper checksums of all servers and all data at some point, but that is not a priority right now
[14:12:49] and yes, the times are nice :-)
[14:14:19] All the CRC32s that I've done match, the counts too, the schemas are the same; I can continue with other schemas, but I'm reasonably confident that we are good
[14:14:30] 100% agree
[14:15:52] I'll change my pending repool, adding them too with weight 1
[14:25:57] for background, here is why I am a bit paranoid when moving data around: https://phabricator.wikimedia.org/T26675
[14:27:12] interesting...
[14:30:55] * volans brb
[14:30:58] Re: repooling, the only important thing is to make a "new node" the local master so replication doesn't break when they get filled up in a few days
[14:42:47] yep, that was my next change, didn't want to do it all at once, but I can unify them :)
[16:16:18] * volans needs to go out for a few minutes, bbl
[16:58:44] * volans back
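A sketch of the kind of chunked CRC32-plus-count comparison mentioned above, run identically on source and target and compared by hand; the table name, columns and id range are assumptions:

```sql
-- Run the same query on both servers and compare the results.
-- BIT_XOR of per-row CRC32s is order-independent, so it works across replicas.
SELECT COUNT(*)                                            AS row_count,
       MIN(blob_id)                                        AS min_id,
       MAX(blob_id)                                        AS max_id,
       BIT_XOR(CRC32(CONCAT_WS('#', blob_id, blob_text)))  AS crc
FROM blobs_cluster24                    -- hypothetical table name
WHERE blob_id BETWEEN 1 AND 10000000;   -- one chunk at a time
```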
[18:10:21] I wanted to summarize the ones that do not depend on mediawiki deployment in a script
[18:10:51] and invite others to edit the same script
[18:11:05] agreed, and, if you agree, we could "prepare" the topology so that given a shard we have:
[18:11:15] master -> new_master -> slaves
[18:11:31] yes, well, on codfw there is nothing to prepare
[18:11:44] we only need to see the new masters on eqiad
[18:12:04] yes, on eqiad
[18:12:23] however, I think it will be easier to failover from the current masters, and then failover the eqiad clusters
[18:12:45] works too
[18:12:46] which means no TLS codfw -> eqiad for a few seconds
[18:13:20] the reason being that the circular replication requires going back to the original master
[18:13:41] otherwise we will break replication, executing the transactions an infinite number of times
[18:14:02] we could just not do circular replication
[18:14:19] yes, of course
[18:14:47] but then the topology gets stranger, takes more time, etc.
[18:15:09] the main problem being that mediawiki doesn't support multi-tier slaves
[18:15:43] so Master -> Secondary Master -> slaves doesn't really work
[18:15:58] ah true, sorry, forgot the auto-depooling system
[18:16:09] it is the chronology checker
[18:16:17] it is based on the binlog
[18:16:33] and the binlog from the master is not shared with the slaves in the 3rd tier
[18:16:49] we will fix that, either with pt-heartbeat or GTIDs
[18:16:52] but not in time for it
[18:16:59] clearly not
[18:17:02] :)
[18:17:03] (for next week)
[18:17:07] or 2 weeks
[18:17:16] we depend on the mediawiki devs
[19:04:05] should I depool the old es2001-es2009 too or do we wait until they really run out of space? what's the plan for those servers in their next life? :)
[19:05:24] decomm, probably
[19:05:31] which will need its own ticket :-)
[19:06:22] let's depool all the old ones already (but keep them replicating, at least until they fail)
[19:06:33] and close the ticket
[19:06:40] icinga?
[19:06:50] that is part of the decom process
[19:07:05] to avoid pages when they fail
[19:07:14] https://wikitech.wikimedia.org/wiki/Server_Lifecycle
[19:07:47] you can disable icinga if you want, too, but eventually they will be eliminated
[19:08:03] ok, just want to avoid unnecessary pages
[19:08:07] sure
[19:10:48] can you also check the es parts of https://gerrit.wikimedia.org/r/#/c/267659/8/wmf-config/db-codfw.php
[19:11:30] sure
[19:11:58] it could be a rebase artifact, or an existing issue, but as you are probably connected to it...
[19:13:03] I think it is an existing issue
[19:13:17] I hate having the masters in 2 files, that is error prone
[19:14:34] you need to rebase, right?
[19:15:41] yes, that is not on top of the latest config
[19:15:52] is it ok live?
[19:16:42] yes, my changes are already live (git and tendril tree)
[19:16:49] no, I know
[19:17:13] I mean if the db-codfw.php config has the same issue (pointing to the wrong master)
[19:17:25] the one currently active
[19:18:02] yes '10.64.16.187' => 0, # es1019, master
[19:18:40] the ip is wrong, the master is es1019
[19:18:52] ok, I will fix that
[19:18:58] it should be 10.64.48.116
[19:19:36] I see it now, compared to eqiad
[19:20:01] I am going to go blind one of these days
[19:20:08] probably some old copy/paste/merge between the 2 files?
[19:20:27] yeah, I failovered the es1* master several times
[19:20:42] it is way easier than the db s? hosts
[19:21:18] so, now that you are nearly finished, do you know what you want to work on next?
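As an illustration of the per-slave step that "prepare the topology" implies (not the actual runbook), a hedged sketch; the host name, replication user and binlog coordinates are placeholders:

```sql
-- Hypothetical example: repoint one slave to the new local master over TLS.
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST     = 'es2014.codfw.wmnet',   -- placeholder new local master
  MASTER_USER     = 'repl',                 -- placeholder replication user
  MASTER_PASSWORD = '********',
  MASTER_LOG_FILE = 'es2014-bin.000001',    -- placeholder coordinates
  MASTER_LOG_POS  = 4,
  MASTER_SSL      = 1;                      -- keep replication on TLS
START SLAVE;
SHOW SLAVE STATUS\G
```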
[19:21:49] I will need you of course, at least helping with the failover
[19:22:06] I need to finish the SSL stuff and I guess we'll have a lot of stuff for the failover
[19:22:15] yep
[19:22:34] how do you see the ssl thing, is it blocked or could it be done in time for the failover?
[19:23:21] BTW, I will be out on the 24th, 25th and 28th
[19:24:09] I'll speak with Guillame tomorrow morning and get the status of his patch, they need it too so we should hopefully get it tomorrow itself
[19:24:22] nice
[19:24:24] I'll give you an update tomorrow morning, I think it is doable
[19:24:53] if we cannot do the whole CA thing, at least have it prepared for when the eqiad masters failover
[19:25:22] as all other servers are easily restartable except the primary datacenter masters
[19:26:31] also, if you are already bored of setting up servers, I can take the next batch myself
[19:26:38] would you be ok with turning SSL off for local replicas (no cross-DC) for like a day?
[19:29:03] better not :)
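On the SSL point, a quick sketch (not from the chat) of how one might verify whether a replication link is actually configured for TLS before deciding anything about turning it off:

```sql
-- On a slave: Master_SSL_Allowed and Master_SSL_Cipher in the output
-- show whether the replication connection is set up to use TLS.
SHOW SLAVE STATUS\G

-- On any server: check that TLS support is enabled at all.
SHOW GLOBAL VARIABLES LIKE 'have_ssl';
```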