[09:07:14] let's put those codfw servers in as soon as you do some smoke tests (even earlier than usual). We are running out of time.
[09:08:27] ok, I'll run the counts now
[09:09:34] we do not need to be 100% certain, just some fast-running ones to check nothing is lost
[09:12:08] I set up neodymium and sarin as the centralized query place
[09:12:48] I deleted access from iron, that was what caused the issue yesterday
[09:13:29] because of the missing events?
[09:13:30] a chain of problems - events were created with the default definer (root@iron)
[09:13:49] when those were deleted, they stopped working
[09:14:04] when they stopped working, long-running queries were not killed
[09:14:22] usually there are no long-running queries, but there are a couple of bugs creating those
[09:14:37] so, 12 hours later, one got stuck and created issues
[09:15:06] as you can see, the stupidest change can have large consequences
[09:15:42] happily, no users were affected
[09:15:54] yep, complex chain of events
[09:16:05] I have uploaded the code, which was nowhere, to the repo
[09:16:17] saw the commit
[09:16:33] and applied the right one to all servers, with some changes
[09:16:47] mainly, disabling the binlog (otherwise replication breaks)
[09:17:02] and some masters still had the slave's code, so I changed it
[09:17:20] I left the old user around on sanitarium (db1069)
[09:17:33] because it still owns all the triggers
[09:17:55] and there would be like 5000 triggers to be recreated
[09:19:08] not worth it, we will do it when those are recreated for MariaDB 10.1
[09:19:40] ok, makes sense
[09:20:24] in the end it all comes back to technical debt
[09:21:47] do not worry too much about yesterday, if no one had responded, mediawiki would have depooled that slave and everything would have gone back to normal (as I have a 2nd watchdog on the other slave)
[09:26:39] is it ok if I go to this failover meeting? there are a couple of things I need to ask
[09:27:50] sure, no prob
[09:28:04] if needed I can join too, they moved from hangout so there should not be a limit
[09:28:26] no problem with that, only if there is a limit
[09:32:03] do you think this is ok for the count? https://phabricator.wikimedia.org/T127330 (last comment)
[09:34:34] actually, no, because it will change on es2/3 with replication, put some limit somewhere based on the id
[09:35:29] oh yes, I didn't specify... this is for es1, working on es2/3
[09:35:30] :)
[09:35:42] also, do not try to count all the tables or you may die waiting
[09:36:08] parallelize some, that is ok
[09:36:36] but do some heuristics counting fewer rows or it will take forever
[09:37:32] BTW, you have a pastebin equivalent at https://phabricator.wikimedia.org/paste/edit/form/14/
[09:37:59] oh cool, didn't know
[09:38:22] even counts on the primary key will still take that long? Ok, I'll do some heuristics
[09:38:33] (remember we are talking about 12-15TB of tables)
[10:24:49] volans: https://phabricator.wikimedia.org/P2727
[10:25:25] check the time and see if that gives you an idea of how to repeat it / make it faster/slower, etc.
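For illustration, a minimal sketch of the kind of id-bounded, heuristic count being discussed; the table name and the bound are hypothetical, not taken from the paste:

```sql
-- Hypothetical example: cap the scan by primary key so the count stays fast
-- and gives the same answer on es2/es3 even while replication keeps adding rows.
SELECT COUNT(*)
FROM blobs_cluster24          -- hypothetical table name
WHERE blob_id <= 10000000;    -- fixed upper bound chosen in advance

-- MAX() on the primary key is instant, so it is a cheap first comparison
SELECT MAX(blob_id) FROM blobs_cluster24;
```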
[10:26:03] max(blob_id) is instant due to the primary key
[10:26:45] ok, thanks
[10:26:45] and reading 10 million cold records (without returning them) takes 5 minutes
[10:27:18] count(1) does not do much for innodb vs count(*), due to the clustered index
[10:28:02] it uses the PK/row (which is the same on innodb) for jumping between records
[10:28:27] yes, old habit to do count(1) :)
[10:30:03] it is ok
[10:30:28] I've seen many workmates do it, not sure if it is for oracle or another db
[10:30:50] I was just trying to be helpful :-)
[10:31:56] and you are!
[11:12:03] so on es2005 we have ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=16 (replicated to es2014) but we don't have it on es2006 and es2007
[11:14:23] yeah... we should revert that, but no time before the failover
[11:14:47] we will just give es2014 a lower weight
[11:15:00] ok
[11:15:37] it was a test when codfw was inactive to see if it would give us one extra month
[11:17:11] it only saved 41 GB
[11:22:43] practically nothing :)
[11:25:30] the other ops tend to not take me seriously when I say "there is only 100GB left on this machine", but I think filipo, alex and otto are the only ones that handle large amounts of data, too
[12:18:09] I am thinking of creating the link db2016 -> db1057 instead of to the master, to avoid running without TLS, but that would mean it should be the new master after the failover
[12:18:34] I need to think about it, taking a lunch break
[12:20:45] ok, I'll think about it too
[14:06:42] updated https://phabricator.wikimedia.org/T127330 with results and a link to the paste. All seems reasonably good and times are promising on the new servers
[14:10:49] do you mean that you are happy and ready to failover or that the tests are still running?
[14:11:04] if the first, feel free to do it
[14:12:12] I would like to do some proper checksums of all servers and all data at some point, but that is not a priority right now
[14:12:49] and yes, the times are nice :-)
[14:14:19] All the CRC32s that I've done match, the counts too, the schemas are the same; I can continue with other schemas, but I'm reasonably confident that we are good
[14:14:30] 100% agree
[14:15:52] I'll change my pending repool, adding them too with weight 1
[14:25:57] for background, here is why I am a bit paranoid when moving data around: https://phabricator.wikimedia.org/T26675
[14:27:12] interesting...
[14:30:55] * volans brb
[14:30:58] Re: repooling, the only important thing is to make a "new node" the local master so replication doesn't break when they get filled up in a few days
[14:42:47] yep, that was my next change, didn't want to do it all at once, but I can unify them :)
[16:16:18] * volans needs to go out for a few minutes, bbl
[16:58:44] * volans back
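A sketch of the kind of chunked CRC32-plus-count comparison mentioned above, run identically on source and target and compared by hand; the table name, columns and id range are assumptions:

```sql
-- Run the same query on both servers and compare the results.
-- BIT_XOR of per-row CRC32s is order-independent, so it works across replicas.
SELECT COUNT(*)                                            AS row_count,
       MIN(blob_id)                                        AS min_id,
       MAX(blob_id)                                        AS max_id,
       BIT_XOR(CRC32(CONCAT_WS('#', blob_id, blob_text)))  AS crc
FROM blobs_cluster24                    -- hypothetical table name
WHERE blob_id BETWEEN 1 AND 10000000;   -- one chunk at a time
```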
[18:10:21] I wanted to summarize the ones that do not depend on mediawiki deployment in a script
[18:10:51] and invite others to edit the same script
[18:11:05] agreed, and, if you agree, we could "prepare" the topology so that given a shard we have:
[18:11:15] master -> new_master -> slaves
[18:11:31] yes, well, on codfw there is nothing to prepare
[18:11:44] we only need to see the new masters on eqiad
[18:12:04] yes, on eqiad
[18:12:23] however, I think it will be easier to failover from the current masters, and then failover the eqiad clusters
[18:12:45] works too
[18:12:46] which means no TLS codfw -> eqiad for a few seconds
[18:13:20] the reason being that the circular replication requires going back to the original master
[18:13:41] otherwise we will break replication, executing the transactions an infinite number of times
[18:14:02] we could just not do circular replication
[18:14:19] yes, of course
[18:14:47] but then the topology gets stranger, takes more time, etc.
[18:15:09] the main problem being that mediawiki doesn't support multi-tier slaves
[18:15:43] so Master -> Secondary Master -> slaves doesn't really work
[18:15:58] ah true, sorry, forgot the auto-depooling system
[18:16:09] it is the chronology checker
[18:16:17] it is based on the binlog
[18:16:33] and the binlog from the master is not shared with the slaves in the 3rd tier
[18:16:49] we will fix that, either with pt-heartbeat or GTIDs
[18:16:52] but not in time for it
[18:16:59] clearly not
[18:17:02] :)
[18:17:03] (for next week)
[18:17:07] or 2 weeks
[18:17:16] we depend on the mediawiki devs
[19:04:05] should I depool the old es2001-es2009 too or do we wait until they really run out of space? what's the plan for those servers in their next life? :)
[19:05:24] decomm, probably
[19:05:31] which will need its own ticket :-)
[19:06:22] let's depool all the old ones already (but keep them replicating, at least until they fail)
[19:06:33] and close the ticket
[19:06:40] icinga?
[19:06:50] that is part of the decom process
[19:07:05] to avoid pages when they fail
[19:07:14] https://wikitech.wikimedia.org/wiki/Server_Lifecycle
[19:07:47] you can disable icinga if you want, too, but eventually they will be eliminated
[19:08:03] ok, just want to avoid unnecessary pages
[19:08:07] sure
[19:10:48] can you also check the es parts of https://gerrit.wikimedia.org/r/#/c/267659/8/wmf-config/db-codfw.php
[19:11:30] sure
[19:11:58] it could be a rebase artifact, or an existing issue, but as you are probably connected to it...
[19:13:03] I think it is an existing issue
[19:13:17] I hate having the masters in 2 files, that is error prone
[19:14:34] you need to rebase, right?
[19:15:41] yes, that is not on top of the latest config
[19:15:52] is it ok live?
[19:16:42] yes, my changes are already live (git and tendril tree)
[19:16:49] no, I know
[19:17:13] I mean if the db-codfw.php config has the same issue (pointing to the wrong master)
[19:17:25] the one currently active
[19:18:02] yes '10.64.16.187' => 0, # es1019, master
[19:18:40] the ip is wrong, the master is es1019
[19:18:52] ok, I will fix that
[19:18:58] it should be 10.64.48.116
[19:19:36] I see it now, compared to eqiad
[19:20:01] I am going to go blind one of these days
[19:20:08] probably some old copy/paste/merge between the 2 files?
[19:20:27] yeah, I failovered the es1* master several times
[19:20:42] it is way easier than the db s? hosts
[19:21:18] so, now that you are nearly finished, do you know what you want to work on next?
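As an illustration of the per-slave step that "prepare the topology" implies (not the actual runbook), a hedged sketch; the host name, replication user and binlog coordinates are placeholders:

```sql
-- Hypothetical example: repoint one slave to the new local master over TLS.
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST     = 'es2014.codfw.wmnet',   -- placeholder new local master
  MASTER_USER     = 'repl',                 -- placeholder replication user
  MASTER_PASSWORD = '********',
  MASTER_LOG_FILE = 'es2014-bin.000001',    -- placeholder coordinates
  MASTER_LOG_POS  = 4,
  MASTER_SSL      = 1;                      -- keep replication on TLS
START SLAVE;
SHOW SLAVE STATUS\G
```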
[19:21:49] I will need you of course, at least helping with the failover
[19:22:06] I need to finish the SSL stuff and I guess we'll have a lot of stuff for the failover
[19:22:15] yep
[19:22:34] how do you see the ssl thing, is it blocked or could it be done in time for the failover?
[19:23:21] BTW, I will be out on the 24th, 25th and 28th
[19:24:09] I'll speak with Guillame tomorrow morning and get the status of his patch, they need it too so we should hopefully get it tomorrow itself
[19:24:22] nice
[19:24:24] I'll give you an update tomorrow morning, I think it is doable
[19:24:53] if we cannot do the whole CA thing, at least have it prepared for when the eqiad masters failover
[19:25:22] as all other servers are easily restartable except the primary datacenter masters
[19:26:31] also, if you are already bored of setting up servers, I can take the next batch myself
[19:26:38] would you be ok with turning SSL off for local replicas (no cross-DC) for like a day?
[19:29:03] better not :)
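On the SSL point, a quick sketch (not from the chat) of how one might verify whether a replication link is actually configured for TLS before deciding anything about turning it off:

```sql
-- On a slave: Master_SSL_Allowed and Master_SSL_Cipher in the output
-- show whether the replication connection is set up to use TLS.
SHOW SLAVE STATUS\G

-- On any server: check that TLS support is enabled at all.
SHOW GLOBAL VARIABLES LIKE 'have_ssl';
```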