[08:14:17] https://gerrit.wikimedia.org/r/#/c/279694/1 up for review, it is a refactor of the previous one given a missing puppet module. If it looks ok I'll merge and test it extensively with the puppet compiler
[08:29:02] who +1'd the previous patch?
[08:31:00] the previous one is ok
[08:31:59] we do not need the double cert according to the plan
[08:39:11] if we go with the second option, the more complete one, we need it only for S2
[09:01:43] then we should add something like an ssl='double cert' option
[09:02:51] I believe that by not using the function, and separating the definitions into 2, on a regular run the second part would not be run, but the first would
[09:03:38] ok; whether the double CA will be needed by others is unclear at this stage, a lot of other clusters use ssl=on but might not need it to migrate to the new one, it depends case by case
[09:04:33] I'm not sure I get what you mean by separating the definitions into 2
[09:05:42] puppet has an exec, and a file which requires the exec if not present
[09:06:00] if it is present, the file will not be run, but the exec will
[09:07:18] got it; actually it would not be an issue given that if the file already exists with the right permissions the exec will not change them, but I'll add the onlyif
[09:17:50] ok, makes sense
[09:18:19] but only because it is a temporary thing
[09:18:38] make sure to ensure => absent at the end of the process
[09:19:05] we do not want to leave files hanging everywhere
[10:21:03] sure, I've updated the CR and also the document for the SSL plan, adding the masters for S3-S7 and the list of hosts for which I would like to run the compiler (maybe a subset if they are too many :) )
[10:33:54] I cannot look at it right now, but if you followed my concerns there, and you document that it is temporary, go on with it
[10:34:35] just make sure to apply it slowly when done on production, so as not to remove existing files, and to keep the appropriate permissions (not exposed to all users)
[10:36:15] yes, I added the "multiple-ca" accepted value for ssl (not yet used) and the creates for the exec
[10:36:41] as for expose_puppet_certs in general, it will add files, they don't conflict with the existing mysql certs
[10:45:00] ok then
[10:47:29] I would like you to give priority to this while I am away
[10:48:04] sure
[10:48:11] maybe try to reimage to jessie some of the masters-to-be in eqiad?
[10:48:51] I do not expect having much time for that, and it could be dangerous, so your call
[10:53:08] I can try; given that the masters-to-be are almost all masters for the codfw local masters, to attach codfw replication to another slave while reimaging I should use repl.pl with the --stop-siblings-in-sync option?
[10:54:03] yes, probably
[10:55:30] but it is a time-consuming process: backing up data, reinstalling, etc. And we want the servers in mint condition (buffers-wise) by the failover date, so I am not sure it can be done for all of them
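[Editor's note] A minimal Puppet sketch of the exec/file pattern and the creates guard discussed above (the [09:05:42]–[09:18:38] and [10:36:15] messages). Resource names, paths and certificate filenames are hypothetical illustrations, not the actual content of the CR:

```puppet
# Build the combined CA bundle only once; 'creates' makes the exec a
# no-op on later runs if the target file already exists.
exec { 'mariadb-combined-ca':
    command  => 'cat /etc/ssl/certs/old_CA.pem /etc/ssl/certs/new_CA.pem > /etc/mysql/combined_CA.pem',
    creates  => '/etc/mysql/combined_CA.pem',
    provider => shell,
}

# Manage ownership/permissions so the file is not readable by all users.
# Once the migration is finished, flip ensure to 'absent' so the
# temporary file does not linger on the hosts.
file { '/etc/mysql/combined_CA.pem':
    ensure  => present,
    owner   => 'mysql',
    group   => 'mysql',
    mode    => '0440',
    require => Exec['mariadb-combined-ca'],
}
```

The conversation also mentions an onlyif guard; creates is the simpler equivalent when the condition is just the existence of the generated file.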
[10:56:28] the one that affects more people would be s3, because of its 800 wikis
[10:56:41] I also don't like too much the idea of putting a just-reimaged server in as a master
[10:56:44] the others could be done with a regular failover (ro) after the fact
[10:56:59] yep, let's not force it
[10:57:21] so in that case I'll start with s3, which would be hard to do afterwards
[10:57:24] in any case, the new masters should be restarted for the ssl
[10:57:47] that should be easy and safe
[10:58:52] yes
[10:59:30] and that may not even need replica movements: just stop codfw, restart, start codfw
[10:59:44] well, codfw too, you get the idea
[10:59:48] it is on the doc :-)
[11:00:04] let me check the masters
[11:01:43] so, for s2 and s3, I would decom the oldest ones and fail over to on-warranty ones
[11:02:15] for the others, continue using one of the oldest ones with 64GB
[11:04:13] I know it is painful, but it is more painful to get lag everywhere despite semisync
[11:04:43] true, so for s3 we will decomm up to db1044?
[11:05:16] the idea would be to have only the 3 new servers
[11:05:31] maybe keeping some old ones around only for dump,vslow
[11:05:46] ok, given the big hardware diff that makes sense
[11:05:48] maybe others, as accessory
[11:05:58] if they can keep up
[11:05:59] but ideally, getting rid of the older ones
[11:06:19] well, for dump it is ok if they lag
[11:06:34] again, we would have more flexibility with a custom-made load balancer
[11:07:06] but the idea is to put the new ones in as the main ones this week
[11:07:23] ok, so can I pick db1074 for s2 and db1075 for s3 as masters-to-be?
[11:07:39] not for s2
[11:07:51] there are only 2 new servers on s2
[11:08:03] you always want the master to be the slowest one
[11:08:08] that is a hard rule
[11:08:23] otherwise you will have lag 100% of the time
[11:08:23] I thought there was a third not yet configured
[11:08:32] yes, labsdb1008 :-)
[11:08:45] so no :-)
[11:08:56] there will be, but those have not even been ordered yet
[11:09:18] the idea is to have a total of 21 servers (approx.) like those in the end
[11:09:31] then db1063 could fit, we will have db1067 and the 2 new ones as slaves
[11:09:51] but I considered it unwise to order those without proper testing
[11:10:05] after they are validated, we will finally order them
[11:10:38] (I'm assuming we will decomm db1054 and db1060)
[11:11:09] why?
[11:11:26] I mean keep them around for vslow/dump
[11:11:29] 21-47 are older, and out of warranty?
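[Editor's note] A minimal sketch of the "stop codfw, restart, start codfw" sequence mentioned above ([10:59:30]) for restarting a candidate master with the new SSL certificates. The host roles are assumptions drawn from the conversation (a codfw local master replicating from the eqiad candidate), not a documented runbook:

```sql
-- On the codfw local master, which replicates from the eqiad candidate:
STOP SLAVE;

-- On the eqiad candidate master, restart mysqld (outside SQL, e.g.
-- service mysql restart) so it loads the new SSL certificates.

-- Back on the codfw local master, once the eqiad host is up again:
START SLAVE;
SHOW SLAVE STATUS\G
-- Confirm Slave_IO_Running / Slave_SQL_Running are Yes and that
-- Seconds_Behind_Master drops back to 0.
```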
[11:11:44] AFAIK 54 and 60 too
[11:11:51] unless we renewed it
[11:11:59] it is not shown on racktables
[11:12:06] we will not get those replaced
[11:12:09] let me check on the dell site
[11:12:15] only the first 50
[11:13:35] the idea is replacing the first 50 with around 21 servers, each 5x more powerful
[11:15:31] I know you would like 100 of those, but I think this is already a good improvement
[11:16:11] the following ones will arrive later, but always count on what we have now
[11:16:30] lol, sounds pretty good
[11:20:54] so, in summary: for s3, cover almost all of the service with the new servers
[11:21:14] for s2, probably decom all servers <64
[11:21:44] so probably failing over to db1024, which used to be the old master, and decom the other old ones
[11:22:47] I am saying that because it is already reimaged with jessie
[11:22:57] (probably the only one)
[11:23:21] checking if there is any other jessie host; db1024 is 64GB too
[11:23:31] (note that the masters are not usually the bottleneck)
[11:23:42] and the slower it is, the less lag we will have :-)
[11:24:13] yes, the only one with jessie
[11:24:29] e.g. 1000 QPS, including priority reads and writes, vs 22K QPS
[11:24:53] before that, db1024 was a master non-stop for 3 years
[11:25:15] if we go for the reliable option, that is the way to go
[11:25:27] we can failover later, no issue
[11:25:39] ok, updating the doc
[11:26:14] the same would apply to the others: choose the least powerful server, with no incident history
[11:27:04] I know it sounds like a bad idea, but those are the sacrifices of heterogeneous hardware
[11:28:21] no, it makes sense
[11:28:29] I'll check for tickets on phab
[11:28:41] it is not just my thing, it is a generally accepted practice, see http://dba.stackexchange.com/a/18123/30545
[11:28:46] http://www.xaprb.com/blog/2007/01/20/how-to-make-mysql-replication-reliable/
[11:29:23] that the master needs to be less powerful I fully agree with
[11:29:46] eventually we will get rid of them, but the key thing is "eventually" :-)
[11:30:35] :)
[11:31:52] I am not actually disagreeing with your assessments, I am taking into account your limited time and my vacations, plus a deadline of 3 weeks
[11:32:29] I intend to order the new servers this week, which probably will mean having them set up by June
[16:42:53] hopefully the last one: https://gerrit.wikimedia.org/r/#/c/280234/ we really need to improve the puppet compiler to handle submodule changes
[23:02:31] the last patch on the mariadb submodule works, no errors from the puppet compiler, I've updated the CR, I'll check the results in detail on Thu