[08:14:17] https://gerrit.wikimedia.org/r/#/c/279694/1 up for review, it is a refactor of the previous one given a missing puppet module. If it looks ok I'll merge and test it extensively with the puppet compiler
[08:29:02] who +1'd the previous patch?
[08:31:00] the previous one is ok
[08:31:59] we do not need the double cert according to the plan
[08:39:11] if we go with the second option, the more complete one, we need it only for S2
[09:01:43] then we should add something like an ssl='double cert' option
[09:02:51] I believe that by not using the function, and separating the definitions into 2, on a regular run the second part would not be run, but the first would
[09:03:38] ok; whether the double CA will be needed by others is unclear at this stage, a lot of other clusters use ssl=on but might not need it to migrate to the new one, it depends case by case
[09:04:33] I'm not sure I get what you mean by separating the definitions into 2
[09:05:42] puppet has an exec, and a file which requires the exec if not present
[09:06:00] if it is present, the file will not be run, but the exec will
[09:07:18] got it; actually it would not be an issue given that if the file already exists with the right permissions the exec will not change them, but I'll add the onlyif
[09:17:50] ok, makes sense
[09:18:19] but only because it is a temporary thing
[09:18:38] make sure to ensure => absent at the end of the process
[09:19:05] we do not want to leave files hanging everywhere
[10:21:03] sure, I've updated the CR and also the document for the SSL plan, adding the masters for S3-S7 and the list of hosts for which I would like to run the compiler (maybe a subset if they are too many :) )
[10:33:54] I cannot look at it right now, but if you followed my concerns there, and you document that it is temporary, go on with it
[10:34:35] just make sure to apply it slowly when done on production, so as not to remove existing files, and to keep the appropriate permissions (not exposed to all users)
[10:36:15] yes, I added the "multiple-ca" accepted value for ssl (not yet used) and the creates for the exec
[10:36:41] as for expose_puppet_certs in general, it will add files, they don't conflict with the existing mysql certs
[10:45:00] ok then
[10:47:29] I would like you to give priority to this while I am away
[10:48:04] sure
[10:48:11] maybe try to reimage to jessie some of the masters-to-be in eqiad?
[10:48:51] I do not expect having much time for that, and it could be dangerous, so your call
[10:53:08] I can try; given that the masters-to-be are almost all masters for the codfw local masters, to attach codfw replication to another slave while reimaging I should use repl.pl with the --stop-siblings-in-sync option?
[10:54:03] yes, probably
[10:55:30] but it is a time-consuming process: backing up data, reinstalling, etc. And we want the servers in mint condition (buffers-wise) by the failover date, so I am not sure it can be done for all of them
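[Editor's note] A minimal Puppet sketch of the exec/file pattern and the creates guard discussed above (the [09:05:42]–[09:18:38] and [10:36:15] messages). Resource names, paths and certificate filenames are hypothetical illustrations, not the actual content of the CR:

```puppet
# Build the combined CA bundle only once; 'creates' makes the exec a
# no-op on later runs if the target file already exists.
exec { 'mariadb-combined-ca':
    command  => 'cat /etc/ssl/certs/old_CA.pem /etc/ssl/certs/new_CA.pem > /etc/mysql/combined_CA.pem',
    creates  => '/etc/mysql/combined_CA.pem',
    provider => shell,
}

# Manage ownership/permissions so the file is not readable by all users.
# Once the migration is finished, flip ensure to 'absent' so the
# temporary file does not linger on the hosts.
file { '/etc/mysql/combined_CA.pem':
    ensure  => present,
    owner   => 'mysql',
    group   => 'mysql',
    mode    => '0440',
    require => Exec['mariadb-combined-ca'],
}
```

The conversation also mentions an onlyif guard; creates is the simpler equivalent when the condition is just the existence of the generated file.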
[10:56:28] the one that affects more people would be s3, because of its 800 wikis
[10:56:41] I also don't like too much the idea of putting a just-reimaged server in as a master
[10:56:44] the others could be done with a regular failover (ro) after the fact
[10:56:59] yep, let's not force it
[10:57:21] so in that case I'll start with s3, which would be hard to do afterwards
[10:57:24] in any case, the new masters should be restarted for the ssl
[10:57:47] that should be easy and safe
[10:58:52] yes
[10:59:30] and that may not even need replica movements: just stop codfw, restart, start codfw
[10:59:44] well, codfw too, you get the idea
[10:59:48] it is on the doc :-)
[11:00:04] let me check the masters
[11:01:43] so, for s2 and s3, I would decom the oldest ones and fail over to on-warranty ones
[11:02:15] for the others, continue using one of the oldest ones with 64GB
[11:04:13] I know it is painful, but it is more painful to get lag everywhere despite semisync
[11:04:43] true, so for s3 we will decomm up to db1044?
[11:05:16] the idea would be to have only the 3 new servers
[11:05:31] maybe keeping some old ones around only for dump,vslow
[11:05:46] ok, given the big hardware diff that makes sense
[11:05:48] maybe others, as accessory
[11:05:58] if they can keep up
[11:05:59] but ideally, getting rid of the older ones
[11:06:19] well, for dump it is ok if they lag
[11:06:34] again, we would have more flexibility with a custom-made load balancer
[11:07:06] but the idea is to put the new ones in as the main ones this week
[11:07:23] ok, so can I pick db1074 for s2 and db1075 for s3 as masters-to-be?
[11:07:39] not for s2
[11:07:51] there are only 2 new servers on s2
[11:08:03] you always want the master to be the slowest one
[11:08:08] that is a hard rule
[11:08:23] otherwise you will have lag 100% of the time
[11:08:23] I thought there was a third not yet configured
[11:08:32] yes, labsdb1008 :-)
[11:08:45] so no :-)
[11:08:56] there will be, but those have not even been ordered yet
[11:09:18] the idea is to have a total of 21 servers (approx.) like those in the end
[11:09:31] then db1063 could fit, we will have db1067 and the 2 new ones as slaves
[11:09:51] but I considered it unwise to order those without proper testing
[11:10:05] after they are validated, we will finally order them
[11:10:38] (I'm assuming we will decomm db1054 and db1060)
[11:11:09] why?
[11:11:26] I mean keep them around for vslow/dump
[11:11:29] 21-47 are older, and out of warranty?
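[Editor's note] A minimal sketch of the "stop codfw, restart, start codfw" sequence mentioned above ([10:59:30]) for restarting a candidate master with the new SSL certificates. The host roles are assumptions drawn from the conversation (a codfw local master replicating from the eqiad candidate), not a documented runbook:

```sql
-- On the codfw local master, which replicates from the eqiad candidate:
STOP SLAVE;

-- On the eqiad candidate master, restart mysqld (outside SQL, e.g.
-- service mysql restart) so it loads the new SSL certificates.

-- Back on the codfw local master, once the eqiad host is up again:
START SLAVE;
SHOW SLAVE STATUS\G
-- Confirm Slave_IO_Running / Slave_SQL_Running are Yes and that
-- Seconds_Behind_Master drops back to 0.
```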
[11:11:44] AFAIK 54 and 60 too
[11:11:51] unless we renewed it
[11:11:59] it is not shown on racktables
[11:12:06] we will not get those replaced
[11:12:09] let me check on the dell site
[11:12:15] only the first 50
[11:13:35] the idea is replacing the first 50 with around 21 servers, each 5x more powerful
[11:15:31] I know you would like 100 of those, but I think this is already a good improvement
[11:16:11] the following ones will arrive later, but always count on what we have now
[11:16:30] lol, sounds pretty good
[11:20:54] so, in summary: for s3, cover almost all of the service with the new servers
[11:21:14] for s2, probably decom all servers <64
[11:21:44] so probably failing over to db1024, which used to be the old master, and decom the other old ones
[11:22:47] I am saying that because it is already reimaged with jessie
[11:22:57] (probably the only one)
[11:23:21] checking if there is any other jessie host; db1024 is 64GB too
[11:23:31] (note that the masters are not usually the bottleneck)
[11:23:42] and the slower it is, the less lag we will have :-)
[11:24:13] yes, the only one with jessie
[11:24:29] e.g. 1000 QPS, including priority reads and writes, vs 22K QPS
[11:24:53] before that, db1024 was a master non-stop for 3 years
[11:25:15] if we go for the reliable option, that is the way to go
[11:25:27] we can failover later, no issue
[11:25:39] ok, updating the doc
[11:26:14] the same would apply to the others: choose the least powerful server, with no incident history
[11:27:04] I know it sounds like a bad idea, but those are the sacrifices of heterogeneous hardware
[11:28:21] no, it makes sense
[11:28:29] I'll check for tickets on phab
[11:28:41] it is not just my thing, it is a generally accepted practice, see http://dba.stackexchange.com/a/18123/30545
[11:28:46] http://www.xaprb.com/blog/2007/01/20/how-to-make-mysql-replication-reliable/
[11:29:23] that the master needs to be less powerful I fully agree with
[11:29:46] eventually we will get rid of them, but the key thing is "eventually" :-)
[11:30:35] :)
[11:31:52] I am not actually disagreeing with your assessments, I am taking into account your limited time and my vacations, plus a deadline of 3 weeks
[11:32:29] I intend to order the new servers this week, which probably will mean having them set up by June
[16:42:53] hopefully the last one: https://gerrit.wikimedia.org/r/#/c/280234/ we really need to improve the puppet compiler to handle submodule changes
[23:02:31] the last patch on the mariadb submodule works, no errors from the puppet compiler, I've updated the CR, I'll check the results in detail on Thu