[08:22:54] morning :)
[08:35:30] morning
[08:44:27] :)
[08:46:02] I was looking at es2/es3 on codfw and saw that the master is in eqiad, am I missing something?
[08:46:46] mmm
[08:46:50] let me see
[08:50:54] yes, the idea, from others, being that whenever there will be multi-datacenter reads, writes are sent to the original masters
[08:51:29] I do not agree with it, because we should never send cross-datacenter queries
[08:51:51] but right now it doesn't matter, as writes do not happen on codfw
[08:51:58] and probably never will
[08:52:49] at most, there should be a masters.php with only the 1 master list
[08:53:08] but for now, take it as "this datacenter is not the primary one"
[08:53:16] ok
[08:53:21] es2010 is still lost right?
[08:53:39] yep, tried it yesterday and could not even connect to the mgmt
[08:54:01] great, so I guess we can do 3 at the same time, to not leave the cluster empty for es3
[08:54:31] maybe a connect and disconnect will work, but it doesn't matter, we only hold binary logs for 1 week
[08:54:41] and anyway data could be corrupted already
[08:55:13] so I would do 2005, 2007 and 2009
[08:55:51] and maybe we can use 2008 twice on the next iteration
[08:56:16] e.g. 2006, 2008 and 2008
[08:56:36] we should also depool es2010 as malfunctioning
[08:56:43] I was thinking the same, sending the depool right now, then we can sync on the commands
[08:57:04] es2010 I depooled already when it broke, you mean permanently deleting it?
[08:57:16] not yet
[08:57:30] you never know if new hardware is broken, etc.
[08:58:32] oh, I can see it now, I had an old version
[08:58:33] I meant only for es2010
[08:58:43] yes, but still
[08:58:51] ok, leaving it there
[08:58:59] doesn't hurt :)
[08:59:21] the only "complex" thing here will be the master switchover
[08:59:28] local master, I mean
[08:59:59] but for that, circular replication works perfectly (as it is a PK-only, append-only store)
[09:02:39] as I said, for codfw do not worry about performance, only about service availability and replication ongoing on 1 server
[09:03:15] I was thinking of putting es2016 (new of 2007) as a slave of es2014 (new of 2005), which will be a slave of the current es2006, then moving the master of es2014 to be es1015 and depooling es2006
[09:03:38] (dunno if I can explain those things in words :) )
[09:04:24] for the actual new topology, we do not care
[09:04:51] just making sure they start replicating from the old master as soon as they are ready
[09:04:59] just in case
[09:05:39] before you stop mysql on each old server - stop replication and keep the SHOW SLAVE STATUS
[09:06:00] old master being es2006 local or es1015 remote?
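A minimal sketch of that last step as discussed above - freeze replication and record the coordinates on each old host before shutting mysqld down for the copy:

    STOP SLAVE;
    SHOW SLAVE STATUS\G
    -- save this output: Relay_Master_Log_File and Exec_Master_Log_Pos are the
    -- coordinates the copied host will need; only then stop mysqld and start the copy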
[09:06:03] yes of course
[09:06:30] old "local" master
[09:06:54] ok, so we change the master tomorrow before migrating it
[09:06:59] we could do them from the real master, but probably that will be slower
[09:07:22] there are many ways to do it, most are right :-)
[09:07:27] lol
[09:07:41] just avoiding the wrong ones
[09:07:50] tendril updates the topology automatically
[09:07:58] so that is a good health check
[09:08:10] otherwise, as you said, it becomes complex to explain
[09:08:26] that's good
[09:08:53] add a couple of new slaves, which is the easiest to start, then we see
[09:09:20] ok
[09:09:40] to be fair, it will take some hours, so we will have time
[09:11:17] take into account the deployment times to see if we need to work around that
[09:13:16] nothing until 5pm
[09:13:21] according to the wiki
[09:13:32] great
[09:13:39] should be enough time
[09:13:48] remember icinga
[09:14:07] (for the "old")
[09:15:20] I'll review the timeline with you in a sec
[09:21:50] depool [OLD] | wait for traffic | icinga downtime [OLD] | stop slave [OLD] | save show slave status [OLD] | mysql stop [OLD] | copy data [OLD->NEW]
[09:22:00] [OLD first] mysql start skip-slave | check all ok | start slave | monitor replica | repool once ready (low weight if applicable)
[09:22:17] remove icinga downtime
[09:22:18] [OLD first] mysql start skip-slave | check all ok | start slave | monitor replica | repool once ready (low weight if applicable)
[09:22:24] damn paste...
[09:22:31] then same last line for NEW
[09:25:52] if mysql is running on the new ones, you may need to stop the new ones
[09:26:03] as we reuse the port
[09:26:29] no, it was never started, it's without any data
[09:26:51] the other thing is, if I remember correctly, you may need to do RESET SLAVE ALL (hence saving the slave parameters)
[09:27:14] because the new slaves are started with a different relay log name, and will not work directly
[09:27:25] this is on purpose
[09:27:48] "yes, I really want to start a new slave on a different host"
[09:28:11] true (checking the docs for confirmation)
[09:28:43] make sure you change master with SSL enabled and the right coordinates (exec_master)
[09:29:30] we will need to add the new machines to tendril manually, too
[09:29:52] (cannot do know because it requires data changes, which we are about to overwrite)
[09:29:55] *now
[09:30:04] this is only for the NEW ones
[09:30:05] ok
[09:30:18] if the part on OLD is ok, I'll start the copy
[09:35:16] yep, please do
[09:36:02] the only way to break this is with replication - because if we start from the wrong position, we will have to copy over
[09:36:16] yes I know
[09:38:40] the old one, you can start it as soon as the copy is over
[09:38:54] that was my plan
[09:39:29] and if you are ok with always starting with the slave stopped, we can make it the default
[09:42:30] logging is ok, too - not every step, just "cloning A -> B, C -> D"
[09:42:51] ok
[09:44:01] I know it seems like a lot of steps, but it is mere routine later, and a background job for me most of the time
[09:44:04] :-)
[09:44:54] and you should be happy, new machines mean better performance and usually fewer problems for us
[09:48:32] yeah!
[09:48:47] 2005->2014 started, doing next
[09:49:23] (I'll do a more detailed "log" here for reference)
[09:50:03] that is ok
[09:51:05] that was the idea of this channel
[09:55:05] es2007->es2016 started, doing next
[10:00:00] how fast is it going? do you have an ETA?
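A hedged sketch of the NEW-host bring-up described in the timeline above, using the coordinates saved from the OLD host's SHOW SLAVE STATUS; the replication account, log file and position below are placeholders, and extra MASTER_SSL_* options (CA path, etc.) may be needed depending on the cert setup:

    RESET SLAVE ALL;                        -- relay logs were named after the old host, so start clean
    CHANGE MASTER TO
      MASTER_HOST='es2006.codfw.wmnet',     -- the current local master discussed above
      MASTER_USER='repl',                   -- placeholder replication account
      MASTER_PASSWORD='********',
      MASTER_LOG_FILE='es2006-bin.001234',  -- = Relay_Master_Log_File from the saved output
      MASTER_LOG_POS=987654321,             -- = Exec_Master_Log_Pos from the saved output
      MASTER_SSL=1;
    START SLAVE;
    SHOW SLAVE STATUS\G  -- both threads Yes, no errors, lag shrinking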
[10:00:10] es2009->es2018 started, monitoring
[10:02:21] es servers also recover fast after replication lag because of their append-only nature
[10:04:48] the old ones have only 8 cores, so I guess the pigz is slower than the es2001 I copied last time with 16 cores
[10:05:00] * volans wondering if the compression is helping in this case or not
[10:05:03] I am going to profile a couple of s6 eqiad servers: https://wikitech.wikimedia.org/wiki/MariaDB/query_performance
[10:05:26] ok
[10:05:39] probably very little in this case because the content is already compressed
[10:06:46] I remember actually compressing db2005 again with innodb compression
[10:07:07] and you can see that there has been very little difference (just a few GB)
[10:07:13] I'm seeing ~130MB/s on the sender uncompressed and ~108MB/s on the receiver before decompressing :(
[10:07:22] that is a gain, still
[10:07:30] "CPU" is free
[10:07:38] in this case yes
[10:07:49] for main servers, we get, however, a 5x difference
[10:07:59] 1TB -> 200GB
[10:08:02] that's a good improvement!
[10:08:10] main I mean db*
[10:08:28] so those are things that you could document! ;-)
[10:09:29] while copying, check the ssl ticket if you want, or some reviews I sent your way
[10:09:50] yes I have some gerrit to check
[10:10:07] I hacked pt-heartbeat a bit to add a shard column
[10:10:37] hopefully that will make it: a) easier for our checks b) usable for mediawiki production
[10:10:40] I saw that, is there a way in gerrit to get the diff between the original one and yours? I guess I can download the original and do a diff :)
[10:10:49] I have the diff
[10:11:32] check pt-heartbeat-patch in my home on terbium
[10:11:51] but I do not intend to maintain that, I do not want to stop using the official package
[10:12:08] ok
[10:12:30] just for our specific use case, with shards
[10:12:51] use regular percona toolkit for anything else
[10:12:59] *using
[10:13:30] (for example, I have not changed the --check or --monitor, just the update)
[10:14:39] makes sense
[10:15:42] I would love to say "here is our fork of mariadb, and our fork of heartbeat, and our fork of pt-table-checksum" (the last one is broken for us)
[10:15:52] but we cannot maintain all of those
[10:16:00] I sent patches to upstream
[10:16:16] makes sense, patch looks ok and minimal
[10:16:30] but they usually get ignored, or are non-useful for them
[10:16:36] too bad
[10:17:43] we have some very specific setups, like mariadb10, or binary encoding, or lots of tables
[10:17:54] that many tools do not support
[10:26:59] ETA looks on the order of 6.5h... I hoped it would be a bit quicker
[10:30:06] it could be the 3 servers competing for bandwidth; or competing with elastic, which was also doing some maintenance on codfw
[10:30:28] check ganglia for the network stats
[10:32:06] I saw the link from mark yesterday but I cannot login either
[10:32:42] ganglia.wikimedia.org has it too
[10:32:59] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=bytes_in&s=by+name&c=Miscellaneous%2520codfw&tab=m&vn=&hide-hf=false
[10:33:20] I checked with para.void on friday and he agreed there was no problem at all, even between rows
[10:33:33] being on misc instead of mysql is probably my fault
[10:33:35] and it couldn't possibly go quicker than their network link, GigE, so ~100 MB/s ignoring compression :)
[10:33:53] so not sure why you'd think it would go quicker?
[10:33:58] so compression helps! A little
[10:34:21] of course it cannot go faster than 1Gbps, I thought compression would have done a bit more...
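A back-of-envelope on those numbers, assuming standard Ethernet/TCP overhead on a GigE link:

    1 Gbit/s ≈ 125 MB/s raw ≈ ~115 MB/s of usable TCP payload
    observed on the wire (compressed):  ~108 MB/s
    logical data shipped:               ~130-140 MB/s
    effective gain from pigz:           ~130 / ~108 ≈ 1.2x, modest because the ES content is already compressed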
[10:34:53] just that :)
[10:35:11] we are going ~140MB/s before compressing
[10:35:31] we should consider 10 Gbps perhaps for the next batch
[10:35:37] even though in normal operation that isn't very necessary
[10:36:07] the only blocker for those is switches, those cost a fortune
[10:36:32] but I agree, this is only done once in a server's lifetime
[10:36:34] why do you think we don't have those?
[10:36:36] probably not necessary, maybe just with row-based replication and parallel replication it might help at some point
[10:36:54] they don't cost a fortune :)
[10:37:37] well, because I supposed they were the reason we do not go dual card!
[10:37:47] I was wrong
[10:37:55] dual card?
[10:38:09] 2 active interfaces at the same time
[10:38:12] I guess bonding
[10:38:18] what does that have to do with 10G?
[10:38:23] that's something else
[10:38:53] that's usually redundancy, but you could use bonding for aggregation too in theory (not that I would suggest it)
[10:39:07] doesn't help for single-stream transfers, like this probably is
[10:39:49] yes, I was correlating it to "network equipment is expensive"
[10:40:30] not compared to DB servers ;)
[10:40:35] :-)
[10:41:11] just while we are on topic... do we have double power supplies?
[10:41:51] yes
[10:41:52] the ones that I bought have them
[10:42:05] all
[10:42:10] great
[10:42:57] and jynus, for the double nic/bonding, usually the problem is that at scale it makes the network a real mess because you need to scale switches in a pyramidal way
[10:43:46] to be fair, we already plan for "all servers in a row fail"
[10:44:04] exactly
[10:44:07] full server redundancy
[10:45:48] that is why you sometimes see idle dbs without proper load - they are slower, so we try to not use them as much, but they are there to provide redundancy
[10:46:26] hopefully soon we will have full datacenter redundancy, too
[10:49:12] working on that :)
[10:49:21] exactly!
[10:49:37] rob was saying we will get new servers today?
[10:49:49] yes, but chris is out
[10:49:57] so not until next week
[10:50:02] iad or dfw?
[10:50:12] labsdb has priority, he said he will set that up first
[10:50:27] that is eqiad, but not production
[10:50:47] to be fair, I do not know how to set that up
[10:50:53] lol
[10:50:54] I know, I mean
[10:50:58] that
[10:51:12] I do not have a plan to do it both fast and non-impacting
[10:51:30] so we will improvise something
[10:51:47] if you want we can talk about it, I don't know the issues there
[10:52:02] the explanation is that labsdb doesn't contain the same data as production
[10:52:10] it is filtered
[10:52:29] so we can filter it all again, which will take a lot of time
[10:52:49] or we can clone the existing servers, which will impact users
[10:53:37] how many existing ones do we have?
[10:53:45] we cannot depool one?
[10:53:50] probably cloning a dbstore and purging in parallel is the way to go
[10:53:57] it was 3
[10:54:13] now 2, but it is overloaded now
[10:54:27] new servers will fix that
[10:54:51] so with 1 it will not survive, even in the lower-load period? (dunno, weekends, US night)
[10:54:54] but I would like to avoid touching the existing ones as much as possible
[10:55:08] no, precisely weekends are when it is more active
[10:55:16] it is mostly used by contributors
[10:55:33] so Monday night US is maybe lower
[10:55:37] and the ha setup is not precisely great (I want to change that)
[10:56:07] as I said, I would prefer to clone it from production, then filter
[10:56:10] I don't know the filter well, it might be possible to clone a production one and apply the filters locally
[10:56:18] because the existing ones have crashed several times
[10:56:21] heavy but not impactful
[10:56:43] the cloning should not take much time, it's the filtering that I am worried about
[10:57:16] many reasons to shoot yourself in the foot
[10:57:50] I think we have to do it anyway because the current state of the other servers is not ideal
[10:57:55] if not applied correctly yes, it's dangerous
[10:58:07] there are already scripts for filtering, I just do not trust them 100%
[10:58:45] and all that has to run while replication catches up
[10:58:57] yes, those are on the intermediate server from which they replicate, and you want to apply them also locally?
[10:59:24] just triggers or more custom stuff?
[10:59:43] so there are 3 filters
[11:00:04] replication filters, triggers and views
[11:00:09] views are automatic
[11:00:35] for replication filters we just need to drop all databases and tables not replicated
[11:01:03] and for triggers, we do not need the triggers locally, just apply the "post trigger scripts"
[11:01:28] so mostly everything is automatic already, I just have an irrational fear that "something could go wrong"
[11:01:50] comprehensible
[11:01:57] I think if we allow access to some trusted users first
[11:02:02] that may help
[11:02:35] "hey, we have this new server that is 10x faster, can you help us test it?"
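A hedged sketch of the "drop everything the replication filters exclude" part of that plan - enumerate first, review, drop later; the schema names below are purely illustrative and the real allow-list would come from the existing filtering scripts:

    -- list databases present locally that labs should not carry
    SELECT schema_name
      FROM information_schema.schemata
     WHERE schema_name NOT IN ('information_schema', 'performance_schema', 'mysql', 'heartbeat')
       AND schema_name NOT IN ('enwiki', 'wikidatawiki' /* ...allow-list from the filter scripts... */);
    -- after review, each database that must not exist on labsdb gets a
    -- DROP DATABASE <name>;  (and the same idea per table for filtered tables)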
[11:03:59] the others are eqiad
[11:04:11] that's a good idea
[11:04:31] we have a couple of months (for a change) to set them up
[11:05:07] they will go to s2 and s3 to decom old servers and provide more disk space and better specs
[11:05:27] next quarter is when servers will start coming fast :-)
[11:06:12] cool
[11:06:20] the main focus for this should be the failover
[11:06:20] * volans brb
[11:11:06] (when you come back) I have enabled the slow log for 1/20 queries on two servers
[11:11:13] * volans back
[11:11:19] saw the log
[11:11:23] that should not affect performance much
[11:11:40] but I must keep an eye on the log size
[11:11:49] I log to a different partition just in case
[11:12:19] and 1 day will give us 5-10 GB of data to analyze
[11:12:31] as long as it's not /srv or /, it should be ok
[11:12:33] :)
[11:12:44] it is /, actually :-)
[11:13:21] tmp doesn't have a separate partition, but mysql's tmp is in /a or /srv
[11:13:41] saw that on the others
[11:13:54] so "it should not affect mysql", the quotes are why I log it
[11:14:28] no reason, because engines like innodb or toku default to using the sqldata dir anyway
[11:14:47] and no temp space is as broken as no data space for mysql
[11:15:46] all of this is for https://phabricator.wikimedia.org/T126802
[11:16:12] and our own record, too, until we have performance schema on all hosts
[11:16:58] ok
[11:17:24] check the ssl thing if you can, or any other task, while you wait for the copy
[11:17:41] I will focus on heartbeat
[11:17:44] ok
[12:02:46] we have issues from time to time with tokudb
[12:02:58] recreating the table usually fixes them
[12:03:08] Error 'Incorrect key file for table 'user_properties'; try to repair it' on query.
[12:03:25] as easy as ALTER TABLE user_properties ENGINE=InnoDB;
[12:03:29] START SLAVE;
[12:03:55] an upgrade may fix that, but requires labs downtime
[12:05:06] engine InnoDB?
[12:05:32] yeah, if I keep them on toku, it will happen again
[12:05:53] https://phabricator.wikimedia.org/T109069
[12:06:20] ah ok :)
[12:06:33] it seems to be a race condition of tables with no primary key and that version
[12:06:50] it stopped happening when I upgraded the other hosts
[12:07:06] in any case, no toku in production
[12:08:24] :)
[12:09:38] upstream devs like to say "we fixed this on MariaDB 15.6"
[12:09:56] but that doesn't work for us, we cannot just upgrade all machines at once
[12:11:06] of course not
[12:12:27] especially when they are trying to push hard the buggy 10.1, so I sometimes go to troll them at #mariadb
[12:55:06] all copies running fine (~1.4TB done), going for lunch
[13:48:11] WIP usually means "I am not ready for reviews"
[13:48:39] in particular, that patch has (at least) a syntax problem
[13:50:23] (not done on purpose, I just sent what I have with a [WIP], without proper testing)
[13:50:50] is it possible that there was no WIP when I saw it? I thought I checked for that
[13:51:08] no, it was there, my bad
[13:54:13] for the SSL, gehel is sending a CR with the exposure of the puppet cert separated from his other changes
[13:54:16] the compiler is the best way to check those https://puppet-compiler.wmflabs.org/1912/db1009.eqiad.wmnet/
[13:54:44] shouldn't it fail jenkins too?
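Going back to the slow-log sampling mentioned earlier (1 in 20 queries): a hedged sketch of the kind of settings involved, assuming a build that supports log_slow_rate_limit (Percona-patched or newer MariaDB); the file path and the long_query_time value are illustrative only:

    SET GLOBAL slow_query_log_file = '/srv/tmp/slow-sample.log';  -- placeholder path
    SET GLOBAL long_query_time     = 0;   -- sample every query, not only slow ones (assumption)
    SET GLOBAL log_slow_rate_limit = 20;  -- keep roughly 1 in 20
    SET GLOBAL slow_query_log      = 1;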
[13:55:00] oh, I compiled it after an amend
[13:55:17] but do not fully trust the compiler
[13:55:21] ok
[13:55:41] it "compiles", so good for logic, but it does not run the code
[13:56:44] it would not have mattered much here, just a "failed puppet run"
[13:56:58] but the syntax error should have been caught
[13:57:08] yes, probably
[13:57:29] I didn't do it because it was still "WIP"
[13:57:58] I thought that it would run automatically with jenkins
[13:58:28] volans, no, because there is no good way to know which hosts it will affect and which it will not
[13:58:54] e.g. if there is a syntax error, it would apply to all misc hosts, if not, only to those with the if
[13:59:20] someone proposed to do something like a line "puppet-compiler: X Y Z"
[13:59:29] but it has to be programmed :-)
[14:00:03] as usual, do not think what ops can do for you, but what *you* can do for ops :-D
[14:00:16] that's clear, my thought was that jenkins was already able to detect a syntax error like the missing colon
[14:00:34] was it?
[14:00:39] nope... it passed
[14:01:20] I think it only does puppet lint checks
[14:01:25] maybe because class { 'mariadb::heartbeat' would be checked only at runtime
[14:01:41] yes, that is a limitation
[14:01:50] also for the puppet compiler, probably
[14:01:52] ok, good to know
[14:02:48] I'll look at the other 2 CRs then
[14:02:59] what was the SSL thing?
[14:03:45] if I need to know; if not, continue doing what you are doing
[14:04:34] gehel has sent a CR to copy puppet certs from the puppet dir to wherever you want, if merged we could use it to copy them to /etc/mysql/ssl
[14:04:42] and use the puppet certs for the replica
[14:06:43] before it was part of a CR with all the other stuff for the ElasticSearch part; after talking, he sent it separately
[14:06:58] oh great
[14:07:13] https://gerrit.wikimedia.org/r/#/c/274382/1/modules/sslcert/manifests/expose_puppet_certs.pp
[14:07:15] the only thing is, the application has to be planned
[14:07:52] yeah, it will not be that easy for us, I probably need to add something to puppet so that we can do an opt-in server by server if a mysql restart is needed
[14:07:53] as it will be a different CA, replication will break on those hosts where the previous cert has been applied
[14:07:57] and I want to check the CA too
[14:08:22] if I can tell mysql to trust multiple CAs
[14:08:25] the good and the bad news is that the files can be moved at any time
[14:08:35] because it only reloads them at start time
[14:08:36] true, until a mysql restart it will not see them
[14:08:42] but it only reloads them at start time
[14:08:49] which complicates things
[14:09:00] there is no command to force a reload, right?
[14:09:05] however - the main thing is having it prepared for the failover
[14:09:08] nope
[14:09:21] tested it on several hosts
[14:09:35] and when there is no traffic, "mass restart" the cluster
[14:09:44] failover the pending masters, etc
[14:10:08] so check what is needed for the mysql side
[14:10:21] (the static part), copy, permissions, etc.
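A few checks relevant to the TLS points above - what the server loaded at startup (new files under /etc/mysql/ssl will not be picked up until a restart) and what replication is actually using; these are standard status/variable queries, nothing WMF-specific:

    SHOW GLOBAL VARIABLES LIKE 'ssl%';  -- ssl_ca / ssl_cert / ssl_key paths the server read at startup
    SHOW STATUS LIKE 'Ssl_cipher';      -- non-empty when the current connection negotiated TLS
    SHOW SLAVE STATUS\G                 -- Master_SSL_Allowed / Master_SSL_CA_File show what replication uses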
[14:10:30] and we can set up a plan later
[14:10:35] nobody cares at mysql: https://bugs.mysql.com/bug.php?id=75404
[14:10:53] yeah
[14:11:23] create an account and help me with the "Affects me" :-)
[14:12:41] to be fair, not many people use TLS, either it is internal-only traffic or it is tunneled with ssh/VPN/etc
[14:13:11] it was worse before, when the default binary did not even work with SSL
[14:13:25] I had my cup of fun with that :-)
[14:14:41] oh yeah, they had the other ssl implementation, not openssl
[14:15:16] not only that, but there was a bug that made mariadb (and mysql) incompatible with most ciphers
[14:15:52] try to see what would be needed to help with the SSL thing, maybe prepare a patch
[14:16:20] and if you get blocked, I can suggest even more tickets
[14:17:33] ok, I'll assume that gehel's patch will be merged
[14:19:05] sure, you can cherry-pick it to your local repo and build on top of that
[14:19:53] and when it comes to testing... do we have a host for tests with the wmf-maria package?
[14:20:28] we don't, but there are several things that you can do, depending on what you want
[14:21:01] one, a labs virtual machine
[14:21:17] I can get the internal package there?
[14:21:43] you can do it everywhere, AFAIK, carbon is public
[14:22:18] two, for a more production-like environment, you can depool a production machine and test it there
[14:22:46] ok
[14:23:16] bonus points if it is a precise or trusty machine and you reinstall at the same time :-)
[14:23:28] lol
[14:23:40] but I do not expect you to do that every time
[14:24:04] there are some mysql hosts that are less important - passive slaves
[14:24:37] e.g. I am testing the heartbeat thing on the m5 master, knowing that I could affect m5-slave, but that currently has no traffic
[14:24:53] (and, BTW, I reimaged it to jessie at the same time)
[14:24:57] :-P
[14:25:02] bonus point for you
[14:26:23] I do not like having idle machines, I consider it a waste of resources
[14:26:54] but there are many machines that, if they failed, we could live without
[14:29:22] I have wmf-mariadb10 and puppet installed on my work machine too, for easy testing
[14:29:54] remember that all we do is public, so anything that is not data, everybody can install it
[14:30:23] I've got VMs too, installing maria now
[14:30:35] for puppet I need to see how to do it
[14:30:36] :-)
[14:33:33] (for the rest of the people, the data is also public, but not the raw db backups)
[15:00:15] by any chance do you remember who creates the mysql system user?
[15:07:27] mmm
[15:07:35] probably the mariadb role
[15:07:37] let me check
[15:08:15] user { 'mysql':
[15:08:20] mariadb config
[15:08:32] config.pp
[15:08:56] found, thanks
[15:09:16] our debs don't create it
[15:09:17] btw
[15:10:50] yes, that is like a feature
[15:11:08] no post-install execution
[15:11:23] ok
[15:11:42] others are a bug - like the dependencies
[15:12:11] yeah :) mysql-common and libmysqlclient
[15:12:14] libaio1 libjemalloc1
[15:12:27] no, those should not be installed
[15:12:33] as they are provided
[15:12:54] true, all in puppet
[15:13:17] * volans needs to get used to doing everything through puppet, not even debs :)
[15:13:23] no
[15:13:30] that is a bug on my package
[15:13:39] that I fixed with puppet
[15:13:45] it should be on the package
[15:13:48] I mean mysql-common, which creates the /etc/mysql
[15:14:11] but mysql-common should not be installed
[15:14:31] it is...
[15:14:43] like on es2018, brand new
[15:14:47] I think it gets installed by some client monitoring tool
[15:15:07] like the perl-mysql, or percona-toolkit or something
[15:15:20] it is not necessary for the server
[15:15:49] yes, percona-xtrabackup
[15:16:29] but we are not going to maintain a percona fork just to change the dependencies
[15:16:56] doesn't make a lot of sense, agree
[15:16:59] after all, client and server are independent
[15:18:50] mysql-common handles the "/etc/mysql/my.cnf" file... not really client only :-P
[15:19:11] yeah, we do not use that
[15:19:22] we put a link there
[15:19:59] ah true, our init.d is different
[15:20:40] anything that debian packages do automatically, we do not want it
[16:48:20] what's the status of https://phabricator.wikimedia.org/T105135 ?
[16:48:25] I think we have a mariadb 10 master now, right?
[16:49:36] 1 / 7 (or 11)
[16:50:10] none of the extra tasks have been done because of blockers (row-based replication, parallel replication)
[16:51:03] I think "prepare" is a bad word, all preparations have been done
[16:51:18] now it will be executed, and probably at the same time as the failover
[16:52:00] so what's the difference between that task and T124699 then?
[16:52:00] copy from es2005 just finished, restarting MySQL skipping replica and then restarting replica
[16:52:18] I'd honestly call the preparation task resolved and leave the execution part to T124699
[16:52:37] nothing to do
[16:52:45] one is a mysql upgrade
[16:52:51] the other is a mediawiki configuration change
[16:53:09] basically, "the change" for the failover
[16:53:16] not a blocker anymore
[16:53:21] ah ok
[16:53:27] let me edit that
[16:53:30] ok :)
[16:53:32] thank you :)
[16:53:53] there are still some issues, but it is not a hard blocker anymore
[16:54:34] let me add T124795: codfw is in read only according to mediawiki as a blocker instead
[17:01:46] volans: don't want to be that guy, but I think we're going to hit the limit
[17:01:59] I can drop
[17:02:00] and since you're the last one to be added :)
[17:02:05] paravoid: I can drop
[17:02:10] just tell me :)
[17:02:16] thanks :)
[17:02:19] he can attend on my behalf, if you agree
[17:02:29] either works for me :)
[17:02:44] I will let the new guy handle it
[17:02:50] ok :)
[17:02:56] ping me if you need help
[17:03:20] he is the one working on the codfw expansion, after all
[17:09:43] restarting slave on es2007
[17:12:45] what did it take, 0 seconds to catch up?
[17:13:11] or 5 minutes, which is the same
[17:13:32] can you see why lag is not an issue there?
[17:19:30] starting slave on es2009, the others already in sync :)
[17:21:07] and buffer pool reloaded on the first 2 also
[17:26:30] all old DBs ready, sending a CR to repool them shortly, removing the downtime on icinga
[17:26:45] wait, wait
[17:26:59] do not rush the repool, especially the new ones
[17:27:53] I've only done the old ones as of now, I know they are not in prod
[17:27:55] I have 100% confidence in the process, but keep them on for a few days but depooled
[17:28:13] will not block the rest of the things
[17:28:23] and will be much safer
[17:28:34] we have to see they create no lag, etc.
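On the "buffer pool reloaded" remark above: that refers to InnoDB's buffer pool dump/restore across the restart; a quick way to verify it, assuming the dump-at-shutdown/load-at-startup options are enabled on these hosts:

    SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_load_at_startup';
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_load_status';
    -- the status ends with something like "Buffer pool(s) load completed at <timestamp>"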
[17:28:58] I am not too worried about codfw
[17:29:12] but repool only the old ones
[17:29:26] for the new ones, fully agree, I didn't plan to pool them at all today :)
[17:29:47] (although we may want to change the local master before repooling those)
[17:30:07] that is up to you
[17:30:46] we want the new ones first on tendril
[17:30:49] for monitoring
[17:31:05] ok, I'll check on wikitech
[17:31:08] how to add them
[17:31:22] not sure if it is there, I reverse-engineered it
[17:31:42] you need to check out the tendril repo (or use the one on neon)
[17:32:02] and use the tendril-host-add and tendril-host-enable scripts
[17:42:25] ok
[17:45:22] https://gerrit.wikimedia.org/r/#/c/274447/1 for the old ones
[17:47:35] * volans just hates itself, icinga doesn't have a "delete downtime for all services on a host" :(
[17:47:45] s/itself/himself/
[17:49:53] BTW, I do not think you need my ok for simple pool/depool
[17:51:00] especially after discussing the strategy, so just go ahead
[17:51:15] ok :)
[17:51:24] you may want to add the new ones as a comment, though
[17:54:30] once they are ready, sure, I didn't want to start mysql and the replica during the meeting, I'll do it now
[17:55:54] focus on tendril, having them there is essential for me to not get lost on the topology, and everything else https://tendril.wikimedia.org/tree
[17:56:47] btw, dbtree will have to go away, because that js library does not support loops, and we will need that for the codfw failover
[17:57:51] ok
[17:58:18] moar things for the backlog!
[19:08:04] all the new ones already caught up on replication
[19:18:37] great work!
[19:20:21] I found that on es2018 and es2019 icinga is complaining about the salt minion, I'll take a look later to fix it
[19:26:23] did you sign the salt key?
[19:27:48] I think you did
[19:28:05] maybe it crashed? try running puppet
[19:53:44] last time papaul did puppet and salt, I'll take a look later
[19:53:57] what's up on m3?
[19:54:33] never mind, saw the backlog
[19:54:58] yep, will downtime it for a week
[19:55:15] ok, thx
[19:55:17] keep an eye on tendril on db1048 just in case it breaks and we do not notice it
[19:55:33] ok
[19:55:45] we should have different alarms for delayed replica and broken replica IMHO
[19:56:02] * volans back to dinner then :) bbl
[19:56:12] we do
[19:56:29] I will only apply it to lag
[21:19:17] FYI salt fixed on es2018/9 (I had to delete the key on neodynium, restart salt and re-accept it)
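A sketch of the distinction behind that last point about alerting - a delayed replica and a broken one look different in SHOW SLAVE STATUS, so they can be alerted on separately; the exact thresholds are whatever the checks use:

    SHOW SLAVE STATUS\G
    -- delayed: Slave_IO_Running = Yes, Slave_SQL_Running = Yes, Seconds_Behind_Master above a threshold
    -- broken:  one of the threads = No, Last_IO_Errno / Last_SQL_Errno non-zero,
    --          and Seconds_Behind_Master reported as NULL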