[08:27:28] morning
[08:27:36] morning there
[08:32:19] so to copy the remaining ones today I need to change the topology of es2/es3 on codfw, I have dream about it or did you mention that we have some scripting for that? :)
[08:33:40] s/have/had/
[08:33:44] yes
[08:33:48] but first, tendril
[08:34:08] we need feedback on the state of the servers
[08:34:57] ok, I had planned to start the copy and do tendril, this works too
[08:35:23] there is operations/software/dbtools/repl.pl
[08:51:06] care to give me one hand for this schema change?
[08:51:33] one extra hand will be great
[08:54:33] sure
[08:54:41] sorry was digging into tendril scripts :)
[08:54:57] so see my comments in -operations
[08:55:23] I predict replication breakage of dbstore* labsdb* and db1047
[08:55:54] I just want you to have tendril and icinga handy if you see something else
[08:56:19] ok, makes sense, you're adding the column multiple times from the point of view of the multisource, will you just skip it?
[08:56:27] yes
[08:56:39] mysql -e "SET default_master_connection='$shard'; SET sql_slave_skip_counter = 1; START SLAVE;"
[08:56:55] I think it is safer for production to execute it on the master
[08:57:12] than doing it individually and skipping some hosts by accident
[08:57:14] instead of each single server locally, agree
[08:57:28] as we do not use multisource in production
[08:57:49] dbstore and labs can take a 10-second replication hit
[08:58:22] unfortunately there isn't ADD COLUMN ... IF NOT EXISTS
[08:58:30] yeah :-)
[08:58:37] do you want me to fix some of the replicas or will you do all of them?
[08:58:48] happy if you can help
[08:58:53] which ones do you want to take?
[08:59:13] labsdb*?
[08:59:17] ok
[08:59:21] login to 1 and 3
[08:59:50] check that it effectively breaks before skipping, and I will tell you when I apply it to the masters
[09:01:21] so, about to do s2
[09:01:30] when you are ready
[09:01:33] labsdb1003.eqiad.wmnet and labsdb1001.eqiad.wmnet?
[09:01:38] yep
[09:01:48] mmmh... show slave status is empty...
[09:02:05] ha
[09:02:20] welcome to the great multi-source replication syntax
[09:02:30] check my line up there
[09:02:41] SET default_master_connection='s2'
[09:02:47] ok
[09:03:28] also, SHOW ALL SLAVES STATUS;, but with 7 masters, it gets confusing
[09:03:34] ready
[09:04:36] doing on s2
[09:04:39] the skip counter should be SET GLOBAL?
[09:04:48] or not for the multi-repl
[09:05:04] only with default_master_connection set!
[09:05:24] as I said, all very intuitive and all that
[09:05:24] it applies only to my shard thread
[09:05:42] yes
[09:06:02] it's in the mariadb docs, commands either have a 'shard' option
[09:06:10] or you set that variable
[09:06:29] ok, now ready, right?
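(A minimal sketch of the per-shard skip described above, assuming a MariaDB multi-source replica such as dbstore1001; the shard list and the defaults-file path are assumptions:)

    # 1) confirm each broken thread really failed on the expected duplicate-column error
    for shard in s2 s3 s4 s5 s6 s7; do
        echo "== $shard =="
        mysql --defaults-file=/root/.my.cnf \
            -e "SET default_master_connection='$shard'; SHOW SLAVE STATUS\G" \
            | grep -E 'Last_SQL_Errno|Last_SQL_Error'
    done

    # 2) skip exactly one event on a given shard's SQL thread and resume it
    #    (with default_master_connection set, the skip applies only to that connection)
    shard=s2
    mysql --defaults-file=/root/.my.cnf \
        -e "SET default_master_connection='$shard'; SET sql_slave_skip_counter = 1; START SLAVE;"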
[09:06:54] yep
[09:07:01] doing, for real
[09:07:42] broke and fixed on labs
[09:07:54] great, same on dbstore, let me check db1047
[09:08:52] it didn't break db1047, because it has a heartbeat.heartbeat ignore
[09:09:04] which we will have to fix at some other time
[09:09:12] ok
[09:09:45] ok, the next ones should be more of the same
[09:09:55] doing s3
[09:10:01] ok
[09:10:19] done
[09:11:03] all good
[09:11:59] one sec, checking something
[09:13:14] dbstore1001 and dbstore2001 on tendril have replica No,Yes
[09:13:16] also, dbstore2002
[09:13:42] those are ok, we stop the sql thread to delay them
[09:13:59] true, forgot that trick
[09:13:59] but it will be fun tomorrow (24 hours delayed) :-)
[09:14:31] dbstore2002 also has issues, because it does not replicate the heartbeat table
[09:14:38] so those 2 we have to fix
[09:14:47] the next is s4
[09:15:39] fixed
[09:15:44] fixes
[09:15:51] s/s/d/
[09:16:01] s5
[09:16:30] ready
[09:16:34] done
[09:17:03] I am checking all of production at the same time, you never know with a schema change
[09:17:11] s6
[09:17:22] makes sense ;)
[09:17:52] done
[09:18:00] happily, heartbeat has very little traffic
[09:18:07] (1 per second :-P)
[09:18:14] eheh
[09:18:21] s7
[09:18:49] done
[09:19:14] x1, which should not be on labs (but check it)
[09:19:43] not there
[09:20:03] fixed
[09:20:20] and now es2 and es3, let's see
[09:20:42] I think those are not replicated at all
[09:20:56] to multisource, I mean
[09:21:07] ok
[09:21:38] so, I do not need more help, thank you!
[09:21:47] ok, s1 you already did?
[09:22:05] or leave it for last :)
[09:22:16] yes, in theory that should not conflict, because it was the first
[09:22:38] I was wrong for dbstore because I had done m1-5 before
[09:22:44] what about tools and m1/m5?
[09:22:50] ah ok
[09:23:11] I will do es1 and es2, and tools, but I do not expect problems there
[09:23:52] es2 and es3 I guess, es1 is not replicated
[09:24:20] true, for tendril, you just need:
[09:24:35] bash tendril-host-add.sh db1021.eqiad.wmnet 3306 /home/jynus/.my.cnf tendril | mysql -h db1011 tendril
[09:24:47] bash tendril-host-enable.sh db1018.eqiad.wmnet 3306 | mysql -h db1011 tendril
[09:24:56] copied from my bash history
[09:25:19] tendril db is on db1011
[09:25:29] I found the script, I was just about to ask to confirm that db1011 was the right db to go to :)
[09:25:32] thanks!
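(A minimal sketch of the same two tendril commands adapted for one of the new codfw hosts; the hostname and .my.cnf path are placeholders, the rest follows the examples above:)

    # register the new host in the tendril database (on db1011) and enable polling
    bash tendril-host-add.sh es2017.codfw.wmnet 3306 ~/.my.cnf tendril | mysql -h db1011 tendril
    bash tendril-host-enable.sh es2017.codfw.wmnet 3306 | mysql -h db1011 tendril
    # data only shows up after the ~5 minute gathering cycle mentioned below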
[09:25:55] I asked for that first because it would be way easier when we have the tree
[09:26:33] I myself get confused with "the master", no "the local master", no "the previous local master"
[09:26:47] a graph is worth a thousand words
[09:27:14] agree
[09:27:18] it may seem stupid, but when you are working with 150 hosts at the same time, you end up losing track
[09:27:33] (or I may just be getting old)
[09:27:36] :-)
[09:27:44] it's not stupid, it's called stackoverflow :-P
[09:28:11] or in this case more buffer pool :D
[09:28:21] lol
[09:34:34] passengers, if you look on the right window, you can see a "DELETE FROM i_pl WHERE EXISTS ( SELECT 1 FROM enwiki_p.page WHERE pl_from = page_id AND page_namespace = 6 )" blocking the replication thread on labsdb1001
[09:37:00] BTW, information is gathered every 5 minutes, so hold your breath when adding a host for the first time
[09:37:14] ok, I just got an error, checking
[09:37:52] something that used to happen to me is that some hosts use tendril as user name
[09:38:03] while others use watchdog
[09:45:51] yes, no user tendril on the new es2xxx
[09:45:56] there is watchdog though
[09:46:39] use that, or add a new user
[09:46:52] I think on puppet it is tendril now
[09:47:07] but old hosts continued using watchdog
[09:47:48] it is as easy as changing your .my.cnf
[09:48:14] lol for puppet, class definition
[09:48:18] $tendril_user = $passwords::tendril::db_user
[09:48:31] usage
[09:48:33] tendril_user => 'watchdog'
[09:48:47] I don't see the variable used...
[09:50:23] using watchdog then, we might want to fix puppet though
[09:50:31] sure
[09:50:53] or, maybe, just stop using tendril for graphing and integrate graphite
[09:51:04] which would be a larger win
[09:51:39] ehehe, we still need a topology graph
[09:51:42] in any case
[09:51:48] no
[09:51:56] because remember that has to go away
[09:51:57] too
[09:52:08] because the current library does not support cycles
[09:52:27] yes, I mean graphite cannot do a topology graph
[09:52:29] and slow queries will be handled by performance schema
[09:52:39] yes
[09:52:48] I mean, of course it cannot
[09:53:06] I am just saying that we need to do larger changes
[09:53:52] we still need a graph and a reporting system
[09:54:53] oh yes
[10:02:24] all set on tendril
[10:05:54] * volans looking at repl.pl
[10:06:51] so that script is mostly hot, it only creates ~5 seconds of lag on both servers
[10:08:22] in theory, we should never have a master with a higher version than a slave, but it is not a huge problem
[10:08:40] we don't need to
[10:08:42] at least within the versions and options we are moving
[10:08:54] I thought we would put one of the old ones as local master
[10:09:03] because we still don't trust the new ones
[10:09:11] yes, much safer
[10:09:14] or do you want to avoid the double change of topology?
[10:09:47] no, no issues with that, the script does its job
[10:11:46] if I get how it works, we first need to move all siblings as children of, for example, es2005
[10:11:51] note that the script is only for moving around slaves, so you have to move the children first, then the middle parent
[10:12:00] exactly what I was about to say :-)
[10:12:11] you are getting it faster and faster each time
[10:12:23] and then es2005 switch-child-to-sibling with es2006
[10:12:24] ok
[10:13:32] we run it from the bastion usually?
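(A minimal sketch of that ordering with hypothetical host names NEWLOCAL, CURRENTLOCAL and SLAVE1/SLAVE2; the real es2 invocations follow below in the log:)

    # 1) re-parent every sibling slave under the host that will become the new local master,
    #    waiting for lag to clear (and checking dbtree) between runs
    perl repl.pl --switch-sibling-to-child --parent=NEWLOCAL.codfw.wmnet:3306 --child=SLAVE1.codfw.wmnet:3306
    perl repl.pl --switch-sibling-to-child --parent=NEWLOCAL.codfw.wmnet:3306 --child=SLAVE2.codfw.wmnet:3306

    # 2) only then move the middle parent: detach the new local master from the current one
    #    so that it becomes its sibling and replicates from the same master it does
    perl repl.pl --switch-child-to-sibling --parent=CURRENTLOCAL.codfw.wmnet:3306 --child=NEWLOCAL.codfw.wmnet:3306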
[10:13:33] we will not need a script soon, once 10 masters and gtid
[10:15:02] the good thing about the script, which Sean worked on very well, is that he added some additional checks "is this really a parent of X", etc
[10:15:08] we need more of that
[10:17:23] so I should run:
[10:17:30] perl repl.pl --switch-sibling-to-child --parent=es2005.codfw.wmnet:3306 --child=es2016.codfw.wmnet:3306
[10:17:35] same for es2014 and es2007, then
[10:17:41] perl repl.pl --switch-child-to-sibling --parent=es2006.codfw.wmnet:3306 --child=es2005.codfw.wmnet:3306
[10:18:08] looks about right
[10:18:19] wait a bit between runs
[10:18:27] so that there is no replication lag
[10:18:47] sure, do we clone the code on a bastion?
[10:19:00] (also to check the tree is right, etc.)
[10:19:20] I used to have it on iron, but it is under maintenance
[10:19:36] I am using terbium as my "center of operations"
[10:19:48] lol
[10:20:07] however, I think, unlike iron, terbium's ip may not have all grants
[10:20:25] you should replicate your "center of operations" on codfw too ;)
[10:21:45] looks like a task!
[10:22:22] but at some point we have to compromise and get things done, put it on the todo list
[10:22:39] should we ask for a dbops user to have a common place?
[10:22:55] ask?
[10:22:58] who?
[10:23:02] ourselves :D
[10:23:44] so, we had iron, terbium and db1011
[10:24:22] use or set up whatever works for you
[10:24:42] I think there is a codfw terbium
[10:25:15] I'm on bast2001
[10:25:56] I do not like bast hosts because they are shared with non-ops
[10:26:32] ah ok
[10:27:32] we should totally create one only for db operations, especially for long-running tasks
[10:28:02] run it from iron this time, let's not bikeshed it
[10:28:16] yes, things that run on screen and we need to pass over between people
[10:28:26] ok iron it is
[10:28:38] but we have to think about a host on eqiad and one on codfw
[10:29:10] maybe terbium is not the best either because it is shared with mediawiki's cron
[10:30:04] it could be the salt master, after all, it is used for the same things at the admin layer
[10:30:36] and we may need at some point not only mysql remote execution, but also arbitrary remote execution
[10:30:42] do we have a second salt master?
[10:31:13] I am looking at it
[10:31:51] there is indeed
[10:31:54] sarin
[10:32:49] at least that is what site.pp says
[10:34:03] :)
[10:37:07] ready to change topology
[10:37:33] so if people agree, we will remove iron's grants and add them to those hosts
[10:40:10] cannot see it yet: https://dbtree.wikimedia.org/
[10:40:26] I haven't pressed enter yet, I was opening shells...
[10:40:32] on the targets :)
[10:41:36] done
[10:42:56] next, next, next
[10:43:01] looks good, going ahead with the others :)
[10:43:06] lol
[10:43:25] with this you will have done everything except a master failover
[10:43:41] so I will be able to take a vacation :-)
[10:43:59] :-P
[10:46:10] doing the same for es3
[10:46:47] should I remove es2010 from tendril?
[10:46:58] or do we think it could resuscitate...
[10:48:39] let's keep it for now, even if it doesn't, it will be faster doing all at once
[10:49:00] ok
[10:49:26] decom takes a while, better doing all together
[10:50:27] I would put es2006 as a slave, so when we stop it, we do not bother the real master
[10:50:40] but your call
[10:50:56] true, it's safer, I'll do it
[10:53:15] for es2010, just ignore it, as it is down
[10:53:22] all done, yes I cannot change it :)
[10:53:58] everything looks good so far, doesn't it?
[10:54:09] with the whole failover, I mean
[10:54:15] I am happy about that
[10:54:16] yes
[10:54:23] that's actually pretty cool
[10:55:02] I'll depool 2006 and 2008 and at this point why not also 2004 (es1) to copy them
[10:55:20] mmh let me think 16h
[10:56:01] maybe es1 better in the afternoon so it will finish tomorrow morning
[10:56:30] as you wish, just do not block on me for simple pools/depools
[10:56:45] nope :)
[10:57:30] if you see anything strange on scap, ping, however, it sometimes breaks, or there are other pending changes, etc
[10:57:58] ok
[11:00:07] and I should be stopping the profiling I started yesterday
[11:13:01] if it still has space :)
[11:15:17] * volans will at some point remember to put the right commit message at the first attempt
[11:23:31] a mere 1.2 GB
[11:24:02] although I changed the rate from 1/20 to 1/100 on the slave yesterday, because it was too large
[11:24:27] I see
[12:04:12] started copying data, for es3 I chained them es2008->es2017->es2019, checking ganglia
[12:12:42] you know that if es2019 starts going faster, it will not work, right?
[12:13:30] in fact, if you use tar, it shouldn't work
[12:14:05] it's chained, not parallel
[12:14:17] otherwise it will go much slower
[12:14:31] on es2017 I have a tee on a mkfifo
[12:14:38] ah!
[12:14:41] that sends to es2019
[12:14:58] I thought you were scanning 17 again
[12:14:59] but we can use the full duplex of es2017 both in rx and tx
[12:15:02] that is good
[12:15:44] we could go crazy and set up a bittorrent/multicast option for copying, but that will work just as well for what we want
[12:15:52] I was looking for solutions and found that the guys at tumblr had the same problem :)
[12:15:55] http://engineering.tumblr.com/post/7658008285/efficiently-copying-files-to-multiple-destinations
[12:16:36] nice, we should include that in the eventual script
[12:17:08] does it read and write at full duplex?
[12:18:06] and what do I know that guy from?
[12:18:56] former FB db? (from the blog)
[12:19:35] it is going a bit slower but not that much, still ~78MB/s on es2019 before decompressing
[12:20:32] so, the student has already surpassed the sensei, my job is done
[12:21:57] lol, jyoda :-P
[14:05:51] there is something at some point of the transfer that slows it down a lot for a few minutes, then back to normal
[14:06:12] it happened yesterday too, only once, both es2 and es3 copies... I'm wondering what it could be
[14:06:19] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20codfw&h=es2008.codfw.wmnet&m=bytes_in&r=hour&s=by%20name&hc=4&mc=2
[14:06:50] check timestamps
[14:07:14] it could be some process going at the same time
[14:07:48] or just some scalability point, like hitting some memory limit, etc
[14:08:52] seems more data-dependent, all machines seem to reach it at the same point of the copy, but at different absolute times
[14:09:55] how much is it, broadly, in GB/percentage?
[14:10:02] could be binlog or ib_logfile I guess
[14:10:09] ah, yes
[14:10:26] copying small files vs ibdata1
[14:11:16] those also would compress more, and would make CPU the bottleneck
[14:13:13] CPU instead was ~70% for the same period, not 100%
[14:14:16] so it seems that after today we will be out of the woods, right?
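(A minimal sketch of the chained es2008 -> es2017 -> es2019 copy described above, with the tee on a fifo, in the spirit of the tumblr post; the port, datadir path, pigz compression and the exact netcat listen flags are assumptions and depend on what is installed:)

    # on es2019 (last hop): listen, decompress and unpack
    nc -l -p 4444 | pigz -dc | tar -C /srv/sqldata -x

    # on es2017 (middle hop): receive once, tee the stream into a fifo that feeds
    # the next hop, and unpack the local copy at the same time (rx and tx in parallel)
    mkfifo /tmp/es3.fifo
    nc es2019.codfw.wmnet 4444 < /tmp/es3.fifo &
    nc -l -p 4444 | tee /tmp/es3.fifo | pigz -dc | tar -C /srv/sqldata -x

    # on es2008 (source, with mysql stopped): stream the compressed datadir out
    tar -C /srv/sqldata -c . | pigz | nc es2017.codfw.wmnet 4444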
[14:15:30] yes, es1 is still to do, but it is not that urgent, I'll launch a copy of 1 or 2 later
[14:15:40] if you are interested in knowing more about our load, I left the processed query profiling on terbium, in my home
[14:15:57] es2 and es3 will have the full new machines ready for prod by eod
[14:16:45] Exec time 4ms 95-percentile, which is not that bad
[14:17:20] do we want/have an easy way to give them a bit of a load test without touching the data? (like percona playback)
[14:17:30] sure, I'll take a look, 4ms looks good
[14:17:48] that is on the older machines, 176us on the newest (before the new batch)
[14:18:31] it is not very representative, that is including things like "SET" or "SELECT 1"
[14:19:07] percona playback is broken
[14:19:27] but we will set up something
[14:20:13] probably using the actual mediawiki, instead of faking it
[14:23:00] we have enough capacity to not worry too much about performance on codfw, but focus on regression testing
[14:23:55] (remember that just before you came, I set up 21 machines on codfw by myself)
[14:23:56] oh yes, my "load test" was to see if they broke under real usage, not really performance
[14:25:17] plus the 3 parser caches
[14:26:50] a lot!
[14:27:19] plus upgrading half of eqiad from precise to trusty
[15:19:32] if you are around, can you give https://gerrit.wikimedia.org/r/274670 a check
[15:20:47] ok
[15:23:41] I've actually just stopped a huge mistake
[15:25:09] isn't it missing a bunch of servers in the first block?
[15:26:06] yes, that :-)
[15:26:17] s/spotted/stopped/
[15:26:54] and that is why you do reviews-
[15:27:42] binlog by default is mixed, you have changed it to statement too?
[15:28:16] no, that was a pending issue, masters are in STATEMENT
[15:28:30] but I forgot to change it back on puppet
[15:28:53] I want them in mixed/row, but we are not ready yet
[15:29:20] ok, just checking
[15:29:32] and thanks for that
[15:29:39] I should add it to the description
[15:29:53] the real thing I don't like is that we are basically coupling masters with puppet
[15:30:02] they are already coupled with mediawiki config
[15:30:23] I agree, that should be an orchestration thing
[15:30:32] I know pt-heartbeat is just a check, but won't this complicate a master failover?
[15:30:37] but a) that is already a thing
[15:31:01] b) it is better like this than running it manually
[15:31:46] we can set 2 servers with $master=true, independently of the topology
[15:32:12] so it is not a blocker
[15:32:54] we could literally set that on both local masters on eqiad and codfw, and we wouldn't care
[15:33:20] they will not overwrite each other?
[15:33:25] nope
[15:33:44] $master here only means "run pt-heartbeat"
[15:34:30] we could failover without it, but I want it with that name so we do not forget about it
[15:34:43] (all checks would start to fail)
[15:35:18] I accept alternatives, that is what reviews are for
[15:35:43] * volans rechecking docs for pt-heartbeat
[15:39:24] we could have the master on hiera, too
[15:39:32] ok, the writes don't overlap, it uses the server_id, and the checks are run by icinga
[15:39:42] yes :-)
[15:39:56] I'm thinking of today, for example, I changed the local master on codfw for es2/3
[15:40:14] well, you didn't touch the original master
[15:40:35] and I will tell you- given that touching it would mean downtime, it really doesn't matter
[15:40:36] no, but if we had them running on both local masters (eqiad and codfw)
[15:41:04] if we check only from the primary master it probably doesn't matter
[15:41:34] I am *not* suggesting we should have that, but that for the process, we could do it with no problem
[15:42:23] actually I check now based on shard, so with dual pt-heartbeats, we would get (in general) the one from the local master
[15:42:56] see my other patch:
[15:43:08] https://gerrit.wikimedia.org/r/274680
[15:43:26] "shard = ? ORDER BY ts DESC LIMIT 1"
[15:43:28] if you don't specify the server_id it automatically gets its direct (aka local) master
[15:43:51] yes, that is what is happening now
[15:44:09] the above patch will change that to use the one in the shard
[15:44:40] that will avoid having to tell every single node "this is your master" "this is your master"
[15:44:49] so I am decoupling as much as I can
[15:45:10] but we need to still run pt-heartbeat somehow, and not manually
[15:45:31] if this is properly documented for the failover, it should not be a huge issue
[15:46:30] we can later migrate it to the orchestration tool that we do not have :-)
[15:46:45] I was thinking of an alternative, but I still don't like it: run pt-heartbeat everywhere from puppet, the first thing it does is check if it has a master, if it does, exit, and puppet will retry in half an hour and so on
[15:47:03] everywhere?
[15:47:29] that doesn't work
[15:47:40] eventually, all servers will be slaves, even the master
[15:47:54] (think circular replication)
[15:48:19] that was my follow-up question, we didn't yet talk about the final topology with both DCs running :)
[15:48:26] https://phabricator.wikimedia.org/T119642
[15:49:15] we could do that now already... if it wasn't for tendril being broken on circular replication, and not yet trusting someone doing "tests on codfw"
[15:50:11] ok, so classic multi-master active/passive
[15:50:34] yep
[15:50:42] eventually, something more like
[15:50:44] https://gerrit.wikimedia.org/r/274680
[15:51:03] but we are not yet there, especially when the failover is in 2 weeks
[15:55:37] BTW, there is not going to be "both DCs running" yet
[15:55:57] (for mediawiki)
[15:56:28] that would be the following step after the failover, and it will not be easy
[15:56:52] ok, I don't have better ideas for now, just ugly ones... if something comes to my mind I'll let you know
[16:00:17] we could have it installed on all servers and run it manually, but I do not trust myself with it
[16:01:26] I still accept reviews: https://gerrit.wikimedia.org/r/274670
[16:10:35] for gerrit 274680, the logic looks ok, but I cannot guarantee on the perl side :-P
[16:10:58] well, I checked the perl manually on production
[16:11:11] but I will downtime all replication checks for 2 hours
[16:11:19] to avoid a mass-paging
[16:11:49] tendril still uses the classic seconds behind master, so use that for the time being
[16:12:19] btw, db1048 would have paged again (lagging)
[16:13:38] what is your preferred go-to scripting language?
[16:13:58] for something larger than one-liners, I mean
[16:18:41] usually python, but happy to adapt, perl is really the one I actually managed to avoid along the way :)
[16:18:57] no, actually I prefer python
[16:19:27] it is what I am writing my things in- but that doesn't mean I will rewrite a script that has worked for years
[16:19:43] of course not
[16:20:13] so, the patch worked correctly- it detected heartbeat running, so it didn't restart it
[16:20:22] ok
[16:20:26] I am now going to manually kill it and start the new one
[16:22:02] (my comment was, given that we have people much more perlish than me, if you need a serious perl review there are better fits ;) )
[16:22:05] ok
[16:22:38] yes, no problem with that
[16:22:58] I just wanted to have some supporters to write more python tools
[16:23:12] yay!
[16:34:44] I will create a big fat section on pt-heartbeat on wikitech, so that if for any reason it fails (if we break puppet), we can run it manually
[16:35:05] sounds like a good idea
[16:35:33] all of this is not only for us, mediawiki will depend on this soon
[16:36:03] because otherwise it is not possible to see the real lag on multi-datacenter slaves
[16:41:10] "luckily", because mediawiki knows the topology, it will be able to check the right one
[16:41:29] not really
[16:41:41] the whole shard thing, I created it
[16:42:02] because it doesn't know about the id
[16:42:21] and it had to a) query every time, which was a big no
[16:42:50] b) store it on memcache/app cache, which created worse problems
[16:42:59] of sync on failover
[16:43:17] shard is purely static- you cannot just "go to a different shard"
[16:43:23] jynus: here's a random thanks for being awesome comment.
[16:43:54] lol, random indeed!
[16:46:03] lol, and ok :)
[19:28:39] (log) all running copies completed, restarted replicas on both old and new, added new to tendril
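(A minimal sketch of the manual fallback and of the per-shard lag check discussed above; the defaults-file path, the --utc choice, the example slave host and the shard column — which is specific to our heartbeat schema, not to stock pt-heartbeat — are assumptions:)

    # on the master, if puppet is broken: start the heartbeat writer by hand
    pt-heartbeat --defaults-file=/root/.my.cnf -D heartbeat --table heartbeat \
        --update --interval 1 --utc --daemonize

    # on any slave, read the freshest heartbeat row for a shard and derive the lag,
    # matching the "shard = ? ORDER BY ts DESC LIMIT 1" logic of the patch quoted above;
    # SLAVE_HOST is a placeholder
    mysql -h SLAVE_HOST -e "
        SELECT shard, TIMESTAMPDIFF(SECOND, ts, UTC_TIMESTAMP()) AS lag_seconds
        FROM heartbeat.heartbeat
        WHERE shard = 's1'
        ORDER BY ts DESC
        LIMIT 1;"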