[09:03:16] next "broken replication" on the delayed slaves expected at 10:05 (s2), 10:10 (s3), 10:15 (s4), 10:16 (s5), 10:17 (s6), 10:18 (s7), 10:19 (x1)
[09:03:39] lol
[09:03:53] I do not think it will be so smooth
[09:04:21] m3 failed at 4-5am
[09:04:53] in the sense that it is not 24h?
[09:05:08] *all times above are in GMT+1
[09:05:10] this delayed replication does not work so well when there is very little load
[09:05:37] I would change it, but it is not worth it
[09:06:36] it seems that heartbeat is either run manually on the old slaves, or the new execution works as well
[09:07:40] and the check works because it doesn't check for the executable, but for the table
[09:08:02] so, here is the plan: execute it manually on the old slaves
[09:08:13] no need for an extra check, because the current one works
[09:08:27] and if heartbeat fails, *WE* will notice
[09:08:51] and then focus on the heartbeat execution/maintenance only on the new masters
[09:09:00] which will be in production as soon as the failover happens
[09:09:28] +1, we can be human puppets for the old master's heartbeat :)
[09:09:29] not migrate any critical system to heartbeat (like mediawiki) until the failover
[09:09:49] and until we spend a few weeks testing it all
[09:10:14] makes sense
[09:10:35] bad things that will require a solution: intermediate slave lag
[09:11:11] in the sense that mediawiki will not see the delay, right?
[09:11:14] but given that we should only have 3 tiers a) during failover b) for WAN replication, which do not page
[09:11:26] I think it is acceptable
[09:12:03] we will see, we will constantly tune between noise and alertness
[09:12:23] mediawiki will continue using seconds behind master until a patch is applied
[09:12:35] but that is ok, because there are other blockers for multi-tier slaves
[09:12:56] for example, recently I discovered that slave delay is checked using the binlog
[09:13:13] and that doesn't work for multi-tier slaves
[09:13:20] the position?
[09:13:23] yes
[09:13:26] too bad
[09:13:29] not for delay
[09:13:46] but for "is this slave fresh enough after writing?"
[09:14:04] it will have to change to either gtid, as we will do,
[09:14:08] or heartbeat
[09:14:24] (which would provide a pseudo-gtid)
[09:14:29] [s2 on dbstore2001]
[09:14:33] are you already there?
[09:14:43] doing it
[09:17:09] if you need help let me know
[09:17:37] it is ok, focus on your things
[09:17:57] it is a waste of time for 2 people to do the same thing
[09:23:03] so, how accurate was that compared to your prediction? :-)
[09:24:02] not at all :D
[09:24:20] better than I expected
[09:24:40] you know, time is relative...
[09:24:57] we didn't take into account relativistic errors
[09:25:03] that was it
[09:25:34] next time I'll multiply by sqrt(1-v**2/c**2)
[09:26:45] a better way to accelerate servers: -9.8 m/s^2
[09:43:46] when you have a min: https://gerrit.wikimedia.org/r/#/c/274914/1
[09:45:23] if you think it's "too much" I can do only 2 in es1; I applied the principle that they can be repooled quickly (no replication to catch up) and codfw is not yet prod
[09:45:33] +1
[09:46:04] the ones with load 1 are the masters, right?
[09:46:13] "intermediate masters"
[09:46:50] yes, we changed the topology so 2009 and 2005 are local masters in codfw for es3/es2
[09:46:53] it doesn't matter much, but I usually put those first in the config to avoid misunderstandings
[09:47:13] e.g. we failover to codfw and delete the first line (real masters)
[09:47:33] I hope you agree with that
[09:47:41] makes sense, let me update it
[09:47:56] let me comment on the patch so it is in written form
[09:49:56] BTW and FYI, codfw.php is pending a large patch: https://gerrit.wikimedia.org/r/#/c/267659/
[09:50:56] good to know
[11:19:23] is there an easy way to get the changed commits of a submodule update directly from gerrit?
[11:19:52] pfff
[11:20:11] from the local machine it is easier
[11:20:17] I guess not, according to https://code.google.com/p/gerrit/issues/detail?id=1832
[11:21:09] in theory, we should always point to HEAD, unless there is a reason not to
[11:21:48] which, btw, I am not sure if you and mutante did after your last commits
[11:22:47] it was not you, it was mutante
[11:22:54] who did not update the main repo
[11:23:25] not a big deal
[11:23:27] ok, I'll keep it in mind
[11:24:51] so, as puppet runs first on the hosts, and later on neon, the replication lag should get filled with decimals all around
[11:25:19] great
[11:25:21] not that we need the decimals, but it is the confirmation that it is working
[11:48:10] (for my own log) hosts that have not been fully heartbeat'ed: toolsdb (not active), dbstore[12]001 (delayed slaves, it will take 24 hours to take effect), db1047 and db2002 (need a restart/change of the replication filter)
[12:12:07] db1048 and db2012 need to whitelist heartbeat, too
[18:36:17] SSL proposals are on the ticket... there is no TL;DR though... be prepared ;)
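For reference, the heartbeat-based lag check discussed above ([09:07:40], [11:24:51]) boils down to reading the timestamp that the master keeps writing into the heartbeat table, rather than relying on Seconds_Behind_Master or binlog positions, which do not carry across intermediate masters ([09:12:56]-[09:13:46]). A minimal sketch, assuming pt-heartbeat's default heartbeat.heartbeat table and columns; the actual table, columns and monitoring query used in production may differ:

    -- Run on a replica: lag is the age of the newest heartbeat row
    -- written by the top-level master, identified by its server_id.
    -- @master_server_id is a placeholder for that id.
    SELECT TIMESTAMPDIFF(MICROSECOND, ts, UTC_TIMESTAMP(6)) / 1000000.0 AS lag_seconds
      FROM heartbeat.heartbeat
     WHERE server_id = @master_server_id;

Because the master updates the row with microsecond precision and the row replicates through any intermediate masters, the lag shows up on every tier with decimals, which is the confirmation mentioned at [11:25:21].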