[10:21:14] * volans restarting replica on dbstore1001 skipping the various missing hitcounters
[10:31:24] jynus: given that you're around, I would like a second pair of eyes if you have a minute
[10:32:07] alternatively, we stop replication and put a filter on %.hitcounter et al
[10:32:19] yes?
[10:32:56] sanitarium:3311... I found a broken replica now and I was trying to fix it
[10:33:11] mmmm
[10:33:15] so the first error was a duplicate key on revision, given that sanitarium too on enwiki
[10:33:19] not the first time it happend
[10:33:25] has the primary key rev_page, rev_id
[10:33:39] not sure if tokudb issue on insert
[10:34:03] I've checked the content was exactly the same of the existing one except for rev_text_id that is forced to 0 by the trigger
[10:34:07] a key issue due to unsafe statements due to insert select
[10:34:15] so I decided to skip and restart... BUT
[10:34:30] or revision is in a bad state
[10:34:37] do not give it too much thought
[10:34:46] because I have not yet reimported it
[10:34:46] the problem was not that
[10:34:50] was after the start slave :D
[10:34:54] Relay log read failure: Could not parse relay log event entry
[10:35:00] ah, yes
[10:35:22] so I'd like a second pair of eyes before doing a change master to force the redownload of relay from the master
[10:35:24] just reset slave all; change master (obviously, save de coords first
[10:35:51] that is why I think it is probably a tokudb / version issue
[10:35:57] that used to happen a lot
[10:36:02] no only there
[10:36:18] which means it could be <= .15 related
[10:36:32] so basically you were doing what I did
[10:37:03] yep, just that one thing I cannot explain
[10:37:04] just make sure you use the right master log
[10:37:28] on the mysqlbinlog on the master at that position there is one more query before the one that breaks
[10:37:52] I'm looking at Relay_Master_Log_File: db1057-bin.002016 Exec_Master_Log_Pos: 1035866585
[10:38:59] db1057-bin.002016
[10:39:09] not 18, which is the downloaded
[10:39:11] right
[10:39:30] yes of course
[10:39:50] this is one of the times in which we do not do things 100% well because they are going to be reimported anyway
[10:40:11] there is like 20 reports of data differences
[10:40:23] in part also because the filters
[10:40:59] however, my plan to simplify both labs and dbstore management
[10:41:23] is to get rid of toku- that would allow to simply copy raw table files
[10:42:35] yeah
[10:45:19] ok, I'll proceed :)
[10:45:49] I cannot login now to phab, but I mention because I searched procurement agreement on dbstore-like task
[10:46:18] that should fix it, at least for a few days
[10:46:44] we must thing on upgrading that- if possible, directly to 10.1
[10:46:50] *think
[10:47:09] it would help a lot, yes
[10:47:49] the other thing I want to mention (despite logging)
[10:47:57] was the x1-codfw
[10:48:19] I found an unused piece of hardware and reimagined it
[10:48:26] *reimaged
[10:48:46] sorry you did db2008 for nothing
[10:49:00] (actually, it helped)
[10:49:09] no prob at all :)
[10:49:28] but db2033 has better, on warranty hard
[10:49:51] and the more old machines we can get rid off, the better
[10:51:57] on https://tendril.wikimedia.org/host/view/db1069.eqiad.wmnet/3311
[10:52:06] you can see this happened a number of times
[10:52:20] plus one time where update got "stuck"
[10:52:37] yes
[10:52:40] (weekly replication lag)
[10:52:58] restarted, is recovering now, all looks good
[10:53:24] my bet on bug
[10:53:54] maybe only happens on 10 master
[10:54:23] these are the things in which slave version > master
[10:54:27] does matter
[10:54:50] yes, make sense
[10:55:13] although it already had all the relay logs locally
[10:56:00] I do not have counter-argument to that :-)
[10:56:30] but the fact is it is the only .15 enwiki slave
[10:57:21] together with labsdb1001 true
[10:59:30] labsdb1001 is not a direct slave, does not count
[11:00:23] bump, at your discretion, the priority of the sanitarium upgrade ticket
[11:00:32] yeah it's just a slave of sanitarium...
[11:01:23] it's already high :) (the ticket)
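
Note on the recovery discussed at 10:35-10:39: the procedure was to save the replication coordinates, run "reset slave all", then "change master" pointing at the last executed position (Relay_Master_Log_File / Exec_Master_Log_Pos), not at the last downloaded file. A minimal Python sketch of that idea follows. It only builds the SQL text and does not connect to any server. The host name, user, port, password placeholder and the Master_Log_File value in the sample are assumptions for illustration; only the db1057-bin.002016 file and the 1035866585 position come from the log above.

```python
# Sketch only: derive the statements for re-pointing a replica at its last
# *executed* master position, as discussed in the log. Nothing here talks to
# a real server; the output is plain SQL text for a human to review.

# Hypothetical excerpt of SHOW SLAVE STATUS in vertical (\G) format.
# Master_Host, Master_User, Master_Port and Master_Log_File are assumed values.
SAMPLE = """
*************************** 1. row ***************************
                  Master_Host: db1057.eqiad.wmnet
                  Master_User: repl
                  Master_Port: 3306
              Master_Log_File: db1057-bin.002018
        Relay_Master_Log_File: db1057-bin.002016
          Exec_Master_Log_Pos: 1035866585
"""


def parse_slave_status(text):
    # Turn "Key: value" lines into a dict; separator rows and
    # empty-valued fields are skipped.
    status = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:
            status[key] = value
    return status


def rebuild_replication_sql(status, password="<password>"):
    # Use the executed coordinates (Relay_Master_Log_File, Exec_Master_Log_Pos),
    # not Master_Log_File, which is only the last file downloaded.
    # The password is not shown by SHOW SLAVE STATUS, so it stays a placeholder.
    change_master = (
        "CHANGE MASTER TO "
        "MASTER_HOST='{host}', MASTER_PORT={port}, "
        "MASTER_USER='{user}', MASTER_PASSWORD='{password}', "
        "MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={log_pos};"
    ).format(
        host=status["Master_Host"],
        port=int(status["Master_Port"]),
        user=status["Master_User"],
        password=password,
        log_file=status["Relay_Master_Log_File"],
        log_pos=int(status["Exec_Master_Log_Pos"]),
    )
    return ["STOP SLAVE;", "RESET SLAVE ALL;", change_master, "START SLAVE;"]


statements = rebuild_replication_sql(parse_slave_status(SAMPLE))
for statement in statements:
    print(statement)
```

Since RESET SLAVE ALL discards the stored connection settings, the sketch carries the host, port and user over from the saved status and leaves the password as a placeholder to fill in by hand.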