[08:03:45] jynus: are you working on dbstore1002? replica stopped on s1 for duplicate entry
[08:07:16] from the backlog it seems you're importing data from dbstore1002, not to it... but double checking
[08:15:15] yeah, that is something else
[08:15:27] but I am trying to fix it by reimporting from the master
[08:16:31] what do you mean?
[08:17:17] the archive table seems to have gone out of sync
[08:18:37] probably because of unsafe statements (INSERT...SELECT) + STATEMENT replication + TokuDB + locks
[08:19:09] probably
[08:22:13] there is a chance of out-of-band writes, too
[08:22:39] that is one of the only 2 slaves that is not read-only, due to analytics usage
[08:23:38] but the failures seem to have to do with the archive table, which is highly conflictive because it is always written with INSERT...SELECT
[08:24:00] something that should be banned, or ROW replication should be used
[08:42:52] one thing you could help me with is db1047
[08:43:03] sure, tell me
[08:43:12] I restarted it yesterday, due to full Aria corruption
[08:43:21] like the one on dbstore2002?
[08:43:25] one of those
[08:43:36] great
[08:43:38] that is ok, and in general everything is ok
[08:43:41] saw it in the backlog
[08:43:54] I converted the table to TokuDB, it should not be an issue again
[08:44:15] the problem now is that replication is behind
[08:44:28] it seems that write activity is lower than yesterday
[08:44:41] so low that in 24 hours it is 22 hours behind
[08:44:53] not cool
[08:45:21] only s1? s2 is stopped
[08:45:23] try to see why - TokuDB issue? too much load? swapping? storage issue?
[08:45:30] I stopped one of the 2 manually
[08:45:38] to see if I could catch up the other
[08:45:44] ok, looking
[08:46:10] (e.g. using the buffer for only one database)
[08:46:34] it could be a storage issue, that RAID had problems in the past
[08:46:51] I/O wait is around 18% for a start
[08:46:55] the question is, whatever it takes, try to make it faster
[08:46:58] ok
[08:47:18] no matter if it compromises the stability of the data, it is not a core production server
[08:47:25] but it is used by analytics
[08:47:49] try to do it without restarting the server, I did that yesterday and people were not very happy
[08:48:04] if it is needed, it is ok, but we need to say so beforehand
[08:48:31] basically check the logs for something wrong like corruption, or RAID issues
[08:48:59] either the iops are going somewhere, or we have lost them
[08:49:57] so the current cache policy for 2 out of the 3 disks is WriteThrough instead of WriteBack, that might explain it
[08:50:53] the reason for that is probably battery issues
[08:50:59] which I saw in the past
[08:51:02] override that
[08:51:29] there is no original data there (and if there is, people should have a copy)
[08:51:36] it may have been reset on restart
[08:54:44] yes, the battery for adapter 1 is faulty
[08:55:07] there is a similar command in the troubleshooting section of mariadb
[08:55:33] just override it, we will order a replacement in the next batch
[08:56:29] I think I disabled the learning mode on all RAIDs too, but be aware there were issues with that in the past
[08:59:27] done
[09:01:32] checking if it recovers more quickly
[09:01:34] ok, now wait and see on tendril if io goes to similar levels / replication lag starts decreasing
[09:01:37] :-)
[09:05:30] TokuDB is an issue, but only 20% because it is TokuDB, and 80% because it is not InnoDB
[09:06:33] I know this is probably hardware, but in the end, I do not think compression is worth all the issues
[09:08:36] lol, btw db1047 also has an LVM logical volume across 2 HW RAID virtual disks
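
(For context, the cache-policy check and override discussed above could be done with something like the following on a MegaCli-managed controller; the MegaCli64 binary name and the choice to target all adapters/virtual drives are assumptions, and exact flags vary by controller and tool version.)

    # Show the current cache policy per virtual drive (WriteBack vs WriteThrough)
    MegaCli64 -LDGetProp -Cache -LALL -aALL
    # Check the BBU state; a faulty/discharged battery usually forces WriteThrough
    MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL
    # Force WriteBack even with a bad or missing battery (trades data safety for speed)
    MegaCli64 -LDSetProp -ForcedWB -Immediate -LALL -aALL
    # Watch I/O wait afterwards to see whether the iops came back
    iostat -x 5
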
[09:10:12] it doesn't seem to have sped up much, I'll continue to look at other ways to speed it up
[09:18:05] at least the trend has inverted, it is starting to recover
[09:18:39] dbstore1002 should be ok now
[09:18:55] and replication lag is going down
[15:29:56] volans, dbstore1001 is at 91% disk usage
[15:30:21] that is ok-ish because there are 700GB free
[15:31:17] ok
[15:31:18] but we may want to check potential optimizations/disk savings
[15:31:25] or a long-term plan
[15:32:51] there are 353 GB unallocated, just in case
[15:48:38] volans, you should be able to see: https://phabricator.wikimedia.org/T123379#2164999 and https://phabricator.wikimedia.org/T131363
[15:49:02] you were asked to be in the loop, and probably will need to be involved a bit
[15:49:47] yes, I have access
[15:50:35] take a look at it, because they may ask questions such as "is 7.6TB ok, or do we go for 11.5TB"
[15:50:59] and it should not be blocked on such easy questions, you can answer those
[16:13:56] If you have the time, could you add a one-liner to wikitech? "mysql servers should be added to X property on hiera:path"
[16:15:15] in the setup of new replicas?
[16:18:17] anywhere, do not give it too much thought :-P
[16:18:50] wherever you can tell me "jynus you did it wrong because you didn't follow the conventions"
[16:19:07] lol
[16:19:24] * volans needs to look more at how the whole hiera setup works
[16:19:35] didn't you modify hiera??
[16:20:24] we should use it more, but every time I do something dirty I say "it is only temporary, so it is not worth it"
[16:20:39] and then it stays like that for years
[16:21:50] yeah, I modified it
[16:26:17] done: https://wikitech.wikimedia.org/wiki/Setting_up_a_MySQL_replica#Puppet
[16:28:54] I will try to update the old coredb module there too at some point
[16:29:56] the whole "topology in puppet" approach is deprecated
[16:30:42] (we would not even want to mark the masters in puppet)
[16:31:46] true
[16:59:23] I've added the expose puppet cert to the coredb role too, it's just generating the files, they will not be referenced in the my.cnf for now
[17:02:30] sure
[18:46:29] should we wait to restart the replica on s2 on db1047 until s1 fully catches up? I think they can run together at this point
[18:47:51] sorry, I didn't monitor it, is it catching up?
[18:48:16] if yes, just start it when you can (or I will do it tomorrow)
[18:48:24] not a big deal
[18:50:26] it is catching up, but still very behind
[19:14:01] I've restarted the slave on s2, I'll keep an eye on it, if it slows down s1 too much I'll stop it again
[19:15:11] it is ok, replication should not be a huge issue
[19:15:18] lag, I mean
[19:15:40] there, analytics use it for long-running queries mostly
[19:16:18] (I got worried when it got to 24 hours and going up)
[19:16:30] yep
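
(For context, db1047 replicates s1 and s2 as separate streams; on a MariaDB multi-source replica the per-connection stop/start and lag check mentioned above could look roughly like the sketch below, assuming the replication connections are literally named 's1' and 's2'.)

    # Stop only the s2 stream so s1 gets the full apply capacity
    mysql -e "STOP SLAVE 's2';"
    # Restart it once s1 has (mostly) caught up
    mysql -e "START SLAVE 's2';"
    # Check lag per connection
    mysql -e "SHOW ALL SLAVES STATUS\G" | grep -E 'Connection_name|Seconds_Behind_Master'
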