[00:37:04] Krenair: we can't really do that, except vaguely if the table is trafficked heavily enough to show up in sampling
[00:37:23] probably not
[00:37:29] ok, never mind then
[01:53:01] jynus: was thinking about db1035 issues; the last time we repooled an s3 slave was (I think) before the fix to the hhvm mysql connect timeout that was creating similar log noise. it's possible the problem is not new or specific to db1035, and we just missed it before
[01:53:15] lost in the background noise
[01:53:57] that said, this wasn't an issue with s4 before the hhvm problem
[01:54:46] potentially just another data point saying we need to split s3, as you already noted in T106847
[02:01:44] jfyi, tendril will have a data gap today. needs upgrade and reconfig for the OOM restarts
[10:25:13] I can now confirm it is not a db1035-only issue
[10:26:09] it is a "busy issue", and db1035 was busier than usual during warmup. It is happening both on s1 and s4, too
[10:27:50] Krenair, we have per-table stats, but only totals, not diffs, so you will need to query twice to get "recent stats". There is P_S that may help with more detailed stats, but it is not rolled in because of performance concerns
[10:29:57] the labs sync failed partially, I think because of the lack of a proper index to iterate over (this is a common problem of pt-tools on our setup)
[10:31:33] It managed to do 20 000 fixes; now running pt-table-checksum instead to hopefully sync the rest manually
[10:32:41] pt-table-sync has higher replication overhead, so we should keep an eye on the 1051->1069:1->labs replication
[10:32:54] *pt-table-checksum, I meant
[13:27:31] I've created T107282. I think 0.75 BP is already conservative, and works well on almost all hosts. But I would like to review memory usage for s3, labs, research/dbstore
[16:57:13] getting some instances of T107265, ignore for the moment
[19:20:23] I fixed T106470, I may leave the replication like that for some days and check other large tables, if you are ok with that
[19:22:06] one thing I learned is that pt-table-sync will not work in many cases for us (due to lack of autoinc PKs) and *it may corrupt the master*. <-- this last is only a suspicion, I have not proved it yet
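
For reference, the "query twice" approach from the 10:27:50 message (per-table totals only, no diffs) could be sketched roughly as below. This is only an illustration under assumptions: the log does not say where the totals come from or what the counters are called, so the snapshot shape and the names `rows_read` / `rows_changed` are hypothetical, and the two snapshots are plain dicts rather than real query results.

```python
from typing import Dict

# Hypothetical snapshot shape: {table_name: {counter_name: cumulative_total}}.
# In practice each snapshot would be built from one run of the stats query.
Snapshot = Dict[str, Dict[str, int]]


def diff_table_stats(before: Snapshot, after: Snapshot) -> Snapshot:
    """Return per-table counter deltas between two cumulative snapshots."""
    deltas: Snapshot = {}
    for table, counters in after.items():
        old = before.get(table, {})
        table_delta = {}
        for name, total in counters.items():
            change = total - old.get(name, 0)
            # A negative change suggests the counters were reset between the
            # two snapshots (e.g. a server restart); fall back to the new total.
            if change < 0:
                change = total
            if change:
                table_delta[name] = change
        # Only report tables that saw any activity in the interval.
        if table_delta:
            deltas[table] = table_delta
    return deltas


# Made-up numbers, purely to show the shape of the result.
before = {
    "revision": {"rows_read": 1000, "rows_changed": 10},
    "page": {"rows_read": 500, "rows_changed": 5},
}
after = {
    "revision": {"rows_read": 1600, "rows_changed": 10},
    "page": {"rows_read": 500, "rows_changed": 5},
    "user": {"rows_read": 20, "rows_changed": 0},
}
recent = diff_table_stats(before, after)
print(recent)
```

Tables with no change between the two snapshots are dropped, so the result only lists what was touched "recently"; how far apart to take the snapshots is left open here.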