[14:47:41] there is some issue with the slave delay on dbstore1001
[15:06:28] jynus: re: https://gerrit.wikimedia.org/r/#/c/285208/
[15:06:48] as the script is currently coded, the check_procs trips on every single invocation of it, making it fairly useless (and noisy)
[15:07:07] I won't attempt fixing el_sync.sh again, but if you're not going to work on it, I'll just remove the icinga check
[15:10:22] ok
[15:11:39] I am just saying that if you merge a change, checking that running it doesn't get into an infinite loop and fill the filesystem is a good idea :-)
[15:12:28] I am the first one to make mistakes, and that script is inherited from Sean; I had no time to work on it
[15:13:40] I support volan*'s call of reverting it
[15:14:17] never disagreed on that
[15:14:54] and yes, that script is horrible :-P
[15:14:55] you should have deployed it instead, probably; let's do that next time :)
[15:15:28] but at least now it is horrible in puppet, and not in a screen session
[15:15:48] (which is how I found it)
[15:18:21] spike of connection errors on db1044
[15:57:48] for some reason, the delayed replication stops the s1 thread on dbstore1001
[15:58:13] however, the only difference between shards is that db1052 is now on mariadb 10
[15:59:52] wait, but seconds behind master is 79522, which is 22 hours
[16:00:10] so stopping replication is the right thing to do
[16:01:23] however, s1 tz says '2016-04-25T10:42:00.549080', which means it is behind by more than 24 hours
[16:02:11] ah, it could be the time db1052 was down
[16:02:16] so not an actual issue
[16:02:40] just that s1 is 24 hours behind ITS master, not THE master
[16:02:55] so both seconds behind master AND pt-heartbeat were ok
[16:03:08] the alert will go off soon, I will ack it
[16:03:26] this is like a really deep issue
[16:03:50] happily it needs no actionables, just wait
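
For context on the 15:57 line: a slave-delay tool such as pt-slave-delay (an assumption here, the log does not name the tool used on dbstore1001) keeps a slave behind by stopping and restarting its SQL thread, which matches the "delayed replication stops the s1 thread" behaviour described. A minimal sketch of that logic on the delayed slave; the 24-hour target is illustrative, not the actual dbstore1001 setting:

    -- Sketch of what a slave-delay tool does on the delayed slave; the
    -- 24h (86400 s) target is an assumption, not the real configuration.
    SHOW SLAVE STATUS\G          -- read Seconds_Behind_Master

    -- If the slave is less than 86400 s behind its immediate master,
    -- pause the SQL thread so it drifts back toward the target delay:
    STOP SLAVE SQL_THREAD;

    -- Once enough wall-clock time has passed, resume applying events:
    START SLAVE SQL_THREAD;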
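The resolution at 16:02 hinges on what each lag measure is relative to: Seconds_Behind_Master is computed against the slave's immediate master (here an intermediate master), while pt-heartbeat writes timestamped rows on the top-level master, so reading them on the slave gives end-to-end lag. A hedged sketch of both checks; the heartbeat.heartbeat database/table names and the server_id value are assumptions, not the actual WMF setup:

    -- On dbstore1001: lag relative to its *immediate* master only.
    SHOW SLAVE STATUS\G          -- Seconds_Behind_Master = 79522 here, i.e. ~22 hours

    -- Lag relative to the *top-level* master, via the row pt-heartbeat updates there.
    -- Database/table names and server_id are illustrative; whether ts is UTC
    -- depends on how pt-heartbeat was started.
    SELECT TIMESTAMPDIFF(SECOND, MAX(ts), UTC_TIMESTAMP()) AS lag_seconds
      FROM heartbeat.heartbeat
     WHERE server_id = 1234;

In practice pt-heartbeat's own --check/--monitor modes report the same figure, which is why both Seconds_Behind_Master and pt-heartbeat could legitimately look "ok" at the same time.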