[11:46:56] hey again jynus! :) One more question for you regarding https://phabricator.wikimedia.org/T130067: once it is started, how long will it take? i.e., from starting the first wiki to finishing the final wiki? :)
[11:52:14] it is difficult to say without testing or seeing how large the watchlist table is
[11:52:55] okay! :)
[11:53:11] the logging and user tables took 5 days to reconstruct per server (there are 150 of those, but some can be done in parallel)
[11:53:51] but we can test them on codfw, assuming the code is already compatible
[11:54:23] I will have to reimport the watchlist table on labs soon; I can tell you more precisely after that
[11:55:19] okay! awesome!
[11:58:08] 144 million watchlist rows vs. 74 million logging and 28 million user rows on enwiki
[11:59:54] so depending on the situation, think more than a week but less than a few months
[12:03:42] okay!
[16:30:44] jynus: I'm ready to merge https://gerrit.wikimedia.org/r/#/c/288420/ and restart mysql on db2017 if it doesn't conflict with other stuff you're doing
[16:31:44] wait a sec
[16:31:55] I am applying a schema change to s2
[16:32:10] let me see the state (whether db2017 has been done already)
[16:32:16] ok
[16:32:49] yes, it has been done already; just downtime all replication-related stuff
[16:33:01] on s2-codfw-slaves
[16:33:03] ofc
[16:33:36] also, the s2-master heartbeat was restarted earlier
[16:34:39] will pt-heartbeat on db2017 need a restart after the mysql restart?
[16:34:54] no, puppet will do it
[16:35:03] I am writing about it
[16:35:12] if you can wait 1 sec
[16:36:04] sure :-)
[16:36:52] volans, https://phabricator.wikimedia.org/T133339#2305719
[16:37:00] ^I want to resolve that ticket
[16:37:54] the key part is "assuming we do not bring down both masters at the same time"; we do not have to do anything anymore
[16:38:13] puppet will just take care of it
[16:38:43] yes, just replace "every 30 seconds" with "every 30 minutes" ;)
[16:38:52] ja
[16:39:31] so, you more or less understood the system and agree to close that?
[16:40:00] now, just restart the master and let puppet do its thing
[16:40:05] just one last thing, but I guess it depends on the MW side of the check...
[16:40:21] mw is still not checking pt-heartbeat
[16:40:28] when you say "going to read-only", does that mean that if both pt-heartbeat processes fail, MW will assume RO?
[16:40:35] yes
[16:41:27] in my scripts I fall back to SHOW SLAVE STATUS
[16:41:27] in this case it is probably worth having a nagios check for 1 pt-heartbeat process on the masters, no page (IRC and, in the future, email)
[16:41:34] no need
[16:41:44] if that happens, we will get an alert anyway
[16:41:51] for the lag, true
[16:42:03] actually... will we get it?
[16:42:10] the other pt-hb will still work
[16:42:14] and millions of people will complain and we will get a call
[16:42:24] no, I'm saying if one fails
[16:42:38] if one fails, the check will work
[16:42:49] it uses max(timestamp)
[16:42:59] where shard='x'
[16:43:17] it may get a slightly higher one
[16:43:33] but I checked and it is like 0.001 seconds higher
[16:43:37] exactly, and we will not be aware that one failed; let's assume puppet restarts it in 30 minutes, but it fails again... we will be running with only one pt-hb without noticing
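For reference, a minimal sketch of the lag check discussed above: because each datacenter runs its own pt-heartbeat writer, taking MAX(ts) per shard keeps the check working as long as at least one writer is alive (at most it reads about 0.001s fresher, as noted). The heartbeat.heartbeat table name, the ts format, the shard column, and the connection details are assumptions for illustration, not the production check.

```python
#!/usr/bin/env python3
"""Illustrative lag check: MAX(ts) per shard over the pt-heartbeat rows.

Assumed (not taken from the chat): table heartbeat.heartbeat, a ts column
holding an ISO-8601 UTC timestamp with microseconds (pt-heartbeat's default
format), and a Wikimedia-style shard column. Host/user are placeholders.
"""
import datetime

import pymysql  # pip install pymysql


def heartbeat_lag(host, shard, user='watchdog', password=''):
    """Return replication lag in seconds for `shard` as seen on `host`."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           database='heartbeat')
    try:
        with conn.cursor() as cur:
            # MAX(ts): if one of the two writers dies, the other still
            # advances the maximum, so no false lag alert is raised.
            cur.execute("SELECT MAX(ts) FROM heartbeat WHERE shard = %s",
                        (shard,))
            (ts,) = cur.fetchone()
    finally:
        conn.close()
    if ts is None:
        raise RuntimeError('no heartbeat row for shard %s' % shard)
    written = datetime.datetime.strptime(ts, '%Y-%m-%dT%H:%M:%S.%f')
    return (datetime.datetime.utcnow() - written).total_seconds()


if __name__ == '__main__':
    print(heartbeat_lag('db2017', 's2'))  # placeholder host from the chat
```

Checking MAX(ts) trades a tiny optimistic bias for tolerance of a single writer failure, which is the behaviour described in the conversation above.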
[16:43:39] (under normal circumstances)
[16:43:50] well
[16:43:58] then puppet will restart it in 30 minutes
[16:44:06] or puppet will fail
[16:44:20] if puppet fails, we get an alert
[16:44:57] sure, I was thinking of the case where the exit code of pt-hb is 0 because it daemonized correctly, and then it fails because of, I don't know, a timeout or a permission issue, whatever
[16:45:01] too paranoid? :)
[16:45:08] it can happen
[16:45:25] but then it falls under "If an operational error causes it (e.g. a bad schema change or permissions), things will fail anyway."
[16:45:33] which is the millions-of-angry-people case
[16:45:49] to be fair, the mediawiki check is not live
[16:46:03] it will be tested for a long time to try to identify those kinds of issues
[16:46:11] only on one shard
[16:46:55] we can add a check to "pt", but I think it is overhead to add a check for a check
[16:47:11] which is already used for generating alerts
[16:47:34] lol, that's true, it starts to be a bit convoluted
[16:48:22] the question is: can it fail? yes. Can we fix it perfectly? probably not. But I am way more confident now (2 independent hosts in 2 datacenters) than before
[16:48:46] and operational errors will always happen
[16:49:03] sure, agree
[16:49:03] the right fix, in my opinion, would be for mediawiki to fail over to SHOW SLAVE STATUS
[16:49:34] not to make pt-heartbeat more than 99.99% available
[16:50:09] fully agree, the failover is the best way to go
[16:50:40] on another note, pt-heartbeat was running unpuppetized for 2 years
[16:50:50] in a screen, yeah
[16:51:00] so I am more confident in it than in pt-kill, for example
[16:52:04] and I am not saying to stop doing things (orchestration), but to close that specific issue (horrible failover steps)
[16:52:24] now, with that scope (failover), there is nothing to do with mysql
[16:52:51] except the read-only part
[16:53:24] and no need to kill it or start it in an ugly way
[16:53:45] sure, it can be closed
[16:54:16] fwiw I've noticed that on s2 the 2 pt-hb processes are writing at the same time; they are not "randomly" distributed within the 1s interval
[16:54:22] yes
[16:54:25] I tried that
[16:54:27] one at 0
[16:54:32] and another at 0.5
[16:54:41] they are aligned by the tool
[16:54:53] but there is no offset parameter
[16:54:57] I thought it was the tool; it could not be puppet
[16:55:04] no
[16:55:09] it is pt-heartbeat
[16:55:15] unless we patch it :D
[16:55:23] (remember that I worked with the guy who created it)
[16:55:40] it was done on purpose, to be used with the skew parameter
[16:56:05] so that you knew deterministically when it was going to run
[16:56:22] we can patch the check to actually use it, so we have a better idea of the lag
[16:56:42] we can also patch pt-heartbeat, it should not be too difficult
[16:56:59] or more specifically, pt-heartbeat-wikimedia
[16:57:11] but I considered it low priority
[16:57:16] super low
[16:57:18] I created the table as MyISAM
[16:57:25] so no performance impact
[16:58:09] I also changed the binlog_format to STATEMENT
[16:58:24] sadly that will not work if we migrate to ROW for slaves
[16:58:50] changed where?
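As an aside, the "fail over to SHOW SLAVE STATUS" idea above could look roughly like the sketch below: read the heartbeat table first and, if that fails for any reason (missing table, permissions, timeout), fall back to Seconds_Behind_Master. This is not MediaWiki's actual check (which, per the discussion, is not live yet); the table layout and SQL are assumptions for illustration.

```python
#!/usr/bin/env python3
"""Illustrative lag lookup with fallback: pt-heartbeat first, then
SHOW SLAVE STATUS. Table/column names are assumptions, not MediaWiki code."""
import pymysql
import pymysql.cursors


def replica_lag_seconds(conn, shard):
    """Best-effort replication lag for one replica connection, in seconds."""
    # Preferred source: the pt-heartbeat rows, MAX(ts) over both writers.
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT TIMESTAMPDIFF(MICROSECOND, MAX(ts), UTC_TIMESTAMP(6)) / 1e6"
                " FROM heartbeat.heartbeat WHERE shard = %s", (shard,))
            (lag,) = cur.fetchone()
        if lag is not None:
            return float(lag)
    except pymysql.MySQLError:
        pass  # bad schema change, permissions, timeout...: use the fallback
    # Fallback: ask the replica itself.
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
    if not status or status["Seconds_Behind_Master"] is None:
        return None  # lag unknown: the caller can decide to go read-only
    return float(status["Seconds_Behind_Master"])
```

With a fallback along these lines, losing both pt-heartbeat processes would degrade the precision of the lag estimate instead of forcing the application into read-only mode, which is the point made in the conversation.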
[16:58:54] (but for now it avoids having to synchronize the tables if we do ROW for the masters)
[16:59:15] volans, https://gerrit.wikimedia.org/r/#/c/289177/
[16:59:30] forcing a REPLACE
[16:59:53] so that if a server crashes, we do not need to replicate the whole chain of updates
[17:00:09] just the last one
[17:00:32] ok
[17:00:44] or if it crashes and the MyISAM table gets corrupted
[17:07:15] volans, I will be gone soon
[17:07:45] ok, I'll take care of my change
[17:07:48] FYI, things ongoing: labs imports (stopping/lag on s1 labs and dbstore1002)
[17:07:59] yes, I saw it
[17:08:00] and the schema change on s2
[17:08:30] ack
[17:08:30] if for any reason there is any issue, they are running in root screens on neodymium
[17:09:00] the schema change is fully online, but we sometimes get a spike in metadata locks
[17:09:22] that should be transient
[17:09:24] (do not worry too much about small spikes)
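To make the REPLACE reasoning above concrete, here is a toy heartbeat writer. It is not pt-heartbeat-wikimedia or the change linked above, and the table layout (one row per shard/server with a unique key on those columns) is an assumption for illustration. Because REPLACE deletes and re-inserts the row, a replica, or a rebuilt, previously corrupted MyISAM table, only needs the most recent statement to end up with a correct row, instead of replaying the whole chain of UPDATEs.

```python
#!/usr/bin/env python3
"""Toy REPLACE-based heartbeat writer (illustration only, not
pt-heartbeat-wikimedia). Assumes heartbeat.heartbeat has columns
(ts, shard, server_id) with a unique key on (shard, server_id)."""
import datetime
import time

import pymysql  # pip install pymysql


def beat_forever(conn, shard, server_id, interval=1.0):
    """Rewrite this server's heartbeat row every `interval` seconds."""
    while True:
        now = datetime.datetime.utcnow().isoformat()
        with conn.cursor() as cur:
            # REPLACE acts as DELETE+INSERT on the unique key, so each
            # binlogged statement is self-contained: replaying just the
            # last one recreates the row even if it was missing.
            cur.execute(
                "REPLACE INTO heartbeat.heartbeat (ts, shard, server_id)"
                " VALUES (%s, %s, %s)", (now, shard, server_id))
        conn.commit()
        time.sleep(interval)
```

Under STATEMENT binlog format the REPLACE itself is what gets replicated, which is why it avoids synchronizing the tables; as noted in the discussion, that does not carry over if the slaves move to ROW.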