[15:53:38] jynus: wanna sync here for db-only related stuff?
[15:54:04] let me take a break, if you are disconnecting early, feel free to add things to the etherpad
[15:54:16] basically we have icinga alarms for replication on the old eqiad masters and an x1 replication issue on eqiad
[15:54:17] but I intend to work until late and I need a break
[15:54:25] do you want me to ack them?
[15:54:33] with a ticket, yes
[15:54:36] ok
[15:54:48] I'll be around anyway until late I guess
[15:54:53] I suppose some kind of incompatibility with the older class
[15:55:08] db1001 and some others puzzle me
[15:55:28] x1 is probably the 10 -> 5.5 replication
[15:55:50] it is ROW -> STATEMENT
[15:55:53] too
[15:56:40] ah!
[15:56:45] easier then
[15:57:05] I said that at some point but I guess it was missed in the flood of alarms :)
[15:57:06] but we need to create a plan
[15:57:19] yeah, I didn't care much about eqiad
[15:57:23] at that point
[15:57:34] I was fighting puppet
[15:57:48] The old master alerts actually may not even be a real error
[15:58:11] only that they miss heartbeat, because they are still appointed as masters in the old class
[15:58:59] the old s* masters error is NRPE only, they are in sync
[15:59:21] let me look at puppet, I may be able to fix it
[16:14:27] so for the replica on s* the issue is a missing DNS, the one you wanted to create... so my fault :)
[16:15:04] ?
[16:15:08] the s1-master
[16:15:10] etc?
[16:15:21] not your fault, I skipped that step
[16:15:28] thinking it was not needed
[16:15:32] so /etc/db.cluster contains s1 and the code uses masterdom = '.<%= @mw_primary %>.wmnet' as the domain
[16:15:46] nah
[16:15:51] we should kill that
[16:15:53] not fix it
[16:16:14] I did not detect that on the latest failover
[16:16:29] because I failed over to a "new" class
[16:16:40] I can just ack, they will go away with the master failover and if there are real issues we have the alarms on all the slaves
[16:18:04] +1
[16:18:38] I would focus on x1 (in case an emergency failback was needed)
[16:19:02] but I really need to take a break, will be back soon
[16:20:45] ok ttyl
[17:04:54] ok, let's see
[17:05:13] I will start by changing tendril to get a better overview, as I said before
[17:06:09] ok, I was just about to clean some space on pc2005/pc2006, then we need to fix x1
[17:06:45] and take a look at labsdb1001, which is lagging, from before the switch tbh
[17:06:49] x1 should only be a binlog change?
[17:06:58] if you think it's safe, yes
[17:07:17] I will look at it later, unless you want to do it first
[17:07:26] the IO thread is running
[17:07:35] so there should be no data loss
[17:07:47] ok
[17:08:45] let's get rid of the alerts
[17:08:56] then we can start planning maintenance/reboots
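The x1 fix mentioned above ("x1 should only be a binlog change?") boils down to checking and flipping the binlog format on the relevant master. A minimal sketch of what that looks like, using only standard MariaDB variables; the decision to go with ROW comes later in the conversation, and the exact host is left out here on purpose:

    -- Check the current binlog format on the master in question
    SHOW GLOBAL VARIABLES LIKE 'binlog_format';

    -- Change it dynamically; existing sessions keep their old session value,
    -- and events already written to the binlog keep the old format
    SET GLOBAL binlog_format = 'ROW';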
[17:18:16] LOL at s1: https://tendril.wikimedia.org/tree
[17:18:56] lol
[17:18:58] and argh
[17:19:08] now it doesn't show designated masters...
[17:19:17] what?
[17:19:19] yeah, it is a library problem
[17:19:28] it only supports trees
[17:19:32] not general graphs
[17:19:42] so I cut it when it detects recursion
[17:20:00] things will look better after the eqiad failover
[17:20:07] within-eqiad, I mean
[17:21:09] I think this particular code can be reused, but we need a different frontend JS graphing library
[17:21:57] paravoid, no production issues, do not worry, I am changing monitoring to identify real issues better
[17:22:08] cool
[17:22:49] that mariadb slave lag page-spam was noisy, it'd be sweet if we didn't experience that on Thursday
[17:23:04] but false positives are obviously less important than real alerts :)
[17:23:56] false positives might hide real problems, better to get rid of them at least with a scheduled downtime if we don't have a proper fix by then
[17:24:04] we can check the replication from tendril
[17:24:09] as we did today
[17:24:11] yes, that definitely will not happen, but I do not know how to do it- that change has to be done transactionally
[17:24:23] among all servers
[17:24:29] and puppet is anything but transactional
[17:24:56] what is worse, mediawiki will switch to that method soon, so it would have been a real issue then
[17:25:26] I think the long-term solution to all issues is to store the master externally
[17:25:42] or
[17:25:44] jynus: it probably makes sense to build something around pt-heartbeat management that reads from etcd, for example
[17:25:55] and handles where to run it
[17:26:09] failover pt-heartbeat to regular checks
[17:26:41] but that again involves several nodes- a slave doesn't know the difference between "the master is not running pt-heartbeat"
[17:26:47] and "I am lagging"
[17:27:16] one solution could be to run 2 instances simultaneously, but I wonder if that could cause race conditions
[17:28:12] paravoid, in reality, most of the issues will be gone because we were not properly maintaining the old puppet classes (non-mariadb module)
[17:28:37] :D
[17:28:43] once that is gone (tomorrow), things will go smoother
[17:28:44] [FYI] for db1001 I've ack'ed the check, same issue as the old masters, but we're missing lag checks on slaves: https://phabricator.wikimedia.org/T133062
[17:29:10] volans, that is probably on purpose, because the slaves I think are unused
[17:29:27] unless there is a failover, in which case, the proxy would alert us
[17:30:10] ok, in that case feel free to close it
[17:33:35] for x1 we have different issues, on the master (db2009) I don't see pt-heartbeat running
[17:34:50] hence I guess the alarm on the codfw hosts too
[17:38:20] mm
[17:38:51] a puppet miss, or maybe a class misconfig
[17:39:23] I have changed the tendril hierarchy now
[17:39:58] it should reflect on https://dbtree.wikimedia.org/ too
[17:40:15] confirmed
[17:41:53] I will look at puppet now
[17:41:57] look at select * from heartbeat.heartbeat; on db2009
[17:42:03] there is a stale row from 2014
[17:42:09] what?
[17:42:18] no recent rows at all?
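A minimal sketch of the kind of inspection being discussed here, assuming pt-heartbeat's default heartbeat.heartbeat schema (a ts column and a server_id column) plus the shard column that comes up just below; the real column names and whether ts is stored in UTC may differ in production:

    -- List heartbeat rows and how old each one is, to spot stale entries
    -- left behind by previous masters (ts assumed to be UTC and castable
    -- to a datetime)
    SELECT server_id, shard, ts,
           TIMESTAMPDIFF(SECOND, ts, UTC_TIMESTAMP()) AS age_seconds
    FROM heartbeat.heartbeat
    ORDER BY ts DESC;

A row whose age keeps growing is exactly what makes a heartbeat-based lag check fire even when replication itself is in sync, which is the false-positive pattern described earlier for the old masters.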
[17:42:22] with shard NULL
[17:42:37] and one whose last update was 14:38:15 today
[17:42:49] from the old master
[17:42:57] ok, that makes more sense
[17:43:21] probably it's just that pt-heartbeat was not started
[17:44:08] because of the missing $master, I'll add it
[17:44:08] puppet was wrong, it is role::mariadb::core
[17:44:14] yes
[17:44:15] but we missed the master
[17:44:25] probably because all the others were coredb
[17:44:37] I'll send a CR
[17:44:49] if you take care of that, I will look at other things
[17:45:23] but aside from that, replication is really broken to eqiad, looking
[17:45:38] sure, are you looking at the eqiad master replica? it should just be the binlog format
[17:45:50] ok, will double-check
[17:46:11] because the error doesn't fit on icinga
[17:46:25] eheheh
[17:46:56] yeah, do you agree then with just changing it?
[17:47:05] There is a chance of things breaking
[17:47:12] but then it is too late
[17:47:20] because they have already been logged
[17:47:36] so I would change it, and reimage otherwise
[17:47:50] change to ROW
[17:47:50] ?
[17:48:12] yes
[17:48:18] agree?
[17:48:45] yes
[17:48:46] I mean, all the other replications are working 10 -> 5.5
[17:48:52] so the change would be small
[17:48:56] we cannot undo to STATEMENT on db2009
[17:48:58] *chance of breakage
[17:49:03] yep
[17:49:09] we could do it from now on
[17:49:20] ok, logging and doing it
[17:49:50] ok
[17:53:34] I'll merge https://gerrit.wikimedia.org/r/#/c/284243/1
[17:55:33] +1
[17:56:06] I'm running the compiler given it's a master now, just in case
[18:02:50] there are spikes of lag coming from the s1 master codfw -> eqiad, up to 20 seconds
[18:03:01] https://tendril.wikimedia.org/host/view/db1052.eqiad.wmnet/3306
[18:03:39] there were up to 15 seconds the other way around too
[18:06:18] can I delete the stale row in heartbeat.heartbeat on db2009?
[18:07:18] it doesn't hurt
[18:07:42] I like to keep those to keep track of the historical masters
[18:07:50] but I do not care much
[18:07:58] ok, I'll leave it there, no prob
[18:08:08] seems to work, it should recover all of them
[18:08:33] and probably db1029 has the same issue as the other 5.5s
[18:08:40] now
[18:08:49] yes, ack'ing
[18:09:37] so is that all the immediate problems?
[18:10:59] I guess so, no more DB alarms that are not either ack'ed because of a broken check, not a real issue, or already fixed, as far as I know
[18:11:06] only labsdb1001 is lagging a lot
[18:11:09] I'll take a look
[18:12:12] I stopped the import before the switchover, maybe something has it locked
[18:12:44] the Innodb_history_list_length is small, only s1 is lagging
[18:13:11] Waiting for table metadata lock
[18:13:14] checking
[18:14:03] yep
[18:14:34] yes, a deadlock between /*!40000 ALTER TABLE `categorylinks` ENABLE KEYS */ and something else, I'm checking innodb status
[18:15:10] it is just that
[18:15:15] it piles up selects
[18:15:29] on that table
[18:15:49] there is a DELETE FROM enwiki_p.page
[18:15:55] sorry
[18:16:03] DELETE FROM xxx SELECT FROM enwiki_p.page
[18:16:35] at this point I would kill long-running queries- it would be worse anyway
[18:18:26] I think it is that delete, the select is: SELECT 1 FROM enwiki_p.categorylinks WHERE cl_to = 'All_disambiguation_pages' AND cl_from = mo_id :(
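Identifying what to kill in a pile-up like this usually starts from the processlist; a minimal sketch with an illustrative time threshold — the real call here was made by reading SHOW ENGINE INNODB STATUS and the lock waits:

    -- Find threads stuck on the metadata lock and the long-running
    -- statements holding everything up
    SELECT id, user, time, state, LEFT(info, 100) AS query
    FROM information_schema.processlist
    WHERE state = 'Waiting for table metadata lock'
       OR time > 600
    ORDER BY time DESC;

    -- Then, for the offending thread id:
    -- KILL <id>;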
[18:19:12] can I kill this delete?
[18:19:55] do it
[18:20:35] I switched from TRUNCATE to DELETE to avoid metadata locking
[18:20:45] but I still need to disable ENABLE KEYS
[18:28:38] ok, it was a bit trickier but I unlocked it, opening a task
[18:28:46] for the user
[18:35:01] unless you disagree, we can plan the eqiad failovers, let me reopen your etherpad
[18:35:29] ok, to be done when? tonight or tomorrow with a fresher mind?
[18:35:46] I would plan it now
[18:35:58] then see how long it will take
[18:36:04] and decide
[18:36:23] ok
[18:36:25] I am unsure about some work, because it may need some reboots/upgrades
[18:37:00] there are hosts not on .22/.23, which are the latest versions with TLS support
[18:37:25] so they need more work than just a restart
[18:38:10] yes
[18:38:29] I cannot find it now, can you give me a link privately
[18:43:54] your patch needs a manual merge, with all the changes we have done
[18:44:05] yes, I need to merge it locally and re-send it
[18:44:17] let's focus on the plan
[18:44:17] I'll do it after dinner if that's ok
[18:44:21] ok,
[18:44:23] no problem
[18:44:29] just 5 minutes later
[18:44:38] what time more or less?
[18:44:49] ???
[18:45:11] have dinner, I mean
[18:45:21] and see you in 1 hour? 2?
[18:45:33] and in 5 minutes we discuss what is pending
[18:45:57] ah ok :) sure I'll go for dinner in ~5 and be back in less than 1h
[18:46:05] no rush
[18:46:12] see you later!
[18:47:06] <_joe_> volans: I should've invited you over here today
[18:47:17] <_joe_> hot meals were being served at my desk :P
[18:47:22] rotfl
[18:47:39] I got a snack in the afternoon at my desk :)
[19:48:17] * volans back
[19:49:41] could we have avoided the connection pileup https://tendril.wikimedia.org/host/view/es2018.codfw.wmnet/3306 with an extra server?
[19:50:03] or maybe some warmup?
[19:50:42] dunno, because when I connected I saw a thousand connections, but no stale queries IIRC
[19:51:00] so if it's related to the empty memcached
[19:51:23] it could be an order of magnitude greater than what the es* could handle
[19:51:24] but only 40K requests failed on the 2 servers
[19:51:46] where did we get this data from?
[19:51:54] ^link
[19:52:29] we happened to not repool es2019 in time
[19:53:24] anyway
[19:53:59] it could be that it would have been enough
[19:56:08] do we divide the shards and work on them independently?
[19:56:26] on whatever schedule we want?
[19:56:40] you know, for me if the logic of my script is correct, es2019 is ok. Because either we trust it or we need to reimport it at the end :)
[19:57:07] let's repool it, it is too late anyway
[19:57:14] but that can wait now
[19:57:34] sure, makes sense, let's agree on the procedure and then we can work on them independently; the only missing part in my etherpad for the repl.pl part is dbstore1001
[19:57:50] (shards)
[19:58:14] dbstore can hang in a third tier until we work on it manually
[19:58:44] I had to check the binlog position manually last time
[19:59:19] ok
[19:59:20] the problem is those hosts that are outdated
[19:59:49] in some cases, they are api servers that I purposely left outdated because they have some special query rewrite
[20:00:32] I suppose it is time to either compile with that or fix the app properly
[20:00:47] so s1 has 4 outdated servers
[20:00:57] actually more, 6
[20:01:18] the s2 ones are all up to date
[20:01:46] 1 for s3
[20:01:58] 5 in s4
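A quick way to classify a host as "outdated" in the sense used above — a build too old to have TLS compiled in — is to look at the server version and the SSL capability variable. A minimal sketch using only standard statements; on builds older than the .22 packages discussed here, have_ssl is expected to report something other than YES:

    -- Version and TLS availability on a given host
    SELECT VERSION();
    SHOW GLOBAL VARIABLES LIKE 'have_ssl';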
[20:02:02] outdated is < .22?
[20:02:08] yep
[20:02:17] SSL has been compiled in since .22
[20:02:47] so the issue is not how old they are, but that SSL will not work there
[20:02:58] no need to update them all now
[20:03:16] but it is an issue if they need to replicate with TLS, etc
[20:03:51] ok
[20:04:01] the thing is that it will be easier to restart and upgrade them now than in 2 days
[20:04:19] so if we have the time to do it, better now
[20:04:23] I would focus on the designated masters, we need those no matter what; for slaves, the more the better of course, and it's easier now
[20:04:28] yes
[20:04:48] I suppose those have already been upgraded?
[20:05:05] when reconfigured
[20:05:41] the issue is that replication to those will not work if TLS is tried; either they have to use plain text or be upgraded
[20:06:04] and the repl.pl script will fail, unless that is taken into account
[20:06:19] I think it has TLS hardcoded, I cannot remember
[20:06:32] so how do we divide those?
[20:06:55] do you have the etherpad link handy?
[20:07:37] repl.pl has worked so far so I don't think it will be an issue; if they replicate with TLS they will use TLS for the new one too, if plain text they stay with plain text, my guess
[20:08:36] for the upgrade, they are surely on .22
[20:09:06] not necessarily on .23, also because usually you need to upgrade all slaves first as a best practice
[20:09:22] I am not sure of that, but anyway, they will fail and it just needs to have the USE_SSL=! removed
[20:09:35] we do not care about that
[20:09:53] ok
[20:09:54] the reason we do not upgrade to .23 is because I did not bother to compile it for trusty
[20:10:03] only for jessie
[20:10:11] also that, I forgot, true
[20:10:31] btw where is the source code? which server?
[20:10:42] ?
[20:10:48] of our package
[20:10:49] ah, for the server
[20:10:59] it is the mariadb tarball
[20:11:27] I will document it soon
[20:11:39] I mean the source code we build the package from
[20:11:43] ok, no hurry
[20:11:51] yes, mariadb.org, no patches
[20:12:14] ok
[20:13:17] so, it needs a restart to get the new certs, right?
[20:13:23] the other servers, I mean
[20:13:52] yes
[20:14:00] that is why I said it
[20:14:06] if we are going to restart
[20:14:13] better to upgrade at the same time
[20:14:18] the only ones restarted were the designated master in eqiad and the master in codfw
[20:14:21] yes, agreed
[20:14:37] we can do it massively because it is depooled
[20:14:44] so it is not like usual, one at a time
[20:14:54] we can disable alerts and do several almost at the same time
[20:15:00] so easier
[20:15:11] s2 is the only exception, the eqiad master was not restarted of course, hence we need to do a CHANGE MASTER TO after the restart to use the new certs
[20:15:19] yep
[20:15:24] so, steps
[20:15:32] we need to merge my change first, I'm doing the merge now
[20:15:43] stop (make sure you set innodb_buffer_pool_dump_now=1)
[20:16:20] it is in my.cnf on all of them, but I am not 100% sure it has been modified on all of them in a hot way
[20:16:32] apply puppet (or we can do that now)
[20:16:36] upgrade
[20:16:40] mysql_upgrade
[20:16:42] start
[20:16:53] topology change
[20:17:09] we can also apply ferm at the same time
[20:17:16] there are patches pending
[20:17:28] but no rush while it is depooled
[20:17:44] am I missing something?
[20:17:45] I'll add a stop slave before the stop and start with --skip-slave-start
[20:17:53] and a start slave of course :)
[20:17:59] yes
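Put together, the SQL side of that per-host sequence looks roughly like the sketch below; the package upgrade, the puppet run and mysql_upgrade happen outside the client, so they only appear as comments:

    -- Per-host restart/upgrade, SQL portion only (a sketch)
    STOP SLAVE;
    -- Dump the buffer pool so the server does not come back completely cold
    SET GLOBAL innodb_buffer_pool_dump_now = 1;
    -- ... shut down mysqld, apply puppet, upgrade the package,
    --     start with --skip-slave-start, run mysql_upgrade ...
    START SLAVE;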
[20:18:05] also, the topology change
[20:18:12] has to happen when replication is up to date
[20:18:13] before or after the upgrade?
[20:18:27] will it be easier before starting?
[20:18:31] mmm
[20:18:35] before starting the process
[20:18:38] I doubt it will work
[20:18:45] because of the old certs
[20:18:56] so I would do it later
[20:19:09] even more so for those servers that do not have SSL support
[20:19:14] (before the update)
[20:20:29] so wait for sync and then the master change, but you can disagree, I may be confusing myself
[20:20:52] but if we don't change the topology we'll be replicating from the old master after the upgrade, so if we have issues we'll have them anyway I guess
[20:21:03] or are you worried that repl.pl will fail?
[20:21:25] ah
[20:21:34] you want to move the old one first
[20:21:43] sorry, the new one first
[20:21:55] no issue with that
[20:22:03] I knew I missed something
[20:22:07] it was that
[20:22:11] right?
[20:22:17] the new one?
[20:22:22] I'm not sure I'm following
[20:22:38] so the designated one is not replicating directly from codfw
[20:22:47] and you were right, you added SSL=1 on the --switch-sibling-to-child
[20:22:50] we have to move it as a slave
[20:23:13] no, but we have to do that last
[20:23:40] because we do not have an uncle-to-child
[20:23:45] can we do that with repl.pl or do we need to do it manually?
[20:24:03] what part?
[20:24:17] moving the designated master under the codfw master
[20:24:34] like db1057 on s1, from being a slave of db1052 to being a slave of db2016
[20:24:36] we have to do that as the last action
[20:24:37] I think
[20:24:53] first move the siblings, then the designated
[20:25:06] so 3 tiers temporarily
[20:25:09] agreed, just wondering if we can do the last step too with repl.pl
[20:25:21] yes, why not?
[20:25:34] what would be the issue?
[20:25:48] probably none :)
[20:26:02] there is even a 3rd step
[20:26:16] which is moving the current master as a child
[20:26:36] so, to summarize
[20:26:41] yes, if not reimaging it
[20:26:52] write it in the etherpad
[20:26:57] yeah, but we have to move it anyway
[20:27:05] I would try to upgrade it
[20:27:13] because it has been the master for so long
[20:27:35] we may want to use it to compare with the other hosts
[20:27:52] for example, the old s2 master is 24, which is now the designated master
[20:29:21] so, summarizing it
[20:29:40] let's assume we have only db1052
[20:29:47] db1047
[20:29:57] not that one, that is multi-source
[20:30:04] db1051
[20:30:09] and db1057, which is the designated master
[20:30:14] and db1057
[20:30:17] yes
[20:30:33] and db2016 :)
[20:30:34] so we move db1051 to be a child of db1057
[20:30:50] then we move db1057 to be a child of db2016
[20:31:06] then we move db1052 to be a child of db1057
[20:31:16] (and we will reimage it or whatever)
[20:31:20] this last one could fail because of the certs
[20:31:35] ok
[20:31:38] no issue
[20:31:50] we will see at that last step what we do
[20:32:00] I mean it can be adjusted with CHANGE MASTER TO
[20:32:17] remember we can restart any server at any time
[20:32:22] we can stop them, it is not like before
[20:32:51] when we could not touch the master
[20:32:51] the ideas are clear, yes?
[20:33:01] yep
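If repl.pl cannot handle one of these moves (for instance because of the certificate mismatch just mentioned), the fallback is the CHANGE MASTER TO referred to above. A sketch only: the log file and position below are placeholders, not real values, and would have to be read from the replica being moved before it is re-pointed:

    -- Manual re-point of a replica under the codfw master (placeholder coordinates)
    STOP SLAVE;
    CHANGE MASTER TO
      MASTER_HOST = 'db2016.codfw.wmnet',
      MASTER_LOG_FILE = 'db2016-bin.000001',
      MASTER_LOG_POS = 4,
      MASTER_SSL = 1;
    START SLAVE;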
[20:33:16] so which ones do you want to do?
[20:33:43] there are 7
[20:33:57] it's the same
[20:34:00] although s1 has more outdated servers, and more servers overall
[20:34:14] s1 counts as two :)
[20:34:28] ok, so s1-s3 and s4-s7
[20:34:33] choose a partition
[20:35:20] and when do you think we can have it done?
[20:35:48] probably by tomorrow morning?
[20:36:17] I wanted to go to sleep early today
[20:36:38] I mean starting tomorrow morning, done by lunch
[20:36:42] you need to sleep tonight
[20:36:44] yes
[20:37:01] ok, by tomorrow evening more or less
[20:37:32] if we don't encounter blockers yes, I guess so, it doesn't seem too long to get done
[20:37:45] if everything goes according to plan, we try x1, es* and pc
[20:37:54] ok
[20:38:08] but those are lower priority
[20:38:36] also focus on the largest servers
[20:38:56] another thing
[20:39:08] the smallest ones may be with us only for a few months more
[20:39:47] we had some overheating issues, and the idea was to get chris to apply the thermal paste tomorrow afternoon, what do you think?
[20:39:58] if you didn't see the task I'll give you the link
[20:40:09] ok, which servers were those? 6-something, right?
[20:40:18] 3: https://phabricator.wikimedia.org/T132515
[20:40:39] mmm, 7s
[20:40:59] those should match the maintenance time
[20:41:04] yes
[20:41:19] to avoid restarting them twice
[20:41:29] also the pending ferm patches
[20:41:45] and if we have lots of time (which we don't)
[20:41:58] we can even reimage some servers to jessie
[20:42:07] ferm is just adding include base::firewall?
[20:42:08] but I do not think that is feasible
[20:42:12] yes
[20:42:21] but there are patches already done
[20:42:31] the issue is that when applied
[20:42:43] the network goes down for 15 seconds
[20:42:55] so I froze that until the failover
[20:43:26] I do not know if you knew that
[20:43:33] nope :)
[20:43:49] that is why it wasn't done before
[20:44:13] ok, also rotate the error log if it's big, I've added a note in the etherpad
[20:44:36] that is https://phabricator.wikimedia.org/T120122
[20:44:50] you can add the error log there if you want
[20:45:04] jessie is not a hard blocker
[20:45:18] we cannot install 100 servers in 40 hours
[20:45:32] and some will not be worth it because they will be decommissioned first
[20:45:32] eheheh no
[20:45:54] but remember that it may take months before the new ones are production-ready
[20:46:10] that is why we shouldn't count on them yet
[20:46:41] and of course, the ROW is a no-go until the labs blocker is resolved
[20:46:48] and other blockers
[20:46:57] well, I think that is all
[20:47:16] just log when you start working on a shard (I will be doing the same)
[20:47:26] and that way we can work at the same time
[20:47:31] ok?
[20:47:39] ok, no prob, we need to merge puppet first though
[20:47:57] I'll put the steps we agreed on above in the etherpad
[20:48:29] I am saying this because I will be leaving soon, and will try to wake up early tomorrow
[20:48:41] do you want me to merge tonight?
[20:49:11] as you wish- you can do what you want- but do not create alerts
[20:49:26] we talked precisely so we can work at any time
[20:49:41] if you haven't merged it when I wake up, I will do it
[20:49:50] ok
[20:50:07] let me recheck the patch
[20:50:30] it is https://gerrit.wikimedia.org/r/#/c/283771/
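Once that patch is in and a host has been restarted with the new certificates, one quick sanity check that replication is actually negotiating TLS is to look at the SSL-related fields of the slave status; a sketch using only standard output:

    -- On a restarted replica, check that the replication channel uses TLS
    SHOW SLAVE STATUS\G
    -- Relevant fields in the output: Master_SSL_Allowed, Master_SSL_Cert,
    -- Master_SSL_CA_File and Master_SSL_Cipher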
[20:50:53] but that is only for the masters, right?
[20:51:21] if we reboot the server, we just add the same ssl parameter, right?
[20:51:49] knowing all the issues (double cert, etc)
[20:51:52] I've added the puppet-cert parameter to all core DBs in that patch
[20:52:10] oh, I see it now
[20:52:10] so that when we restart mysql it's with the new one
[20:52:11] sorry
[20:52:27] the regex + diff format confused me
[20:52:35] no problem
[20:52:38] thank you
[20:53:18] I'm re-checking it too and will probably run the compiler, to be sure
[20:53:30] I'll add the link to the results in the CR
[20:53:39] I think I'll retire soon, and will do my part tomorrow early morning
[20:53:57] if you need to communicate something, either log it or just paste it here
[20:54:05] me too, too tired to do delicate stuff now
[20:54:24] I just wanted to organize the pending maintenance
[20:54:28] and we just did
[20:54:33] it took a bit more than 5 minutes
[20:54:43] but it wasn't precisely obvious
[20:55:10] nope
[20:55:10] things will get better when all servers are configured and there are no servers with deprecated classes
[20:55:18] :-)
[20:55:35] it was even worse before!
[23:13:57] j.y.n.u.s: 1) https://gerrit.wikimedia.org/r/#/c/283771/ not merged, puppet compiler result linked in the CR
[23:14:07] 2) etherpad updated with all steps
[23:14:22] 3) FYI https://phabricator.wikimedia.org/T133122