[07:29:59] morning
[07:30:07] morning
[07:44:23] anything new/urgent you want me to take care of?
[07:47:16] a few things: reimport to labs in progress (don't worry about lag, slave SQL is stopped on dbstore1002/sanitarium)
[07:47:23] db1058 doesn't boot up
[07:47:54] 1) ok, 2) I saw the email, anything I can do there?
[07:48:09] which also means dbstore1001:s5 is stopped
[07:48:50] of course
[07:48:51] main ongoing thing is my comment at :27:59
[07:49:07] which I do not want to comment on publicly yet
[07:50:53] ok, checking backlog
[09:25:08] I've also installed sys on all p_s-enabled hosts and we can now monitor the average query time on each host
[09:25:24] great
[09:25:29] is it on puppet too?
[09:26:11] no, technically sys is "data"
[09:26:37] I was planning on adding it to the package, definitely not to puppet
[09:27:51] ok
[09:30:17] the code is on https://github.com/jynus/mysql-sys
[09:31:38] and I was expecting them to do it for me :-) https://jira.mariadb.org/browse/MDEV-9077
[09:33:11] :-)
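For reference, a minimal sketch of the kind of per-host latency check that the sys install mentioned at 09:25 enables, assuming the standard upstream sys views (sys.x$statement_analysis and friends) are present; this is a generic sys query, not a WMF-specific report:

    -- Top statements by average latency; the x$ view returns raw
    -- picoseconds, so the ORDER BY sorts numerically.
    SELECT query, db, exec_count, avg_latency
    FROM sys.x$statement_analysis
    ORDER BY avg_latency DESC
    LIMIT 10;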
[09:52:11] can I manually change those parameters on enwiki? I need to clear log noise for the db1023 change
[09:52:51] sure, go ahead
[09:53:13] that's what I proposed earlier :)
[09:54:20] oh, I understood it as puppetizing that, which was what I wasn't sure about without proper testing
[09:54:42] but of course I agreed on doing it live
[09:55:21] then I misunderstood, sorry
[09:56:12] no, totally my fault
[10:00:03] I've been monitoring show status like %thread% on db1041 and db1057 and didn't find any difference in how the counters change
[10:00:25] I've also tried to repro locally without luck so far, looking at the implementation now
[10:01:21] I think it has to do with long-running processes
[10:01:40] it started on s4 when we had more of those
[10:01:54] process == long-lived connections?
[10:02:07] even without queries I mean
[10:02:13] "think" is too strong
[10:02:18] suspect
[10:02:19] have fun with this comment: https://github.com/MariaDB/server/blob/2783fc7d14bc8ad16acfeb509d3b19615023f47a/sql/threadpool_unix.cc#L830
[10:04:18] my "scientific" approach would be: set slightly different values on several servers, observe the error rate
[10:04:53] if you think it is not worth it, I would be ok with it, just puppetize the current value
[10:05:31] what I would like to do <> what I can actually do with my time
[10:06:06] that works on 3 different instances, it is not like we tried only once
[10:06:52] I agree, like you I'm unsure too about thread_pool_max_threads in case of a connection peak, I was trying this scenario locally
[10:07:56] a comment on puppet would solve the issue - link to the db1041 (?) initial issue and say they may need review in the future
[10:08:05] ok :)
[10:08:07] apply to all
[10:08:11] let's move on
[10:08:16] for T134476 it is probably <= db1050 ;)
[10:08:17] T134476: Decomission old coredb machines (>=db1050) - https://phabricator.wikimedia.org/T134476
[10:08:28] oops
[10:08:29] lol
[10:29:41] labsdb1004 has crashed
[10:29:57] and deleted its replication filters in the process
[10:34:14] /opt/wmf-mariadb10/bin/mysqld: Normal shutdown, volans?
[10:34:39] checking
[10:35:20] no need to check, I assume then it wasn't you
[10:35:56] nope
[10:36:34] akosiaris, did you?
[10:37:04] jynus: I rebooted the box indeed
[10:37:10] but crash ? no
[10:37:10] arg
[10:37:22] I should not have, I assume from your reaction
[10:37:24] rebooting it without stopping mysql first is like crashing it
[10:37:40] it only took me 6 months to set that up, no big worries
[10:38:07] er, stopping mysql is part of the shutdown process since time immemorial
[10:38:25] actually stopping all services is part of the shutdown process
[10:38:35] so not sure where I am at fault here
[10:38:36] yes, but if it times out, it kills it
[10:38:51] yes, why would it time out then ?
[10:38:53] uncleanly
[10:39:12] stop slave taking too much time because of horrible labs queries
[10:39:21] *tools
[10:39:40] and actually that is not true for our package IIRC, we don't make it start at boot and I guess not stop either, let me check
[10:39:40] I have no problem, but where is the log for the restart?
[10:40:01] ah, indeed I never logged it
[10:41:10] find /etc/rc?.d -iname "*mysql*" <--- empty result
[10:41:24] no issue normally in production, but tools labs has horrible long-running queries
[10:42:05] I thought that box was a slave box
[10:42:17] I would not expect it to take that long to do a stop lave
[10:42:20] slave*
[10:42:30] yes, that is the problem - it has to replicate those horribly long queries
[10:42:49] and since it is also not production, users use the unsafe MyISAM
[10:43:01] which is not transactional
[10:43:19] and on an unclean restart, it gets desynced
[10:43:31] sigh
[10:43:39] ok, what should I do to fix the problem ?
[10:44:36] nothing, I now have to slowly reimport the broken dbs
[10:44:52] you are joking, right ?
[10:44:57] no
[10:45:06] a simple reboot and we have to reimport the dataset ?
[10:45:13] welcome to labs
[10:45:14] ok that is not maintainable
[10:45:29] and MyISAM
[10:46:27] Could not execute Delete_rows_v1 event on table s51412__data.dewiki_templatedata; Can't find record in 'dewiki_templatedata', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log log.091755, end_log_pos 45319726
[10:46:39] oh ffs
[10:47:17] no, that is normal
[10:47:29] for some reason, filters got deleted
[10:47:49] some dbs are banned from replication because we cannot take the load
[10:48:43] this is ridiculous
[10:48:47] https://phabricator.wikimedia.org/T127164
[10:49:04] we've had to reboot the entire fleet in January 2015
[10:49:24] imagine if we had to do that today
[10:49:30] well, we cannot
[10:49:47] that single box would take more time to fix than the rest of the fleet combined
[10:49:58] no matter how much paravoid and you want it, we cannot
[10:50:05] there is no "we cannot". if we get another GHOST we will do it
[10:50:30] ok
[10:50:31] getting pwned is not exactly an option
[10:55:11] akosiaris: if I may add my 2 cents... it has to be something that can be exploited somehow from the outside on an internal service like mysql to required that
[10:55:53] s/required/require/
[10:56:13] volans: granted, but orgs get pwned incrementally. One simple thing at a time
[10:56:20] I do not disagree at all with wanting that
[11:00:00] it's the practice of a fully stateful service that is a bit more complex: when restarting mysql you have to wait for at least 25% of the buffer pool to be reloaded into memory before putting it back into production, better if ~50%
[11:00:11] on hosts with larger RAM it takes hours
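As a side note on the warm-up point at 11:00, a minimal sketch of how to estimate how much of the buffer pool has been reloaded after a restart, using the stock InnoDB status counters (the 25%/50% thresholds above are operational rules of thumb, not server settings):

    -- Percentage of buffer pool pages currently holding data;
    -- GLOBAL_STATUS lives in information_schema on MariaDB 10 / MySQL 5.6.
    SELECT ROUND(100 * d.variable_value / t.variable_value, 1) AS buffer_pool_pct_full
    FROM information_schema.global_status AS d
    JOIN information_schema.global_status AS t
      ON d.variable_name = 'Innodb_buffer_pool_pages_data'
     AND t.variable_name = 'Innodb_buffer_pool_pages_total';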
[11:01:58] what seems to be the issue?
[11:02:54] no issue, some db replication issue that happens all the time, and it is not user facing
[11:03:05] will fix it ASAP
[11:03:11] what is it that I want and cannot happen?
[11:03:28] reboot any db host at any time
[11:03:39] that currently requires care
[11:03:40] I don't think I have ever been that unreasonable
[11:04:36] emergencies do happen though, and sometimes we have to reboot hosts on short notice
[11:04:56] (or replace their certs, which is what I think you're referring to)
[11:05:10] and being as prepared as possible for that should be a priority
[11:05:35] yes
[11:06:06] I never disagreed with that
[11:06:20] glad to see we're on the same page :)
[11:06:26] actually, I am the one that thinks that we should be able to crash any server at any time
[11:06:36] not reboot, crash
[11:06:59] yeah, that'd be great as well, but I'd take scheduled-with-short-notice for starters
[11:07:15] maybe even log it?
[11:07:25] log what?
[11:07:30] a server reboot
[11:07:44] do you agree that is a good policy?
[11:11:39] look, replication is running again, end of story
[11:19:59] you really should be more receptive to those kinds of things
[11:20:53] receptive to not having a hugely critical SPOF?
[11:20:53] responding to your colleagues' legitimate requests with "no matter how much you want it, it cannot happen" and "I will ask for a budget to several vendors to implement this functionality, (or should we prefer to implement it in-house?)"
[11:22:30] you should be more receptive, explain the challenges, seek input on how to solve them and provide your version of the roadmap for when such a thing can happen
[11:22:50] otherwise, it's not very productive and just frustrates the rest of us
[11:23:00] paravoid, it doesn't matter, last time I did that I was completely ignored
[11:23:32] what kind of response is that?
[11:26:48] that any time I try to give feedback I get shut down
[11:27:08] by you, mostly
[11:53:25] well, first off, I'd love to hear that kind of feedback from you
[11:53:40] I'd encourage you to put it in a peer review, but unfortunately due to stupid HR reasons I can't be part of that
[11:53:51] so I'd take it privately, preferably to me, or mark if you prefer
[11:54:32] I don't feel like I've attempted to shut you down, so clearly there is an impedance mismatch somewhere (not saying that I'm not at fault)
[11:55:06] I sent a thing I wanted to talk about yesterday to mark to include it on the agenda
[11:55:23] it is ok that it didn't make it (too little time)
[11:55:31] yesterday's agenda you mean?
[11:55:33] but it wasn't even on the agenda
[11:55:45] difficult to feel represented
[11:56:12] which agenda? codfw-rollout or the ops meeting?
[11:56:20] and you knew about it because I mentioned it by name and text at the ops meeting
[11:56:55] I mentioned it on ops, wanted it to be included on rollout
[11:57:20] you've made quite a few mental jumps here
[11:57:21] in automation, you only wrote about mysql, when it is an ops-wide problem
[11:57:42] can I respond to each of those points?
[11:57:56] sure, but let's move it to private
[11:58:00] it is off-topic here
[11:58:04] good idea
[12:37:18] jynus: thanks for fixing labsdb1004 replication and sorry for breaking it.
[12:44:02] no need for that
[13:52:39] db1011 crashed again
[13:53:04] so nice of it
[13:53:34] and I would bet that labsdb1003 has load issues again
[13:53:51] btw we should really change the way mysql alarms are made, I think that those events deserve an email, we cannot just rely on us checking tendril IMHO
[14:04:55] FYI: I've added https://wikitech.wikimedia.org/wiki/MariaDB#Salt for the new grains
[14:43:03] thanks, volans
[14:43:25] I suppose you agree with joe about having the master role on all masters, right?
[14:43:47] yes, I've opened a task as a subtask of improving the switchover to refactor that
[14:43:56] I agree too
[14:44:04] but I have one fear
[14:44:10] and I've just sent CR 287088 to set es1 as 'standalone', they are not slaves, if you agree
[14:44:21] for the grain mysql_role
[14:44:22] yes, that was a good call
[14:44:31] I hadn't thought about that
[14:44:46] as a user I assume that on a role:slave I can do slave status/stop/start
[14:45:08] one thing I fear is that we do a datacenter failover "by accident" (automatic)
[14:45:31] and without proper care, puppet starts 2 instances of pt-heartbeat
[14:45:50] question is, would that be so bad? what is the worst that could happen?
[14:46:02] if no care was taken
[14:46:44] I think that cross-dc replication could break, and that is not that bad, right?
[14:46:49] as of now, with MW not using that much, I guess the worst case scenario is icinga checking the wrong side of the circular replication
[14:47:15] but we can verify that manually to ensure it is correct
[14:47:18] yes, but that would normally cause only a certain +0.X seconds on repl lag
[14:47:51] maybe we could change the tool to replicate without collisions
[14:48:00] actually, would it collide at all?
[14:48:06] the master ids are different
[14:48:07] because shard is not a PK
[14:48:10] true
[14:48:19] depends on the WHERE done by whoever uses it
[14:48:23] the writes I think are safe
[14:48:40] I think that is "safe" (as it is the worst case scenario)
[14:48:51] and it gives me an idea of how to make it safer
[14:49:05] we could add a "site" to pt-heartbeat
[14:49:12] why should pt-heartbeat be run from the local master?
[14:49:28] running it from a single place would make sure the transition is direct
[14:49:47] either it points to A or to B
[14:49:52] but not both
[14:50:27] and then do that from several places as a failover
[14:50:47] I am thinking aloud and saying nonsense
[14:50:53] thing is, go ahead
[14:51:27] we will solve the master switch outside of puppet in a different way later
[14:52:46] technically, if we don't touch pt-h at all when we switch over, it should still work
[14:53:03] with just the addition of eqiad->codfw replication for the heartbeat table
[14:53:27] but it can be left not running for as long as 30 minutes
[14:53:57] if puppet runs on master E, but takes 30 minutes to run on master C
[14:54:18] ok for alerts (they can be silenced)
[14:54:35] not ok if mediawiki uses that for load-balancing
[14:55:09] we could change puppet to shut down pt-heartbeat only if there is another entry in the DB for the same shard updated in the last minute and it's on the passive datacenter
[14:55:30] means that it was already started on the active one
[14:55:33] uh, puppet reading the db, scary
[14:55:39] ehehehe :)
[14:55:40] I agree with the idea
[14:55:48] not with doing it on puppet :-)
[14:56:00] +1
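To make the check proposed at 14:55:09 concrete, a hedged sketch of what the query against the heartbeat table could look like; heartbeat.heartbeat as the location, the shard and datacenter columns, and the 's1'/'eqiad' values are all assumptions taken from the discussion above, not a confirmed schema:

    -- Most recent heartbeat for this shard written from the other site;
    -- if its ts is under a minute old, the active DC is already running
    -- pt-heartbeat and this instance can stay stopped.
    SELECT shard, datacenter, server_id, ts
    FROM heartbeat.heartbeat
    WHERE shard = 's1'            -- shard served by this master (example value)
      AND datacenter <> 'eqiad'   -- rows written by the other DC's pt-heartbeat
    ORDER BY ts DESC
    LIMIT 1;

Whether such a check belongs in puppet or in an external tool is exactly the open question above.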
[14:56:04] sorry, I keep putting in buts
[14:56:12] and not doing anything productive
[14:56:17] just go on with the change
[14:56:55] no problem, ok :)
[14:56:58] it is just that I was scared at first about doing things (outside of alerts, features) depending on ::active_dc
[14:57:39] mostly because it takes 20 minutes to perform an actual failover
[15:00:47] if you have time, take a look at https://gerrit.wikimedia.org/r/#/c/286858/ it is cleanup, but WIP
[15:02:08] ok
[15:03:00] low priority
[15:03:11] with an extra of skip-slave-start
[15:03:42] ++ for that
[15:03:55] :-)
[15:20:36] volans, I found you
[15:21:04] Hi Cyberpower678
[15:21:12] wm-bot3 was very helpful in that
[15:23:45] volans, I'm ready to test
[15:24:30] let me see in what shape the tools db is
[15:24:46] (are)
[15:27:22] Cyberpower678: go ahead, it's in a stable state from the graphs, so it should be easier to spot changes
[15:28:04] They should start up within a minute
[15:28:26] opening tables...
[15:29:08] https://grafana.wikimedia.org/dashboard/db/labs-project-board?var-project=cyberbot&var-server=All
[15:29:20] The bottom graphs should start to go up
[15:29:42] I can see the load on the DB, it is quicker to go up...
[15:29:51] and it is going up......
[15:30:47] Cyberpower678: do they do something "heavier" at the start?
[15:31:10] they = the workers
[15:31:16] When they start up, they test for the existence of the DB and the table
[15:32:19] apart from that, query-wise, are they already at the standard usage right now?
[15:32:36] Looks like it
[15:34:00] SHOW PROCESSLIST is definitely shorter than it was last time all the workers were running.
[15:34:22] don't get me wrong, but I don't think that this is sustainable at all... let's leave it a few minutes to gather some data points on the stats too
[15:34:42] Can I see the graphs?
[15:35:52] I'm looking inside the machine now :)
[15:36:08] What's hitting them the hardest?
[15:36:12] Which queries?
[15:36:24] SELECT? INSERT? UPDATE? DELETE?
[15:37:17] https://grafana.wikimedia.org/dashboard/db/server-board?from=1462460795486&to=1462462355486&var-server=labsdb1005&var-network=eth0
[15:37:43] wow, that is a 5 -> 20 spike
[15:38:41] volans, you have per-user metrics on information_schema.*_statistics
[15:40:27] volans, do you have any idea which queries are being the most stressful?
[15:41:39] which is your user/db?
[15:41:55] something_cyberbot
[15:42:10] 423778691, s51059, 10.68.23.85:57737, s51059__cyberbot, Sleep, 1, , , 0.000
[15:42:18] for starters, you should not query INFORMATION_SCHEMA.TABLES, that usually causes stress
[15:42:34] Damn SQL workbench
[15:42:45] BUT I see lots of slow "SELECT * FROM externallinks_enwiki WHERE `pageid` "
[15:43:00] taking up to 40 seconds
[15:43:14] What?
[15:43:18] which probably means there is not an index there (full table scan every time)
[15:43:21] Those shouldn't be slow at all.
[15:43:29] There should be indexes in there
[15:43:37] the user fits, "s51059__cyberbot"
[15:44:14] those 2 are in the top 3 slowest queries of the server in the last hour
[15:44:39] but it is not the full picture, just a quick look at slow queries
[15:45:03] I think the quantity of them is what affects it most
[15:45:21] I also see a 33-second UPDATE on that table
[15:45:21] pageid is indexed
[15:45:30] I mean, not all indexes fit in memory, and I/O waits are really high
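Following up on the slow externallinks_enwiki SELECTs, a minimal sketch of how to confirm whether the pageid index is actually usable for them; the 12345 literal is a placeholder and an equality predicate is assumed, since the WHERE clause is truncated in the log:

    -- Index definitions on the table, then the optimizer's chosen plan;
    -- the EXPLAIN 'key' column should name the pageid index, not NULL.
    SHOW INDEX FROM s51059__cyberbot.externallinks_enwiki;
    EXPLAIN SELECT * FROM s51059__cyberbot.externallinks_enwiki
    WHERE pageid = 12345;   -- placeholder value, not a real page id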
[15:46:03] Does that mean my table is over-indexed?
[15:46:53] volans, I would copy https://tendril.wikimedia.org/report/slow_queries?host=labsdb1005&user=s51059&schema=&qmode=eq&query=&hours=1 for him to check
[15:47:03] assuming there is nothing private there
[15:48:17] ok, makes sense
[15:48:27] also check user_statistics and give a comparison with other users
[15:48:42] write heavy/read heavy, etc.
[15:51:10] Still there?
[15:52:24] guys?
[15:52:33] I have to leave in a few minutes
[15:53:56] sorry, I was also checking other stuff
[15:55:20] as a quick question, is it not possible to run with half of the workers? my guess so far is that there is too much concurrency
[15:55:50] It will only handle half of the alphabet
[15:56:41] The problem is, when this gets deployed on other wikis in the foreseeable future, this problem will come up again.
[15:56:45] 2 letters per worker?
[15:57:06] I'm not sure how to tell the API that.
[15:57:33] You were also told by one of our senior developers that the concurrency used was too high
[15:57:35] if this has to grow further it surely needs some dedicated hardware on the DB side
[15:57:47] but yes, that would be another solution
[15:58:21] the main issue is that toolsdb is a shared environment
[15:58:23] jynus, who told me that?
[15:58:47] I recall legoktm being concerned about the API being hammered.
[15:58:58] https://phabricator.wikimedia.org/T131937#2183955
[15:59:00] yes
[15:59:01] But it hardly uses the APO/
[15:59:03] that I mentioned
[15:59:04] *API
[16:00:28] volans, I'm not sure how to use apprefix to have it return A and B
[16:00:56] If I typed in AB it would probably try to return a list that starts with AB
[16:02:11] sorry, I'm not familiar with our APIs to be able to help here, the simple way could be to have a worker that does one call for A and one for B sequentially, so basically it is not doing A and B in parallel
[16:02:19] so the API call will not change
[16:02:58] Well, I'll have to figure out other solutions.
[16:03:20] But the DB will definitely need to be moved in the future.
[16:03:42] And sooner rather than later.
[16:03:57] I'm sending you the queries so you can take a look
[16:04:14] did you stop it?
[16:04:59] Yes
[16:05:10] I wasn't going to leave with them on.
[16:05:29] The workers have been reduced to 2 again
[16:05:51] ok, thanks for understanding
[16:06:10] volans, I would appreciate some advice on proper indexing.
[16:06:28] Especially for that table. You mentioned me having redundant indexes.
[16:07:07] If you could tell me what indexes I could drop while maintaining proper indexing, that would be great. I would imagine it would save space.
[16:07:23] dropping the redundant ones will only save space, not improve performance; if they are not used, they are not loaded into memory
[16:07:43] All the columns that are currently indexed should remain indexed.
[16:07:59] Gotta go
[19:04:02] volans/jynus....new db servers 4 per row?
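On the redundant-index question at 16:06, a hedged sketch of how to list the index layout so overlapping indexes can be spotted; the schema/table names are the ones from the log, and this only lists definitions, it does not decide what is safe to drop:

    -- One row per index with its columns in order; an index whose column
    -- list is a leading prefix of another index's is a removal candidate.
    SELECT index_name,
           GROUP_CONCAT(column_name ORDER BY seq_in_index) AS cols
    FROM information_schema.statistics
    WHERE table_schema = 's51059__cyberbot'
      AND table_name = 'externallinks_enwiki'
    GROUP BY index_name;

Where the sys schema mentioned earlier in the log is installed (it may well not be on toolsdb), sys.schema_redundant_indexes reports the same thing directly.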