[07:55:47] morning
[07:59:55] hi there
[08:00:46] I'm catching up with some emails, I'll take a second look at db2047 as you asked in the task before repooling it
[08:01:03] no problem, no hurry with that
[08:01:12] in fact, nothing ongoing right now
[08:01:47]
[08:02:19] :)
[08:54:49] when you have some time, let's organize/prioritize pending work
[08:57:27] sure, I was about to ping you, looking at open things assigned to me :)
[08:58:14] let me do a sanity check after I merged a commit to holmium
[08:58:22] so that there is nothing broken
[09:01:20] I am losing it: "collation_server = utf8"
[09:02:11] what/where?
[09:02:31] I
[09:02:52] *I* am losing it, writing collation =utf8, it is not on production anywhere
[09:02:57] :-)
[09:03:28] too much time with bin
[09:03:42] eheheh true :)
[09:04:17] I would actually use real utf8, but I think dns requires 3-byte utf8
[09:09:28] this one is worse: "caracter_set_filesystem=utf8"
[09:10:26] lol
[09:16:14] and the other instance of mysql is running
[09:16:19] so back here
[09:16:46] ok
[09:18:03] ok, so things to do
[09:18:13] I've reimaged db1052
[09:18:30] I've backed up its data and then upgraded it
[09:18:40] seems to be working fine
[09:19:14] we need to do that with the other 5 instances
[09:19:16] good
[09:19:40] we need to focus s3 on the 3 larger instances (slowly)
[09:19:57] to increase their weight?
[09:20:02] and "decom/reassign" the others eventually
[09:20:10] yes, starting with that
[09:20:38] we need to do pending work with Matt, Andrew and Nuria
[09:20:52] I have done some for the first 2
[09:21:20] there is also the semi-sync configuration to be fixed, I've taken a look this morning
[09:21:29] labsdb1003 is under high stress, killing some long running queries helped:
[09:21:33] https://phabricator.wikimedia.org/T133705
[09:21:45] + I reduced pt-kill time to 2 hours
[09:21:58] ok
[09:22:13] it may need more checks/rebalance/throttle
[09:22:49] https://grafana.wikimedia.org/dashboard/db/server-board?from=1461727365741&to=1461748725741&var-server=labsdb1003&var-network=eth0
[09:23:09] I've created some extra tickets
[09:23:36] I may suggest we stop moving the master of dbstore1001 until GTID:
[09:23:48] https://phabricator.wikimedia.org/T133386
[09:23:53] https://phabricator.wikimedia.org/T133385
[09:24:03] but GTID is not yet a priority, so that can change
[09:24:16] I am not sure
[09:24:34] not having to reimage crashed slaves is a huge advantage
[09:25:28] absolutely
[09:25:29] pt-heartbeat things have to be done, but realistically it is not a priority now
[09:26:30] labs reimports have to continue to put labsdb1008 into production ASAP
[09:26:43] but the reimport is failing, have to investigate why
[09:27:20] TLS has to continue to be deployed and a general check has to be done
[09:27:35] (I sent an email to ops-, I hope you saw it)
[09:27:41] there are some pending schema changes
[09:27:47] yes of course :)
[09:27:57] that were blocked by the switchover
[09:28:36] eventlogging is broken, but sadly there is a blocker on m4 work/schema changes
[09:29:00] they got rid of autoincrement primary keys and now the script is basically broken
[09:29:40] great :( sorry for the quick revert the other day, but I was not in a position to be able to do a proper investigation/fix
[09:50:02] ok, so what else
[09:50:55] there are 250 tickets on DBA, and a few others on blocked-on-schema-change
[09:51:32] dbstore space, es2019, dbstore2002 replica
[09:51:34] etc :)
[09:51:37] I've put on Next what I believe are our priorities
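
A rough sketch of the kind of pt-kill invocation referred to above for labsdb1003. Only the 2-hour threshold comes from the conversation; the host, user, query filters and interval are illustrative assumptions, not the production setup.

    # kill SELECTs that have been running for more than 2 hours
    # (threshold from the chat; everything else here is assumed)
    pt-kill \
      --host localhost --user root \
      --busy-time 2h \
      --match-command Query \
      --match-info '^\s*SELECT' \
      --interval 30 \
      --print \
      --kill
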
[09:51:45] es2019, see the last updates
[09:51:54] dbstore2002, yes
[09:52:18] es2019, there is not much to do, unless you want to restart replication
[09:52:40] at this point (2 crashes) I would reimage it/import it from 0
[09:53:15] not only because of mysql, filesystem, etc
[09:53:29] there is no hardware issue there?
[09:53:38] papaul found nothing
[09:53:46] I specifically asked him
[09:53:51] because it looks like it, nothing on the logs of the management card?
[09:54:00] yeah, saw the task
[09:54:13] so, wait now, return if it crashes again
[09:54:40] ok
[09:54:47] feel free to disagree if you have a better plan, I don't
[09:55:47] I will take a quick look to see if I can find the management card logs and whether there is anything there
[09:56:15] ok with that, assigning it to you
[09:57:16] if there is nothing to justify the return we could run some stress tests for a while, to see if it crashes in a more predictable way under stress
[09:57:23] yes
[09:57:30] I thought about that, didn't remember
[09:57:44] because in both cases it failed when under load, more or less
[09:57:57] but it only happened twice, so there is not a strong correlation
[09:59:28] TLS, are you doing something?
[09:59:45] I would put it on backlog if not
[10:00:11] although you may want to check semisync and ssl status at the same time
[10:00:13] your call
[10:00:42] codfw slaves need to be restarted, right?
[10:01:04] not as a huge priority, but eventually, don't they?
[10:25:14] yes, most of them need to be restarted, not that urgent either
[10:25:50] x1-slave needs a reimage too, I think there is a task for that
[10:26:20] T112079
[10:26:20] T112079: Upgrade x1 cluster - https://phabricator.wikimedia.org/T112079
[10:26:28] that one
[10:26:46] "Normal" yet
[10:28:01] BTW
[10:28:21] when I move something in progress to Next, I do not bother unclaiming
[10:28:41] ok, but I assume if it is not in progress it is claimable
[10:28:44] because I may be working on that myself, but feel free to claim those too
[10:28:47] exactly
[10:29:16] also because some people think that if it is not claimed nobody will work on it anymore
[11:21:43] FYI https://phabricator.wikimedia.org/T108856#2242503
[11:22:13] (I do not want you to do anything, but be aware of schema changes being done on that custom replication, which means things will break)
[11:23:10] ok, thx
[14:32:45] funny https://phabricator.wikimedia.org/T133780
[14:33:04] you discovered the issue with double filters, then
[14:33:21] yeah :) found while trying semi-sync options in my.cnf on a local VM
[14:33:26] (those accept multiple options)
[14:33:38] it should be easy to fix, though
[14:33:41] the replicate do... true
[14:33:53] I wonder why it was created like that
[14:34:20] the default mariadb one just creates the classic /etc/mysql/my.cnf one
[14:34:35] no, I mean the link
[14:34:47] maybe to make sure it was deleted?
[14:35:04] and it created another issue by itself?
[14:35:29] I mean the official MariaDB deb package
[14:35:43] the one on jessie, I have another VM with that
[14:35:52] are you sure it is the package and not puppet?
[14:36:20] actually not... let me check
[14:36:36] in any case, it needs fixing
[14:36:45] true, it's puppet
[14:36:47] if it is puppet, however, it will be easier
[14:36:55] that manages both
[14:37:08] I'll send a patch
[14:37:13] so a check with salt + patch
[14:37:35] we should check all servers just in case, to make sure we do not overwrite anything
[14:37:59] eg 5.5 servers with default package, etc.
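
For context on the T133780 discussion above, a sketch of how the double load can happen, assuming the layout described later in the log (puppet manages /etc/my.cnf and also creates the symlink):

    # /etc/my.cnf            real file, managed by puppet
    # /etc/mysql/my.cnf  ->  symlink to /etc/my.cnf
    #
    # mysqld reads both default locations in order, so every option in the file,
    # including the replication filters, gets parsed twice. The default read
    # order can be verified on a host with:
    mysqld --verbose --help 2>/dev/null | grep -A1 'Default options'
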
[14:38:14] 5.5 may install /mysql/my.cnf
[14:38:32] it's in each class, so we could fix mariadb::config only for now
[14:38:32] and while we want to kill that, not blindly
[14:38:46] and not legacy ones
[14:38:59] I've updated the task description
[14:39:10] still, I would do a check (and I am not telling you to do it, just to do it before applying the patch)
[14:39:34] but good catch
[14:40:00] I think the default config also has an include
[14:40:47] not ours, the ubuntu/debian one
[14:40:50] usually the one for stuff in conf.d/
[14:40:53] ye
[14:40:55] s
[14:41:27] I wonder if we should do proper puppet, or avoid it entirely
[14:41:44] I think the current model is safer
[14:42:17] but again, I want your input on it
[14:42:51] really good catch
[14:42:55] this is more explicit, we load everything from a single file, the other option could be to have our puppet create a .cnf in conf.d/ and get it loaded
[14:43:13] my fear is some boxes have legacy config (not purged)
[14:43:26] that way we avoid ambiguity
[14:43:44] I think conf.d is ok for a non-dedicated machine
[14:44:03] but we want a single file for dedicated machines
[14:44:12] easier to debug and maintain
[14:44:17] +1
[14:44:27] (except for this error)
[14:44:30] :-)
[14:44:54] just loaded once, not twice :)
[14:45:21] you have caught an error the previous 2 dbas did not catch
[14:45:45] lol
[14:46:02] let me do another check, because I know why I said it was the package in the first place
[14:46:15] I remember the double filter load, asking why and being told "do not worry about it, probably a mariadb bug"
[14:46:34] and did not investigate further
[14:47:46] ok, it's definitely puppet, I was mistaken saying it was the package because in my VM there is no puppet, but I actually created the link manually when I created it to make it like our prod ones
[14:47:58] :-)
[14:48:16] I asked because the package is not customized at all
[14:48:33] compiled + packaged, with no pre/post, etc.
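
A minimal sketch, using names mentioned in the conversation (mariadb::config, production.my.cnf.erb), of the kind of puppet resources involved: the single managed /etc/my.cnf preferred for dedicated machines, plus the symlink that turns out to cause the double load. This is an illustration, not the actual module code:

    class mariadb::config (
        $config = 'mariadb/production.my.cnf.erb',
    ) {
        # one explicit config file, easier to debug and maintain on dedicated hosts
        file { '/etc/my.cnf':
            ensure  => present,
            owner   => 'root',
            group   => 'root',
            mode    => '0444',
            content => template($config),
        }

        # the link discussed above: with it in place mysqld reads the same file twice
        file { '/etc/mysql/my.cnf':
            ensure => link,
            target => '/etc/my.cnf',
        }
    }
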
[14:49:00] once we get rid of the rubbish from 5.5, we could make a proper package
[14:50:46] :-)
[14:51:35] for now ignore my last CR for semi-sync, I thought it was a good solution but there is one case I don't like, checking it now
[14:52:09] whenever you are satisfied, feel free to ping me
[14:52:22] re: https://phabricator.wikimedia.org/T133588#2237441
[14:52:56] I didn't know about that specifically, but I saw the name of that table and that scared me - there could be a charset issue with table names
[14:53:17] which is why I enabled logging
[14:53:43] I may have a second look at that, as I am currently working on the m4 database
[14:54:18] what made me assume that it is a corrupted name is that all the others are MobileWebClickTracking_*
[14:54:48] but it could be just a charset config issue
[14:55:07] actually, I think the table is like that on the master
[14:55:22] I saw the same on the master too, yes
[14:55:24] a different thing is if its name gets corrupted on the way
[14:55:31] yeah, same thinking
[14:55:47] I can take over that and at least solve the script failing
[14:56:05] (without promising actual improvement)
[15:00:18] the easiest explanation that I have is that when the table was created on disk (.frm is from Dec 16 2014) that server had an issue/crashed
[15:00:33] it could have happened to the master at that time or to a slave that afterwards became master
[15:01:44] I have a better explanation: those tables are dynamically created, someone did a bad thing(TM)
[15:01:56] dynamically == created by the application
[15:02:17] lol
[15:02:27] then the "replication process" creates it as a clone
[15:02:37] it could have also happened there, as you said
[15:04:49] interesting on dbstore2002
[15:05:02] -rw-rw---- 1 mysql mysql 7072 Jun 4 2015 MobileV@fffdb@fffdlickTracking_5929948.frm
[15:05:06] -rw-rw---- 1 mysql mysql 3291 Dec 16 13:12 MobileWebClickTracking_5929948.frm
[15:05:47] so replication-related, probably
[15:06:11] I mean on dbstore2002 there is only the frm with the wrong name, .MAD and .MAY have the right one
[15:06:17] will follow up with the devs, have to talk about other things (do not let me distract you)
[15:06:40] ok
[15:11:59] back to my CR for semi-sync, it's the cleanest way I've found (it can be just a bit cleaner with 'plugin_load_add') but it has a catch: for the servers where the plugin was installed with INSTALL PLUGIN (hence saved in the mysql.plugin table) it will throw ERROR Plugin 'rpl_semi_sync_master' already installed at startup.
[15:12:07] what do you think? https://gerrit.wikimedia.org/r/#/c/285649/1
[15:14:06] no, put an if there
[15:14:23] if it's for $master... it's not passed to the class :(
[15:14:30] ?
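
A hedged my.cnf sketch of what the semi-sync CR is aiming for; plugin and option names are the stock MariaDB ones, while the choice between a single plugin-load line, plugin_load_add, and the loose- prefix is exactly what gets debated below:

    [mysqld]
    # load the semi-sync plugins from the config instead of INSTALL PLUGIN
    plugin_load_add = semisync_master.so
    plugin_load_add = semisync_slave.so

    # enable only the role that applies; the loose- prefix keeps mysqld from
    # refusing to start when the plugin is not loaded
    loose-rpl_semi_sync_master_enabled = 1
    loose-rpl_semi_sync_slave_enabled  = 0
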
[15:14:39] if you want the if $master for the master plugin
[15:14:57] it is not among the parameters that are available there
[15:15:05] well, the class should have a new param
[15:15:35] semisync = 'off'|'master'|'slave'|'both'
[15:15:53] * volans remembers the last time he tried to add a parameter :)
[15:15:55] based on master + current active datacenter or something
[15:16:01] yeah, but to the class
[15:16:05] the main class
[15:16:08] not the config
[15:16:27] we will not add more params to the top class
[15:16:35] only to the config
[15:16:45] ah ok, you want mariadb::core to stay cleaner, mariadb::config can have additional ones
[15:17:00] eventually, we want more params on core too
[15:17:11] but with a cleaner approach
[15:17:55] what I do not like is loading 2 plugins on all servers
[15:18:12] when they are not going to be used in most cases
[15:18:21] and I think it has a penalty
[15:18:33] true, but in the case of a failover? who will remember to load it?
[15:18:39] puppet
[15:18:58] and you will say, but we already have problems with puppet dynamic config
[15:19:01] true
[15:19:17] this will be solved in the same way :-)
[15:19:32] ah, you do not know what this mysterious new way is?
[15:19:37] ahahah
[15:19:44] neither do I
[15:20:00] but puppet still has to provide a proper config on reboot
[15:20:03] no matter what
[15:20:16] "reflect the current state"
[15:20:24] philosophy
[15:20:42] we will later think about how to orchestrate it
[15:21:08] ehehe
[15:21:34] it is true, a monitoring process diffing my.cnf and live is as easy as pt-config-diff
[15:21:48] also scary
[15:21:58] /etc/my.cnf h=localhost
[15:22:05] monitoring? not
[15:22:14] automatic orchestration? yes
[15:22:22] that is why we do not have a solution yet
[15:22:54] do not make them loose, anyway
[15:23:37] or do you prefer it not to fail?
[15:24:12] given that we restart mysql manually, fail is also an option, then you'll have to comment all 6 lines instead of 1
[15:24:27] I actually prefer it failing
[15:24:43] what are the reasons for it to fail? not having the plugin at all?
[15:24:54] I can live with that
[15:25:14] I agree with you, re-thinking about it
[15:25:24] look at the tokudb
[15:25:27] nodes
[15:25:34] we do loading of tokudb there, too
[15:25:42] check how it goes
[15:26:01] if it fails, it is more problematic, of course, so not really the same issue
[15:26:26] ok. And for the error message at restart for the ones that have it already installed? is it acceptable? I don't like it too much but I don't know a way to do it without dynamic orchestration
[15:26:40] ?
[15:27:00] like compiled statically?
[15:27:05] do we have those?
[15:27:32] (I am asking, we should find out)
[15:27:33] no, just installed at runtime with INSTALL PLUGIN
[15:28:06] I think I do not understand, why would it fail?
[15:29:03] the plugin load tells mysql to load the plugin at boot, but doesn't write to mysql.plugin, so if you comment it out in the my.cnf and restart, it doesn't get loaded
[15:29:27] instead, if you have a mysql and do INSTALL PLUGIN, this gets saved into mysql.plugin and loaded at restart
[15:29:45] so having both causes an error message that it is already loaded, but it works fine
[15:30:02] ahhhh
[15:30:14] but I don't like to see an ERROR at startup that is "expected"
[15:30:49] why not have it on config and unloaded?
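
As a reference for the behaviour described above, a minimal SQL illustration: INSTALL PLUGIN persists the plugin in mysql.plugin, so it is loaded again at every restart no matter what my.cnf says, and combining it with a plugin-load line produces the "already installed" error; UNINSTALL PLUGIN is the step the "unload on stop" idea would need:

    -- installs the plugin now and records it in mysql.plugin, so it is reloaded
    -- automatically at every restart even without a plugin-load line in my.cnf
    INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';

    -- lists what will be auto-loaded at startup from the table
    SELECT name, dl FROM mysql.plugin;

    -- removes the mysql.plugin entry, leaving only the my.cnf plugin-load line
    UNINSTALL PLUGIN rpl_semi_sync_master;
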
[15:30:56] not now, obviously
[15:31:00] on stop
[15:31:18] the how is a good question
[15:31:29] that's the goal, we should have something that on stop will uninstall it, and then it will get loaded by the config
[15:31:33] I am ok with loose
[15:31:43] if you plan to have monitoring later
[15:31:53] if you do, ok
[15:32:26] that would solve any potential issue
[15:34:10] plan doesn't necessarily mean doing it now, just document it on the patch: "doing this because I assume that"
[15:34:54] but I am more worried about the other thing, the load line
[15:35:00] I've seen that for tokudb we do the same without loose, I think it's ok to fail, and it shouldn't, given that the semi-sync plugin is shipped by default
[15:35:49] or just comment "# plugin load, if it fails, do this bla bla"
[15:36:03] so that non-DBAs have an idea what would be going on
[15:36:12] ok, makes sense
[15:36:23] the load line is
[15:36:26] 160427 16:50:26 [ERROR] Plugin 'rpl_semi_sync_slave' already installed
[15:36:30] my issue is always people outside of here
[15:36:33] 160427 16:50:26 [Note] Semi-sync replication initialized for transactions.
[15:36:52] we need to provide them a level of abstraction
[15:58:54] if that error does not prevent server execution, I wouldn't care much
[16:00:29] everything works fine and rpl_* variables are loaded correctly
[16:06:08] then I wouldn't care much, putting a heads up on wikitech/comment at most
[16:06:40] o
[16:06:42] ok
[16:27:14] if you're ok with https://gerrit.wikimedia.org/r/#/c/285664/1 I'll merge it and send the other CR with the actual changes
[16:30:06] I was thinking, that is like the first step, right, to later enable it on demand?
[16:30:35] ?
[16:30:56] it does not apply to any server yet?
[16:33:04] checking: 285649
[16:33:11] that's just a variable, it does absolutely nothing because it's not used, then I'll send the other CR where in production.my.cnf.erb it will be used to set it up, and in mariadb.pp inside mariadb::core I set it to master/slave based on the $master variable
[16:33:26] yes, no need for me to review
[16:33:30] (I did)
[16:33:37] I was already thinking ahead
[16:33:49] for the other one, give me a sec, I just need to update it
[16:33:53] yes
[16:33:56] sorry
[16:34:25] (I am on other things and it takes me a while to come back to that)
[16:35:08] no, sorry to distract you
[16:41:59] I am not 100% sure about the algorithm here https://gerrit.wikimedia.org/r/#/c/285649/2/manifests/role/mariadb.pp
[16:42:08] but I do not have a clear alternative
[16:42:57] I would not enable it, however, for cross-datacenter or parsercache
[16:43:49] agreed, let me see what I can do. The problem there is that codfw masters have $master=false...
[16:43:57] 2 options
[16:44:11] have master=true and check read only on the current active datacenter
[16:44:26] have the current datacenter check on slave, so it is off
[16:45:00] you have $replication_is_critical = ($::mw_primary == $::site) below
[16:45:24] but again, I do not have a clear mind about that
[16:46:03] let me try something to improve it
[16:48:30] parsercaches are not coredbs, so forget about that
[16:48:58] yeah, I was checking the same
[16:49:35] in theory we could have another "level" of decision, because the plugin can be loaded but set to OFF too
[16:49:48] but I'll avoid it, it only creates confusion
[16:50:04] no reason, I think
[16:50:18] well
[16:50:29] maybe some slaves?
[16:50:53] that we do not want to block on them? But it makes no sense, we can always set it to off
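
A sketch, not the change set under review, of the kind of decision being discussed for manifests/role/mariadb.pp: derive the semi-sync mode from $master and from whether the local datacenter is the active one (the same $::mw_primary == $::site test used for $replication_is_critical), and pass it down to the config. The parameter name is an assumption:

    # only the master in the active datacenter enables the master side;
    # everything else, including codfw, is treated as a semi-sync slave
    if $master and $::mw_primary == $::site {
        $semi_sync = 'master'
    } else {
        $semi_sync = 'slave'
    }

    class { 'mariadb::config':
        semi_sync => $semi_sync,
        # other parameters as already defined in the role
    }
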
[16:50:54] load the plugin always and keep it off for codfw, for example
[16:51:21] no reason, loading the plugin and enabling it has the same impact
[16:51:47] it is not like you would need to restart, in which case we would enable it everywhere
[16:52:00] (*load it)
[16:52:14] so that it's only a matter of SET GLOBAL and not INSTALL PLUGIN, which would then make it generate the ERROR in the log at start
[16:52:56] too focused on logs' false positives, I am more worried about operational errors
[16:53:07] I will let you complain when we have the logs on kibana
[16:53:14] not yet :-)
[16:53:22] but in any case we need orchestration, because setting it ON on a slave is useless until you do stop slave IO_THREAD; start slave IO_THREAD;
[16:53:30] eheheh
[16:53:32] true
[16:53:36] needs stop
[16:56:02] slightly improved: https://gerrit.wikimedia.org/r/#/c/285649/3/manifests/role/mariadb.pp
[16:57:45] * volans brb
[17:03:36] do you want it for es2/3 too? (it's now a separate my.cnf)
[17:03:58] probably yes
[17:06:02] we should only deploy it before starting a server, though
[17:06:09] to check it live
[17:07:11] either a s* reimage or a depooled slave for TLS or something
[17:07:25] I can do it tomorrow on codfw, restarting a slave for TLS
[17:08:10] but codfw isn't enabled :-(
[17:08:34] * volans facepalm
[17:09:34] 16.33 error spikes (is it db1044?) either api abuse/lack of optimization or something else
[17:10:07] I am not concerned about the change at all, but the last thing I want is to have issues on an emergency restart
[17:10:22] of course
[17:10:53] we should prioritize T133780 more
[17:11:01] because of the same issue
[17:11:16] T133780: Multiple Puppet class make MySQL load /etc/my.cnf twice - https://phabricator.wikimedia.org/T133780
[17:11:31] (and maybe that was the reason why it wasn't puppetized in the first place)
[17:11:40] that, and manual tuning, I suppose
[17:12:16] under stress (pt-online-schema-change), I had to manually enable/disable semisync to avoid lag
[17:12:39] do you want to leave /etc/mysql/my.cnf or /etc/my.cnf ?
[17:13:27] I like /etc/my.cnf more, even if the other would be more "debian correct"
[17:13:33] less typing
[17:14:37] I would do a puppet ensure deleted, then delete completely
[17:14:53] with some quick checks first
[17:15:10] delete only if it's a link to /etc/my.cnf, I guess that should be safe
[17:15:17] true
[17:15:53] can puppet do that easily?
[17:16:44] no idea, hopefully yes, I'm checking :-)
[17:18:59] meanwhile I'm running the puppet compiler for the other change
[17:29:36] doesn't seem so, but, given that we have ensure => link with a target, Puppet should already be ensuring that it's a symlink to that target, so a simple absent should be enough
[17:30:48] ok
[17:49:45] ok, I gotta go, I might reconnect later to check the puppet compiler runs, if not I'll see you tomorrow
[17:58:29] bye!
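
The cleanup agreed on at the end of the log, as a minimal sketch: since the resource currently uses ensure => link with a target, Puppet is already guaranteeing the path is that symlink, so switching it to absent removes the link (and with it the double load) without touching /etc/my.cnf itself:

    # replaces the previous "ensure => link, target => '/etc/my.cnf'" resource
    file { '/etc/mysql/my.cnf':
        ensure => absent,
    }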