[11:37:03] for when you come back, one reason why allowing lag during maintenance is not a big deal is that in most cases our bottleneck is in reads, not writes, so most slaves catch up nicely
[14:45:13] Re: db1048, there was a phabricator upgrade the other day, so the slave was probably affected by that
[16:12:01] ok, thanks (back, really late, sorry)
[16:27:14] so what is the task for the es data migration?
[16:29:27] I was thinking of creating a subtask of T126006 for each server
[16:30:07] no need for one task per server, just create one, and comment which ones you are doing so we do not overlap
[16:31:08] there is a syntax for checkboxes: "[X] Server 1 / [] Server 2"
[16:31:29] ok
[16:31:54] if you haven't already started with any of them, I will start with es2001 -> es2011
[16:32:09] ok
[16:40:39] https://phabricator.wikimedia.org/T127330
[16:42:26] good
[16:43:35] once you start migrating data, think about which of these interests you most: https://phabricator.wikimedia.org/tag/dba/
[16:43:47] if you do not know, I can give you several suggestions
[16:44:16] ok, I'll let you know :)
[16:44:42] I am saying this because most of the migration process is "waiting"
[16:49:44] have a look at puppet, that part is more interesting than purely sending data away
[16:51:52] ok, thanks
[16:51:59] btw, what about es2003?
[16:52:21] 4 -> 3 servers, one cannot be preserved
[16:52:42] why the 3rd? pure randomness
[16:52:59] thinking that the 1st or the 2nd would have more chances of being right
[16:53:02] no reason at all
[16:53:36] lol
[16:54:00] those are my own quirks
[16:54:16] the only important thing is checking that they end up on different rows
[16:54:30] do you have access to rackspace?
[16:54:41] sorry
[16:54:46] meant racktables
[16:55:24] (probably not, unless you have asked for it)
[16:57:47] I didn't know about it, so no :)
[16:58:08] but I've seen your diagram in the ticket
[16:58:16] I think we have to ask robh for that
[17:12:02] so if you search for es2011 in the search bar
[17:12:13] you will get an idea of the physical location
[17:12:57] nothing actionable here, really, but nice to have/double-check
[17:15:03] also nice to quickly check models, warranty, etc.
[17:15:36] ok, I'll take a look around to familiarize myself a bit once the data is copying
[17:15:41] https://gerrit.wikimedia.org/r/#/c/271560/1 for you
[17:15:49] looking
[17:16:04] I thought I'd add all of them, and sorry for the git diff, the first 2 blocks look ugly
[17:16:07] dunno why :)
[17:16:17] they should be like the third one
[17:16:18] that's normal, silly diff
[17:18:04] I'm not fully sure that that's enough
[17:18:08] there is a lot of clutter that we will get rid of soon
[17:18:28] like the p_s on and others that should be on by default
[17:19:26] ah, one trick
[17:19:52] sometimes the logic is easy and the only thing to check is the syntax
[17:20:55] for that, joe created our puppet compiler
[17:21:09] nice for a quick sanity check
[17:21:20] let me show you, even if in this case it is a bit of overhead
[17:21:25] cool, I have configured a puppet linter in my editor
[17:21:37] yes, that is great, too
[17:21:57] this goes a bit further, but not totally, as it cannot apply the changes
[17:22:20] yes, pure syntax
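For the pure-syntax part, a quick local check is also possible before involving the compiler. A minimal sketch, assuming a checkout of operations/puppet and that the commit touches manifests/site.pp (adjust the paths to whatever files the change actually modifies):

    # validate syntax and style locally before sending the change to the catalog compiler
    puppet parser validate manifests/site.pp          # catches pure syntax errors
    puppet-lint --no-80chars-check manifests/site.pp  # style warnings only, does not compile anything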
[17:22:39] I think this is it: https://integration.wikimedia.org/ci/login?from=%2Fci%2Fjob%2Foperations-puppet-catalog-compiler
[17:22:53] login, as usual
[17:23:15] build with parameters, put your gerrit commit there
[17:23:44] and a list of fully qualified domain names
[17:24:12] e.g. one that changes per shard, and one that doesn't have to change
[17:25:03] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/1809/console
[17:25:20] good, now wait :-)
[17:25:45] I know this is a lot of overhead, but I am throwing everything I can think of at you
[17:25:56] ERROR: Unable to find facts for host es2015.codfw.wmnet, skipping
[17:26:12] ah, it doesn't work for non-existing hosts
[17:26:37] but still, it will be useful in the future
[17:26:40] :-)
[17:26:56] ahahah let me tell joe
[17:27:08] Hosts that have no differences (or compile correctly only with the change): es2001.codfw.wmnet
[17:27:13] so that's good
[17:27:40] note that this is running on labs, it has lots of limitations
[17:32:08] ok
[17:32:20] I've got the bug report ready to file if joe confirms :)
[17:33:28] well, more of a wishlist :-)
[17:34:07] but we will see what I do; happy to get patches!
[17:35:43] do we sync with each other for puppet merges?
[17:36:07] ?
[17:36:22] with other ops?
[17:36:24] to avoid conflicts, yes
[17:36:41] well, you are responsible for merging your own patches
[17:36:47] the ones you +2
[17:37:09] it asks for confirmation; if you are about to deploy more than your own commit, ask
[17:37:15] ok
[17:44:11] the data update that you told me about on es1, is that a manual thing?
[17:44:39] you mean the recompression?
[17:44:52] the addition of data from es2/3
[17:44:54] I think it has not been done in 3 years
[17:45:03] there is a script
[17:45:27] we will test it soon with mediawiki devels, with no rush
[17:45:39] just to ensure that it will not run while the old ones in es1 are depooled for the data migration
[17:50:04] no, purely on demand
[17:50:16] and if they are depooled, it should not hit them at all
[17:50:29] follow the full depool/repool cycle anyway
[17:50:41] because there is ongoing codfw activity right now
[17:51:24] (not user traffic, but cirrus reindexing; we do not want to affect any of that)
[17:51:55] ok, so for the CR, are we good or do we need to add them in other places (except the cluster =X)?
[17:52:16] no, that issue shouldn't affect that
[17:52:53] only ganglia and specific usages of salt
[17:53:01] ok, going ahead with rebase/merge/puppet-merge then
[17:55:54] yep, go on
[17:56:13] test it as soon as you merge in case a rollback is necessary (it shouldn't be)
[17:58:04] ok, forcing sudo puppet agent -t on es2011
[18:01:18] in the end, every change is about managing risks
[18:01:35] adding new hosts, the worst thing that could happen is puppet failing on those
[18:01:47] which should have very little impact
[18:02:20] if it's a syntax error, though, it would be worse
[18:02:26] I like my puppet not being very intelligent: not starting mysql, not upgrading automatically, etc.
[18:02:34] precisely to avoid those issues
[18:02:47] puppet ran successfully
[18:02:52] great
[18:03:20] I don't even start mysql, we'll start it directly with the data
[18:03:27] now what is left is something you have already done
[18:03:37] depool! :)
[18:03:43] downtime/depool/etc.
[18:04:34] Free PE / Size 0 / 0
[18:04:42] mmm
[18:04:57] yes, do a sanity check to see that everything is correct
[18:05:16] ah, you mean the old server?
[18:05:18] ha
[18:05:25] no, es2011...
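The "Free PE / Size" figure above is vgdisplay output; a minimal sketch of that kind of storage sanity check on the new host before pulling data (the mount point is a placeholder, not necessarily the actual es2011 layout):

    # on the target host: confirm how the space is allocated before starting the copy
    sudo vgdisplay                 # free physical extents per volume group ("Free  PE / Size")
    sudo lvs                       # logical volumes and their sizes
    df -hT /srv                    # space actually available where the data will land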
[18:05:42] we are replacing them for space reasons, right?
[18:06:00] well, warranty and performance too
[18:06:11] es2001: 9.1T | es2011: 11T ...
[18:06:14] not much more
[18:06:17] over time, buying disks to replace the failed ones is not worth it
[18:06:40] yes, but it is not compressed
[18:06:47] and it is read-only
[18:07:13] the read-write ones only have like 2-3 TB
[18:07:47] and we can move them on a schema basis
[18:08:15] true, easy to "shard"
[18:08:29] there are actually several shards on the same machine
[18:10:55] also, 3 terabytes are like 3 years of edits, worst case scenario
[18:11:27] and we now have a total of 16 free
[18:11:31] more, probably
[18:11:52] ok
[18:12:39] see https://grafana.wikimedia.org/dashboard/db/server-board?panelId=17&fullscreen&from=1455732745144&to=1455818905144&var-server=es2006&var-network=eth0
[18:12:47] es1 had no space problems
[18:12:53] es2 and 3 did
[18:13:45] and now we can fit all of es2 and 3 in es1 if needed, and still have 24TB free
[18:14:49] (all of this before compressing, which will happen at some point)
[18:38:26] jynus: still not authorized in icinga... :(
[18:38:47] what is missing?
[18:38:51] but I should be...
[18:41:48] checking the gerrit change from the other day
[18:42:50] https://gerrit.wikimedia.org/r/#/c/271022/1/modules/icinga/files/cgi.cfg
[18:45:50] jynus: solved... I had to log in with uppercase first
[18:46:26] I've acked the alerts
[18:46:34] this was impossible to avoid, actually
[18:46:51] thanks, I'm setting the downtime
[18:46:57] unless we caught the alerts between being added and being checked
[18:46:58] it was, between the 1st and 3rd check
[18:47:03] yep
[18:47:07] the important ones are the production ones
[18:47:41] e.g. when you stop mysql on one that is currently online
[18:49:00] why didn't we get the pages in the channel?
[18:49:37] we did, I think
[18:49:57] "PROBLEM - mysqld processes on es2018 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld"
[18:50:07] there is an issue there
[18:50:25] only one per server
[18:50:39] icinga has 4 failing checks each
[18:50:47] ah, only the critical one
[18:50:48] ok
[18:50:54] lag should not be critical on codfw
[18:51:01] so we are being paged incorrectly
[18:51:12] that needs investigation
[18:51:22] ok
[18:51:40] hoo complained recently (with good reason) and could not find the cause
[18:51:54] so probably some misconfiguration of the dba group
[18:54:00] you are doing a good job, learning fast
[18:54:52] I've put all of them in downtime until the 29th, spread from noon to the afternoon
[18:55:11] all the new ones?
[18:55:20] yes
[18:55:37] makes sense
[18:55:54] one day we will get rid of icinga and all this nonsense
[18:56:11] looking forward to it :)
[18:56:44] the many alerts are a problem, because on one hand I want to merge many of those
[18:56:56] on the other hand I want to downtime them individually
[18:58:24] what about having a check that all other checks on a host depend on, and that checks for the nonexistence of a file on the host
[18:59:20] if you touch that file it gives a warning/error without paging and silences the dependent checks (I think it is doable in nagios/icinga)
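A rough sketch of what that flag-file check could look like as a plugin (the name and path are made up for illustration, this is not an existing check); the other checks on the host would then be attached to it with a nagios/icinga servicedependency so their notifications are suppressed while the flag exists:

    #!/bin/bash
    # check_maintenance_flag: exits WARNING while a DBA has touched the flag file, OK otherwise
    FLAG=/var/run/dba-maintenance
    if [ -e "$FLAG" ]; then
        echo "WARNING: $FLAG present, host under DBA maintenance"
        exit 1    # WARNING: visible in the UI, no page
    fi
    echo "OK: no maintenance flag"
    exit 0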
[18:59:37] well, most of those checks are common, so this is the wrong channel :-)
[18:59:44] true :)
[18:59:46] if you know what I mean
[19:00:19] mail ops, create a ticket, etc.
[19:00:43] I can tell you that many things are not done due to prioritization of tasks
[19:01:16] e.g. monitoring is like 1/10th of the responsibilities of one person
[19:01:32] yeah, that's normal
[19:02:07] so, any proposal is welcome, especially if it is easily actionable
[19:02:36] in other words, "show me da patch"
[19:06:38] :)
[19:06:56] https://gerrit.wikimedia.org/r/#/c/271577/1
[19:07:16] ^that is old news
[19:07:47] saw it, you're quick... I was checking the deployments page
[19:07:52] there is a MediaWiki train....
[19:08:07] yeah, better to delay
[19:08:35] I'll check later if they have finished
[19:08:40] that is why you should have a proper, interesting task, with this one being your background task
[19:08:43] :-)
[19:09:58] so as homework for the next day, choose something you want to do: either a DBA task or something you already saw that is very wrong and want to fix
[19:10:24] ok :)
[20:18:34] mysqlbinlog doesn't like the ssl-* options in [client], we should probably use loose-ssl-* instead, what do you think?
[20:19:15] see also https://bugs.mysql.com/bug.php?id=74864
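A minimal sketch of the loose-ssl-* idea (file location and paths are placeholders): the loose- prefix makes a client tool that does not implement an option print a warning instead of aborting, while mysql/mysqldump still pick the option up normally, so mysqlbinlog would only complain about the [client] section instead of failing:

    # e.g. in /etc/mysql/my.cnf or ~/.my.cnf (illustrative)
    [client]
    loose-ssl-ca   = /etc/mysql/ssl/ca.pem
    loose-ssl-cert = /etc/mysql/ssl/client-cert.pem
    loose-ssl-key  = /etc/mysql/ssl/client-key.pem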