[04:55:22] DBA, Goal, Patch-For-Review: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2114.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201905100455_ma...
[05:40:34] DBA, Goal, Patch-For-Review: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2114.codfw.wmnet'] ` and were **ALL** successful.
[05:40:50] DBA, Goal, Patch-For-Review: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (Marostegui)
[05:41:03] DBA, Operations, ops-codfw: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (Marostegui) This host has been re-imaged successfully
[06:08:02] so snapshots are ongoing, now on their last ones
[06:08:12] and individually they are not too slow
[06:08:41] \o/
[06:08:43] but because they are generated in random order, there is a huge amount of resources wasted (e.g. 2 generated on a single dbprov most of the time)
[06:08:43] good!
[06:09:06] So I think I am going to implement a generic sort option
[06:11:02] DBA, Goal, Patch-For-Review: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (Marostegui)
[06:11:11] if you need to provision some, they are almost all done except s1 and s5
[06:11:37] I might take s4 yeah
[06:11:46] Not sure yet if today or monday
[06:11:53] btw
[06:12:01] should we disable notifications for backup sources?
[06:12:07] to avoid them sending irc alerts?
[06:12:08] yeah, probably
[06:12:12] on lag
[06:12:16] yeah
[06:12:41] for the ones that do stop slave
[06:12:43] at least
[06:12:59] yeah
[06:16:03] or we can increase the alert to a few hours
[06:16:20] up to you, I don't mind either way :)
[06:16:21] on all dbstores
[06:16:43] that way we know something bad is happening if it lags for e.g. > 1 day
[06:17:01] it will still show up in icinga, so we would see it
[06:26:24] The stats gathered are interesting https://phabricator.wikimedia.org/P8506
[06:30:34] so revision_actor_temp is now almost as big as revision!
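A minimal sketch of what a "generic sort option" like the one mentioned above could look like: interleaving snapshot jobs across destination hosts so that two snapshots don't pile up on the same dbprov. The job structure and host names are hypothetical, not the actual backup scheduler.

```python
from collections import defaultdict
from itertools import zip_longest

def sort_by_destination(jobs):
    """Interleave snapshot jobs so consecutive runs hit different
    destination hosts (round-robin over destinations), instead of
    letting a random order send two snapshots to the same dbprov."""
    by_dest = defaultdict(list)
    for job in jobs:
        by_dest[job["destination"]].append(job)
    ordered = []
    # Take one job per destination on each pass until all queues drain.
    for batch in zip_longest(*by_dest.values()):
        ordered.extend(j for j in batch if j is not None)
    return ordered

# Hypothetical example: four sections assigned to two provisioning hosts.
jobs = [
    {"section": "s1", "destination": "dbprov2001"},
    {"section": "s2", "destination": "dbprov2001"},
    {"section": "s3", "destination": "dbprov2002"},
    {"section": "s4", "destination": "dbprov2002"},
]
print([j["section"] for j in sort_by_destination(jobs)])  # ['s1', 's3', 's2', 's4']
```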
[07:12:58] jynus mark I have started a brainstorm ideas section for the metrics on our etherpad - line #1
[07:13:18] as it says, it is just a brainstorm
[07:13:48] some will be discarded, some others can maybe develop into something better
[08:40:45] marostegui: check https://gerrit.wikimedia.org/r/c/operations/puppet/+/509343
[08:40:50] ok
[08:46:05] great, thanks :)
[08:46:41] a few quick comments
[08:46:53] there are of course all kinds of metrics we could track to see things that affect the team
[08:46:57] such as hardware failures etc
[08:47:04] but hardware failures are not something you can control/improve
[08:47:14] so while it's useful to know, that's not a great metric for _this_ purpose
[08:47:20] yeah, but they are time consuming
[08:47:21] a related one
[08:47:22] yes
[08:47:27] but that's not the point for this
[08:47:27] As I said, it is just a brainstorm :)
[08:47:30] ok
[08:47:32] so useful to track, but not for annual planning
[08:47:34] so one that is related
[08:47:39] and would be better for THIS purpose
[08:47:40] is
[08:47:53] percentage of hardware failures that required manual emergency intervention
[08:47:56] or something along those lines :)
[08:48:01] because that is something we can (maybe) improve
[08:48:06] although also limited by software etc
[08:48:08] but you get the idea ;)
[08:48:10] yeah
[08:48:32] maybe better than that would be emergency depoolings in general (which tend to be hardware related)
[08:48:35] but that would be broader
[08:48:52] right
[09:20:09] we need to better optimize s8 tables on eqiad: https://phabricator.wikimedia.org/P8506#50907
[09:20:42] the best optimization will be to kill wb_terms!
[09:20:48] indeed
[09:21:00] but still, that is a 30% difference
[09:22:17] even compressed, the difference is not small: 397G vs 439G
[09:26:43] the metadata was not a waste of time! https://phabricator.wikimedia.org/P8506#50908
[09:27:10] * marostegui saving all those useful queries
[09:27:44] 66 GB on ibdata1 we could avoid, for example
[09:28:27] it will be even more useful when we have the table inventory
[09:45:29] DBA, Patch-For-Review: Create a recovery/provisioning script for database binary backups - https://phabricator.wikimedia.org/T219631 (jcrespo) A first version has been integrated into transfer.py: ` root@cumin2001:~$ time transfer.py --no-checksum --no-encrypt --type=decompress dbprov2001.codfw.wmnet:/s...
[09:46:18] DBA, Patch-For-Review: Create a recovery/provisioning script for database binary backups - https://phabricator.wikimedia.org/T219631 (jcrespo) a: jcrespo
[09:59:53] DBA, Patch-For-Review: Create a recovery/provisioning script for database binary backups - https://phabricator.wikimedia.org/T219631 (Marostegui) \o/ Check the cheatsheet I started and please add/remove/modify stuff accordingly to keep all these useful commands somewhere for now :-) https://wikitech.wiki...
[10:06:43] marostegui: would it be easy to measure/track emergency depoolings?
[10:06:51] i guess now they are mediawiki config changes (for core dbs)
[10:06:54] and soon they would be automated
[10:08:32] mark: Yeah, normally they are associated with a ticket (and they have an entry on SAL), so probably we can keep track of them
[10:08:43] And usually associated with a page (in most cases)
[10:08:45] emergency depoolings normally end up being a ticket or incident
[10:09:44] right
[10:09:51] that seems a reasonable metric then
[10:09:57] i just hope that this team has enough influence on that metric...
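A toy sketch of how emergency depools could be counted from an exported SAL log, as discussed above. The file name, the entry format and the regex are assumptions for illustration, not an existing tool.

```python
import re
from collections import Counter

# Assumed export of SAL entries, one per line, e.g.:
# "2019-05-07 10:12 marostegui: Depool db1106 due to crash TXXXXXX"
DEPOOL_RE = re.compile(r"\bdepool", re.IGNORECASE)

def count_depools_per_month(sal_lines):
    """Count SAL entries that mention a depool, grouped by YYYY-MM."""
    per_month = Counter()
    for line in sal_lines:
        if DEPOOL_RE.search(line):
            per_month[line[:7]] += 1  # assumes each line starts with the date
    return per_month

with open("sal-fy2018-2019.txt") as sal:  # assumed export, not a real file
    for month, depools in sorted(count_depools_per_month(sal).items()):
        print(month, depools)
```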
[10:09:58] however, while that is a metric we would like to track, I don't see much relation with the work we do
[10:10:02] yeah
[10:11:24] I do, cause I think we can track how many outages/incidents/non planned things we have to deal with
[10:11:41] which is a big chunk of our work
[10:11:43] as metrics in general should be something related to reality and not wishes (we would like to control failover, but we don't at the moment)
[10:12:05] sure, "incidents/non planned things we have to deal with" I agree
[10:12:15] but that is technically a different metric
[10:13:01] but again, it depends on whether we want to measure database health or team progress/work
[10:14:18] for example if someone goes and drops a table, we will and can fix it, but not sure we can avoid that at the moment
[10:15:45] but the metric isn't about wish lists or preventing things, right?
[10:16:00] I would go for more functional things like tickets/tasks/reviews/incidents done/handled if team work is intended to be measured
[10:16:29] I would prefer if we go for the unexpected things route (as discussed, emergencies)
[10:16:47] so more emergencies == better or less emergencies == better?
[10:17:07] I don't think it has to be looked at that way, it is just to keep track, right?
[10:17:49] well, let me redo the question, of May = 3 emergencies, June => 15 emergencies, what does that say?
[10:17:54] *if
[10:19:02] If it is a year metric, it could say: the DBA team had to handle XXX emergencies, what can we do to prevent or ease them? automatic failover? automatic pooling/depooling? is our hardware not reliable? do we have enough DBAs? etc
[10:19:08] Maybe I am reading it the wrong way
[10:19:26] so, 0 emergencies, all is good?
[10:19:38] my point is that I don't see the usefulness of that
[10:19:41] that's not realistic
[10:20:00] I see the point if we want to evaluate what could be a priority based on that metric
[10:20:00] but following your train of thought I would prefer to measure things like database errors
[10:20:21] more database errors => let's take a closer look at what is happening
[10:20:29] let's see why
[10:20:32] etc.
[10:20:35] sure, I am not saying it has to be emergencies stuff, just giving my opinion on that particular metric :)
[10:20:48] I am also ok with DB errors yeah
[10:20:55] marostegui: so i don't think _these_ metrics are just to keep track
[10:20:58] these are KPIs
[10:21:07] the thing with pages or incidents is that it is not very objective
[10:21:15] so while a lot of those metrics would be good to keep track of
[10:21:24] they are not necessarily metrics that this team can have much influence on
[10:21:39] they are important, we should be aware of them, but they are not motivating or indicating progress for this team
[10:21:46] I see
[10:21:56] so if we are picking only a single metric here, it may make sense to pick one that we can have influence on
[10:22:15] we can obviously not cover everything, or much
[10:22:25] but we can find one that covers at least some important element of the work that you do
[10:22:29] and indicates progress (or not)
[10:22:41] if it's something you can't influence then it's just demotivating
[10:22:52] why bother with it, you have no control over it anyway
[10:23:01] so if we go back to the list we initially created I guess lines: 4,7,8,9,12?
[10:23:11] could be the ones that could potentially applied?
[10:23:15] *apply
[10:23:15] also metrics you can influence with "let's not report this as an incident" I don't like
[10:23:39] marostegui: indeed
[10:23:45] jynus: agreed, we should avoid metrics which can be easily gamed
[10:24:21] I am not a particular fan of line 12, because external requests can be maaaany things, and sometimes they are not for us, or there are IRC/mail conversations and not tickets etc
[10:24:34] yes, any external request is tricky
[10:24:47] "average time since last mariadb start"...
[10:24:51] marostegui: like yours, I was brain dumping :)
[10:24:57] I think i see what you mean, but it's not necessarily the case that making that very short is a good thing, is it ;)
[10:25:07] jynus: sure, just commenting on that, don't take it personally :)
[10:25:16] that means you may be doing (security) upgrades well
[10:25:21] or maybe mariadb is crashing every day
[10:25:25] heh
[10:25:36] neither very short nor very long is good ;)
[10:26:21] well, it is a metric, as you said it doesn't have to be a min_max one :-)
[10:27:01] I would go for uptime as line 10 / line 11
[10:27:07] in pct
[10:27:34] Queries served in which way?
[10:27:50] Maybe average query latency?
[10:28:00] sum(mariadb_global_status) from prometheus
[10:28:06] sum(mariadb_global_status_queries) from prometheus
[10:28:10] :-)
[10:29:19] I don't know, you decide
[10:29:31] I don't know either :)
[10:29:39] They all have good and bad things!
[10:30:06] "average number of hours worked per week"
[10:30:08] see the -foundations channel for some discussion around version upgrades
[10:30:08] What about the version running path? do we want to explore that one?
[10:30:12] there it's about debian versions
[10:30:18] that's tricky too because they release like once a year or less
[10:30:20] and then everything jumps
[10:30:24] but I had a proposal for it
[10:30:29] same for mariadb
[10:30:37] and then you have like now
[10:30:46] a 10.1.40 we are going to skip
[10:30:50] on purpose
[10:31:45] if it is a 1 year metric, we can track upgrades to 10.3/buster
[10:31:52] which is 0 right now
[10:32:30] technically 1 if we count that test host
[10:33:46] yeah maybe buster/10.3.XX is a good one
[10:34:12] so in the -foundations channel I said that you could define a statistical distribution for OS upgrades, in the OS upgrade policy
[10:34:22] "all systems should be upgraded over a period of 4 years after a new release"
[10:34:28] but that distribution is probably not even in their case
[10:34:31] not uniform
[10:34:35] probably more uniform for databases :)
[10:34:45] so you could then track actual upgrades against that distribution in a metric
[10:34:58] so if you go faster than the stats distribution you do well, slower than it then not so well
[10:35:06] but it's tricky and a bit complicated
[10:35:12] and what should the distribution be really
[10:35:28] yeah, and might depend on external things too (ie: OS regression)
[10:35:42] i think there will always be SOME external influences
[10:35:45] and that's ok
[10:35:53] i think we want teams to have a reasonable amount of control over
[10:35:57] the problem is those are good short term metrics
[10:35:59] not necessarily 100% control, and not 0% control
[10:36:09] like ok, no blockers now for X
[10:36:21] but they are limited in scope
[10:37:17] I think errors is ok, there will always be errors, and we can do something about it, and in the end users want less errors
[10:37:49] e.g. we can reduce time for switchover
[10:38:28] the problem is there will be months with the normal baseline of errors
[10:38:41] and others where an incident creates millions of them
[10:38:43] or the average query latency
[10:38:57] but who has most control over that
[10:38:59] you or the developers?
[10:39:00] and they don't necessarily correlate to user impact
[10:39:11] I would say 50%/50%
[10:39:13] they are valid metrics to track
[10:39:23] but not necessarily the best KPI for _this team_
[10:39:36] if you have to pick one
[10:39:36] query latency I don't expect it to change
[10:40:01] and if it changes, it will be because of things like physical location of the top used servers
[10:40:37] yes, there are long running queries, but they are 1 every 10 minutes vs the 300K we have every second
[10:43:28] so far I think I like emergency depools best?
[10:43:35] which also has user impact, it matters
[10:43:40] it's something you have some control over
[10:43:44] (not complete though...)
[10:43:53] it's reasonably easy to measure over time
[10:43:53] sure, but what does that mean?
[10:43:56] Or master's running version?
[10:43:58] and it shows a pain point
[10:44:14] more manual stuff == worse?
[10:44:18] yeah
[10:44:51] i wouldn't say I love that metric, but it seems at least reasonable
[10:44:58] what control do we have over that?
[10:45:08] as in, we do those
[10:45:21] yeah it is a bit limited isn't it
[10:45:28] building more redundancy is not easy just on your end
[10:45:32] but we don't have a plan to work on that, for example
[10:45:34] and you don't control hw failures
[10:45:39] other than not letting things get super old
[10:45:56] in the next year, I mean
[10:45:59] fair
[10:46:03] these are also longer term metrics btw
[10:46:07] tracked up to 3-5 years
[10:46:12] we need a target for next year and over 3-5 years
[10:46:14] ok, then
[10:46:20] so in that sense, we could show progress over a longer time
[10:46:24] then manually handled incidents
[10:46:27] is ok
[10:46:30] yeah
[10:46:35] not only depools
[10:46:41] only problem is, we need to set a target so we need to know what is realistic for next year(s)
[10:46:45] because we may not even have depools at all at some point
[10:46:53] hehe
[10:46:54] well that's ok
[10:46:57] then the metric becomes perfect
[10:47:01] haha
[10:47:09] but I see what you mean
[10:47:32] it is the same metric
[10:47:40] just I am calling it generically
[10:48:17] for example, I guess it would include master switches?
[10:48:23] (planned and unplanned)
[10:48:29] because they are not automated?
[10:48:54] technically it is a "master depool"
[10:49:15] But I thought we said only the "emergencies" one?
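For reference, a hedged sketch of pulling the queries-served figure mentioned above (sum of mariadb_global_status_queries, the roughly 300K queries per second) out of Prometheus via its HTTP API; the Prometheus URL is an assumption.

```python
import requests

PROMETHEUS = "http://prometheus.example.org/ops"  # assumed URL, adjust to the real instance
# The metric name comes from the discussion above; rate() turns the raw
# counter sum into queries per second across the fleet.
QUERY = "sum(rate(mariadb_global_status_queries[5m]))"

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    print(f"~{float(result[0]['value'][1]):,.0f} queries/s served")
```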
[10:49:24] and something we aim to automate
[10:49:36] ok, so incident requiring manual configuration changes
[10:49:40] *incidents
[10:49:46] like a manual depool
[10:49:49] something along those lines I think
[10:49:53] or an emergency master switch
[10:50:00] yeah
[10:50:12] technically, we started automating a master (non emergency) switch
[10:50:17] so it makes sense
[10:51:06] I am ok with that
[10:51:36] if we need another for backups, I am not sure about that one
[10:52:23] I liked line 8, but we are not there yet
[10:52:32] percentage of successful ones doesn't really cover the intentions well
[10:52:32] "successfully tested restoring backups"
[10:52:40] we have partial coverage, for example
[10:53:05] if it is long term, I wouldn't mind that
[10:53:17] but we also don't have that planned for next year
[10:53:21] but in 2 years
[10:54:31] yep
[10:55:58] we can maybe redefine it in 1 year with testing
[10:55:58] yes, let's pick something we make progress on in the upcoming year
[10:56:05] and for now
[10:56:10] we may pick new metrics next year or so
[10:56:12] call it according to monitoring
[10:56:46] we don't have full testing, but we already detect simple errors and problems
[10:56:56] and we do have those from time to time
[10:57:01] yeah that's ok
[10:57:07] i think we had a similar metric planned last year
[10:57:07] and they are a metric of our backup reliability
[10:57:18] that WFM
[10:57:51] and in a year, when we have full testing with the hw we just bought, we got stricter
[10:58:00] *get
[10:58:18] sounds good
[10:58:29] we just need a baseline to set a target for in 1 year
[10:58:32] I reworded lines 2 and 9 below with the discussion, marostegui
[10:58:34] for both metrics
[10:58:54] marostegui: see if they are ok, or if you prefer the original, etc.
[10:59:16] mark: backups or database backups?
[10:59:35] well
[10:59:36] ideally backups
[10:59:40] but if that's not realistic
[10:59:43] then database backups is ok for now?
[10:59:45] I know, but we don't have metrics of those
[10:59:47] now
[10:59:48] yeah
[10:59:55] then let's take db backups for now
[10:59:56] but I intend to change that too :-D
[11:00:00] yes
[11:00:40] jynus: i am ok with the reordering
[11:00:49] I think the 2 are not perfect but are good enough
[11:01:24] I changed "generated correctly" to "generated without errors"
[11:01:30] a bit more specific
[11:01:36] ok if I remove all the other lines which we won't pick?
[11:01:53] yeah move them aside
[11:01:58] they're still nice ideas to keep around
[11:02:17] so
[11:02:24] is one of you able to set a target for number of emergency depools today?
[11:02:30] we need an estimate for next fiscal year
[11:02:37] as a target
[11:02:38] buf
[11:02:44] let me check the historical data
[11:02:56] hmm
[11:02:59] for backups what would be the goal, 99%?
[11:03:01] and "manual configuration changes"
[11:03:05] i don't like that wording either
[11:03:09] if we automate it like the current goal
[11:03:14] then that doesn't fully apply
[11:03:23] I thought that was the point?
[11:03:35] emergency depools on which timeframe? current FY?
[11:03:35] if things are handled automatically == good
[11:03:39] well
[11:03:44] they are not entirely automatic
[11:03:47] you are still manually depooling
[11:03:52] just through an easier process
[11:03:54] that is good
[11:04:00] how about
[11:04:09] "Number of emergency manual replication topology changes"?
[11:04:12] hm no
[11:04:15] that doesn't do it
[11:04:20] depooling is not a topology change
[11:04:25] exactly
[11:04:34] we can call it by its name
[11:04:46] manual replica depools + master failovers
[11:04:56] i guess so
[11:04:57] those are I think most of them
[11:04:57] under emergencies, right?
[11:05:00] yeah
[11:05:01] yeah
[11:05:01] not planned
[11:05:04] ok
[11:05:36] e.g. let's not add planned and slow hardware upgrades
[11:05:40] so they're asking for target numbers "in 1 year" "in 3-5 years"
[11:05:44] agreed
[11:06:20] 90% I am going to say in 1 year
[11:06:26] for backups
[11:06:34] 99% in 5?
[11:06:42] 99.something
[11:06:53] 90 and 98
[11:07:00] for the other
[11:07:12] do we have 10% failing backup runs today?
[11:07:26] we have less than 10 backups (20 on both datacenters)
[11:07:35] 1 failing every time is quite common
[11:07:47] in current heavy development
[11:07:52] ok
[11:07:56] but would that still be the case 1 year from now?
[11:07:56] all failing is quite common
[11:08:04] no the intention
[11:08:06] *not
[11:08:14] assuming staffing etc
[11:08:17] but on a software upgrade
[11:08:24] the most common result is all failing
[11:08:27] and retry
[11:08:33] the question is, what counts as a backup failing?
[11:08:39] "without errors"
[11:08:41] is what it says now
[11:08:52] and failure is common, and then it should just retry
[11:08:53] we could make that "critical errors" (not warnings or something)
[11:09:03] really it should mean "is the backup usable"
[11:09:07] if a retry is not an error
[11:09:08] but yeah we can't actually test that yet
[11:09:15] it should be much closer to 100%
[11:09:20] i think if a retry happens within a reasonable amount of time that's ok
[11:09:22] and is not an error
[11:09:28] agreed
[11:09:32] if it succeeds that is ;)
[11:09:44] yeah, retry is in my backlog
[11:10:00] we do it manually, it has to be automated
[11:10:00] so is, say, 95% and 99% reasonable for 1 year and 3-5 years?
[11:10:11] or even higher than 99% ideally
[11:10:11] but yeah
[11:10:23] i don't know if it's realistic :)
[11:10:31] yeah, lots of unknowns
[11:10:35] yes :(
[11:10:40] because we are building it
[11:10:47] more backups will happen
[11:10:51] more backup types
[11:10:53] etc
[11:11:15] maybe we can change these metrics later still, this is just the first deadline I think
[11:11:16] including ES
[11:11:19] which will be tricky
[11:11:19] but is, say, 95% and 99% ok for now?
[11:11:22] right
[11:11:27] let's say that
[11:11:56] then we should find a definition that could work long term
[11:12:03] ok
[11:12:05] i will put this in for now
[11:12:22] for failovers
[11:12:52] it depends, we had 2 at the same time before the previous switch dc
[11:13:05] and we had 2 in the last month, right?
[11:13:11] yeah
[11:13:32] 10 in a year?
[11:13:39] unplanned?
[11:13:43] yeah
[11:13:46] yeah
[11:13:51] sounds about right
[11:14:11] we haven't had a master failure in a long time
[11:14:12] I recall 4 in the last 5 months
[11:14:27] don't jinx it!!
[11:15:00] oh god
[11:15:03] i fear the upcoming weekend
[11:15:08] hahaha
[11:16:27] I can try to find emergency tickets that required config changes
[11:16:33] 10 and 1
[11:16:36] in the current FY
[11:17:27] we have fast depools at the moment all the time
[11:17:54] fast depools?
[11:18:13] not sure if fast or slow, but "there are errors on X, depool"
[11:18:22] so
[11:18:33] "Number of manually induced database depools + master changes"
[11:18:35] how does that sound?
[11:18:43] sorry
[11:18:44] EMERGENCY
[11:18:47] yeah
[11:19:02] I don't like "induced", but the rest is ok
[11:19:10] i mean to say
[11:19:17] "automated" is not very well defined
[11:19:22] if you have a "depool" command
[11:19:24] which does everything
[11:19:29] that's kinda automated, for the procedure
[11:19:33] but you're still needed to do it
[11:19:38] as opposed to fully automatic failover
[11:19:45] so something that distinguishes between that is maybe needed?
[11:19:48] commanded?
[11:19:53] nah
[11:19:53] chosen?
[11:19:55] instigated?
[11:19:57] triggered?
[11:19:58] lol
[11:20:00] manually triggered
[11:20:03] I like that
[11:20:05] ok
[11:20:38] i will use that for now, putting it in the doc
[11:20:39] "induced" is like you convince the server to go away, "come on please"
[11:20:43] haha
[11:20:45] fair enough
[11:20:52] xddd
[11:21:10] great
[11:21:14] marostegui: if misc, labs and core are accounted for
[11:21:17] if you can get me some targets for these metrics we are done :)
[11:21:27] 10 may even be too unrealistic
[11:21:30] that will be tricky
[11:21:34] e.g. parser cache failure
[11:21:39] I can do checks
[11:21:51] to come up with something approx
[11:22:12] 1/10 95%/99% was the working result
[11:22:17] sorry, 10/1
[11:22:26] mark: I can scan for the current FY and see what I can get
[11:22:29] but yeah, we can refine with better data
[11:23:21] those are yearly results
[11:23:49] but can be reported every month, etc.
[11:24:01] marostegui: great
[11:24:20] i guess we can aggregate over the year and divide by 12 for per month ;)
[11:24:26] we'll have 0.3 depools in a month ;)
[11:24:39] no, but we can honestly have 10 in a day
[11:24:48] and none at other times
[11:25:15] I will do it after lunch and report back here
[11:26:07] understood
[11:27:10] it is all funny, because the main reason we do backups
[11:27:16] is to do recoveries
[11:27:31] but the last thing you want is to have "lots of recoveries"
[11:27:59] but if the goal is to have 0 recoveries, why are you doing backups? XD
[12:00:54] labsdb1011: 86.30% /srv utilization
[12:01:51] I will see if s8 on eqiad can be optimized on a couple of hosts, wikireplicas included
[12:36:13] only 1011? maybe a temporary table?
[13:02:56] mark: I am gathering stats, so far related to emergency master failovers, in the current FY we have done 7 unplanned master failovers (each failover requires a minimum of 4 MW deployments)
[13:03:24] I am going to try to get a number now for emergency config changes not related to master failovers
[13:03:36] ok
[13:46:04] mark: From what I have been able to gather, during the current FY we had 22 unplanned incidents that required depooling replicas in a rush. Being very conservative, each of those would require at the very least 4 commits (depool, repool, weight tackling, and in some cases depooling another host to reclone the dead one), so maybe around 88 MW deployments.
[13:46:10] Do you need more details?
[13:46:38] I think it is way more than 88, but that is the bare minimum
[13:49:36] well
[13:49:39] we don't count commits
[13:49:39] when I said 10, I wasn't counting each individual commit as an action
[13:49:40] we count depools
[13:49:43] however those are implemented
[13:50:19] so 22 is the number we were looking for
[13:50:24] what is a realistic target for next year then?
[13:50:34] probably not really better eh :P
[13:50:36] mark: ok, so then a minimum of 22 depools (possibly a bit more as we had to clone those hosts from other hosts, and early in the FY we didn't have backup sources to use for recloning)
[13:50:51] isn't 29 the number?
[13:50:56] 22 + 7?
[13:51:11] or did you add those already?
[13:51:12] I separated masters from replicas
[13:51:27] for a total of 22 or 29?
[13:51:33] 7 masters depooled + 22 replicas, so a total of 29
[13:51:39] (strictly speaking)
[13:51:42] and the year isn't done yet
[13:51:45] 1.5 months left
[13:51:47] yep
[13:51:52] so 10 would even be unrealistic :-D
[13:51:55] how about we set a target of 30
[13:52:02] is that too optimistic?
[13:52:09] it would show slight improvement
[13:52:10] yeah, as jynus jinxed it, we can expect a master going down this weekend :p
[13:52:28] and if we don't hit it, then we've regressed, which is probably for good reasons we need to investigate
[13:52:29] marostegui: commons, calling it now
[13:52:34] for memory issues
[13:52:36] I am scared of commons indeed
[13:52:47] That is the one I want to replace as soon as possible, even before the end of the FY
[13:53:00] mark: I think 31-32 might be more realistic
[13:55:14] ok
[13:55:22] and 3-5 years?
[13:55:30] let's cut it in half at least
[13:55:31] as an aim
[13:55:33] hopefully much better
[13:55:42] hopefully indeed
[13:55:57] so aiming for maybe 18?
[13:56:02] 16
[13:56:08] haha
[13:56:10] ok
[13:56:53] let's say 10
[13:57:09] 22 were replicas
[13:57:32] less than 1 per month?
[13:57:37] that with etcd + monitoring could be automated "easily"
[13:57:51] unlike masters, which require more work
[14:01:48] well
[14:02:04] manuel ok with that? :)
[14:02:59] I think 10 is too aggressive
[14:03:03] I was saying a guess, best practices normally require discarding the outliers
[14:03:06] Can this metric be re-evaluated in a year maybe?
[14:04:00] (highest and lowest value)
[14:05:27] marostegui: i have no idea :)
[14:05:51] let's go with 16 then if we're more confident, if we do even better, well, great
[14:05:57] +1
[14:06:06] +1
[14:06:42] ok
[14:06:44] i've added this
[14:06:47] and now i'm going off
[14:06:55] bye!
[14:07:02] thanks both :)
[14:07:10] byeee
[14:07:39] I am going off too
[14:07:42] Have a nice weekend!
[14:08:20] bye!
[16:37:07] DBA, Patch-For-Review: Document clearly the mariadb backup and recovery setup - https://phabricator.wikimedia.org/T205626 (jcrespo)
[16:40:43] DBA, Goal: Purchase and setup remaining hosts for database backups - https://phabricator.wikimedia.org/T213406 (jcrespo) Open→Resolved a: jcrespo I'd say, after closing all children, that this is done.
[16:40:45] DBA, Goal, Patch-For-Review: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 (jcrespo)
[16:42:49] DBA, Goal, Patch-For-Review: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (jcrespo) MySQL and Prometheus have been stopped on the above hosts. This is almost ready, only pending waiting some time and seeing if there is something we would like to keep from t...
[16:53:25] DBA, Goal, Patch-For-Review: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 (jcrespo) Things pending I would like to work on: * Proper documentation * Identify failures after X amount of timeout/time passed and easy cleanup of file le...
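One possible way to compute the "generated without errors" percentage discussed above, where a failed run followed by a successful retry within a reasonable window counts as a success rather than an error; the run structure and the 12-hour window are assumptions for illustration.

```python
from datetime import datetime, timedelta

RETRY_WINDOW = timedelta(hours=12)  # "reasonable amount of time" is an assumption

def backup_success_rate(runs):
    """Return the percentage of logical backups that succeeded, where a
    failed attempt followed by a retry started within RETRY_WINDOW that
    eventually finishes still counts as a success.

    `runs` is a hypothetical list of attempts:
      {"section": "s1", "started": datetime(...), "status": "finished" | "failed"}
    """
    by_section = {}
    for run in sorted(runs, key=lambda r: r["started"]):
        by_section.setdefault(run["section"], []).append(run)

    successes = 0
    for attempts in by_section.values():
        for i, attempt in enumerate(attempts):
            if attempt["status"] == "finished":
                successes += 1
                break
            # Failed: keep looking only if a retry started within the window.
            nxt = attempts[i + 1] if i + 1 < len(attempts) else None
            if nxt is None or nxt["started"] - attempt["started"] > RETRY_WINDOW:
                break
    return 100.0 * successes / len(by_section) if by_section else 100.0

# Hypothetical example: s1 fails once but the retry succeeds, s2 never recovers.
runs = [
    {"section": "s1", "started": datetime(2019, 5, 10, 0, 0), "status": "failed"},
    {"section": "s1", "started": datetime(2019, 5, 10, 2, 0), "status": "finished"},
    {"section": "s2", "started": datetime(2019, 5, 10, 0, 0), "status": "failed"},
]
print(backup_success_rate(runs))  # 50.0
```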
[17:26:58] DBA, Goal, Patch-For-Review: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (Marostegui) I got green light from Chase via email to decom these hosts
[18:42:09] DBA: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 (Marostegui)
[18:42:24] DBA: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 (Marostegui) p: Triage→High
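A small sketch of how the compression and defragmentation run from the last ticket could be scripted, generating one rebuild statement per table; the table list and key block size are assumptions, not the actual plan for the labsdb hosts.

```python
# Hypothetical table list; the real set would come from the labsdb hosts.
TABLES = ["revision", "pagelinks", "templatelinks"]

def compression_statements(tables, key_block_size=8):
    """Yield ALTER TABLE statements that compress (InnoDB compressed row
    format) and defragment (full table rebuild) each table."""
    for table in tables:
        yield (f"ALTER TABLE {table} ENGINE=InnoDB "
               f"ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE={key_block_size};")

for statement in compression_statements(TABLES):
    print(statement)
```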