[04:55:22] DBA, Goal, Patch-For-Review: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2114.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201905100455_ma...
[05:40:34] DBA, Goal, Patch-For-Review: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2114.codfw.wmnet'] ` and were **ALL** successful.
[05:40:50] DBA, Goal, Patch-For-Review: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (Marostegui)
[05:41:03] DBA, Operations, ops-codfw: db2114 hardware problem - https://phabricator.wikimedia.org/T222753 (Marostegui) This host has been re-imaged successfully
[06:08:02] so snapshots are ongoing, now on their last ones
[06:08:12] and individually they are not too slow
[06:08:41] \o/
[06:08:43] but because they are generated in random order, there is a huge amount of resources wasted (e.g. 2 generated on a single dbprov most of the time)
[06:08:43] good!
[06:09:06] So I think I am going to implement a generic sort option
[06:11:02] DBA, Goal, Patch-For-Review: Productionize db2[103-120] - https://phabricator.wikimedia.org/T222772 (Marostegui)
[06:11:11] if you need to provision some, they are almost all done except s1 and s5
[06:11:37] I might take s4 yeah
[06:11:46] Not sure yet if today or monday
[06:11:53] btw
[06:12:01] should we disable notifications for backup sources?
[06:12:07] to avoid them sending irc alerts?
[06:12:08] yeah, probably
[06:12:12] on lag
[06:12:16] yeah
[06:12:41] for the ones that do stop slave
[06:12:43] at least
[06:12:59] yeah
[06:16:03] or we can increase the alert to a few hours
[06:16:20] up to you, I don't mind either way :)
[06:16:21] on all dbstores
[06:16:43] that way we know something bad is happening if it lags for e.g. > 1 day
[06:17:01] it will still show up in icinga, so we would see it
[06:26:24] The stats gathered are interesting https://phabricator.wikimedia.org/P8506
[06:30:34] so revision_actor_temp is now almost as big as revision!
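A minimal sketch of what a "generic sort option" like the one mentioned above could look like: interleaving snapshot jobs across destination hosts so that two snapshots don't pile up on the same dbprov. The job structure and host names are hypothetical, not the actual backup scheduler.

```python
from collections import defaultdict
from itertools import zip_longest

def sort_by_destination(jobs):
    """Interleave snapshot jobs so consecutive runs hit different
    destination hosts (round-robin over destinations), instead of
    letting a random order send two snapshots to the same dbprov."""
    by_dest = defaultdict(list)
    for job in jobs:
        by_dest[job["destination"]].append(job)
    ordered = []
    # Take one job per destination on each pass until all queues drain.
    for batch in zip_longest(*by_dest.values()):
        ordered.extend(j for j in batch if j is not None)
    return ordered

# Hypothetical example: four sections assigned to two provisioning hosts.
jobs = [
    {"section": "s1", "destination": "dbprov2001"},
    {"section": "s2", "destination": "dbprov2001"},
    {"section": "s3", "destination": "dbprov2002"},
    {"section": "s4", "destination": "dbprov2002"},
]
print([j["section"] for j in sort_by_destination(jobs)])  # ['s1', 's3', 's2', 's4']
```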
[07:12:58] jynus mark I have started a brainstorm ideas section for the metrics on our etherpad - line #1
[07:13:18] as it says, it is just a brainstorm
[07:13:48] some will be discarded, some others can maybe develop into something better
[08:40:45] marostegui: check https://gerrit.wikimedia.org/r/c/operations/puppet/+/509343
[08:40:50] ok
[08:46:05] great, thanks :)
[08:46:41] a few quick comments
[08:46:53] there are of course all kinds of metrics we could track to see things that affect the team
[08:46:57] such as hardware failures etc
[08:47:04] but hardware failures are not something you can control/improve
[08:47:14] so while it's useful to know, that's not a great metric for _this_ purpose
[08:47:20] yeah, but they are time consuming
[08:47:21] a related one
[08:47:22] yes
[08:47:27] but that's not the point for this
[08:47:27] As I said, it is just a brainstorm :)
[08:47:30] ok
[08:47:32] so useful to track, but not for annual planning
[08:47:34] so one that is related
[08:47:39] and would be better for THIS purpose
[08:47:40] is
[08:47:53] percentage of hardware failures that required manual emergency intervention
[08:47:56] or something along those lines :)
[08:48:01] because that is something we can (maybe) improve
[08:48:06] although also limited by software etc
[08:48:08] but you get the idea ;)
[08:48:10] yeah
[08:48:32] maybe better than that would be emergency depoolings in general (which tend to be hardware related)
[08:48:35] but that would be broader
[08:48:52] right
[09:20:09] we need to better optimize s8 tables on eqiad: https://phabricator.wikimedia.org/P8506#50907
[09:20:42] the best optimization will be to kill wb_terms!
[09:20:48] indeed
[09:21:00] but still, that is a 30% difference
[09:22:17] even compressed, the difference is not small: 397G vs 439G
[09:26:43] the metadata was not a waste of time! https://phabricator.wikimedia.org/P8506#50908
[09:27:10] * marostegui saving all those useful queries
[09:27:44] 66 GB on ibdata1 we could avoid, for example
[09:28:27] it will be even more useful when we have the table inventory
[09:45:29] DBA, Patch-For-Review: Create a recovery/provisioning script for database binary backups - https://phabricator.wikimedia.org/T219631 (jcrespo) A first version has been integrated into transfer.py: ` root@cumin2001:~$ time transfer.py --no-checksum --no-encrypt --type=decompress dbprov2001.codfw.wmnet:/s...
[09:46:18] DBA, Patch-For-Review: Create a recovery/provisioning script for database binary backups - https://phabricator.wikimedia.org/T219631 (jcrespo) a: jcrespo
[09:59:53] DBA, Patch-For-Review: Create a recovery/provisioning script for database binary backups - https://phabricator.wikimedia.org/T219631 (Marostegui) \o/ Check the cheatsheet I started and please add/remove/modify stuff accordingly to keep all these useful commands somewhere for now :-) https://wikitech.wiki...
[10:06:43] marostegui: would it be easy to measure/track emergency depoolings?
[10:06:51] i guess now they are mediawiki config changes (for core dbs)
[10:06:54] and soon they would be automated
[10:08:32] mark: Yeah, normally they are associated with a ticket (and they have an entry on SAL), so probably we can keep track of them
[10:08:43] And usually associated with a page (in most cases)
[10:08:45] emergency depoolings normally end up being a ticket or incident
[10:09:44] right
[10:09:51] that seems a reasonable metric then
[10:09:57] i just hope that this team has enough influence on that metric...
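A toy sketch of how emergency depools could be counted from an exported SAL log, as discussed above. The file name, the entry format and the regex are assumptions for illustration, not an existing tool.

```python
import re
from collections import Counter

# Assumed export of SAL entries, one per line, e.g.:
# "2019-05-07 10:12 marostegui: Depool db1106 due to crash TXXXXXX"
DEPOOL_RE = re.compile(r"\bdepool", re.IGNORECASE)

def count_depools_per_month(sal_lines):
    """Count SAL entries that mention a depool, grouped by YYYY-MM."""
    per_month = Counter()
    for line in sal_lines:
        if DEPOOL_RE.search(line):
            per_month[line[:7]] += 1  # assumes each line starts with the date
    return per_month

with open("sal-fy2018-2019.txt") as sal:  # assumed export, not a real file
    for month, depools in sorted(count_depools_per_month(sal).items()):
        print(month, depools)
```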
[10:09:58] however, while that is a metric we would like to track, I don't see much relation with the work we do
[10:10:02] yeah
[10:11:24] I do, cause I think we can track how many outages/incidents/non planned things we have to deal with
[10:11:41] which is a big chunk of our work
[10:11:43] as metrics in general should be something related to reality and not wishes (we would like to control failover, but we don't at the moment)
[10:12:05] sure, "incidents/non planned things we have to deal with" I agree
[10:12:15] but that is technically a different metric
[10:13:01] but again, it depends on whether we want to measure database health or team progress/work
[10:14:18] for example if someone goes and drops a table, we will and can fix it, but not sure we can avoid that at the moment
[10:15:45] but the metric isn't about wish lists or preventing things, right?
[10:16:00] I would go for more functional things like tickets/tasks/reviews/incidents done/handled if team work is intended to be measured
[10:16:29] I would prefer if we go for the unexpected things route (as discussed, emergencies)
[10:16:47] so more emergencies == better or less emergencies == better?
[10:17:07] I don't think it has to be looked at that way, it is just to keep track, right?
[10:17:49] well, let me redo the question, of May = 3 emergencies, June => 15 emergencies, what does that say?
[10:17:54] *if
[10:19:02] If it is a year metric, it could say: the DBA team had to handle XXX emergencies, what can we do to prevent or ease them? automatic failover? automatic pooling/depooling? is our hardware not reliable? do we have enough DBAs? etc
[10:19:08] Maybe I am reading it the wrong way
[10:19:26] so, 0 emergencies, all is good?
[10:19:38] my point is that I don't see the usefulness of that
[10:19:41] that's not realistic
[10:20:00] I see the point if we want to evaluate what could be a priority based on that metric
[10:20:00] but following your train of thought I would prefer to measure things like database errors
[10:20:21] more database errors => let's take a closer look at what is happening
[10:20:29] let's see why
[10:20:32] etc.
[10:20:35] sure, I am not saying it has to be emergencies stuff, just giving my opinion on that particular metric :)
[10:20:48] I am also ok with DB errors yeah
[10:20:55] marostegui: so i don't think _these_ metrics are just to keep track
[10:20:58] these are KPIs
[10:21:07] the thing with pages or incidents is that it is not very objective
[10:21:15] so while a lot of those metrics would be good to keep track of
[10:21:24] they are not necessarily metrics that this team can have much influence on
[10:21:39] they are important, we should be aware of them, but they are not motivating or indicating progress for this team
[10:21:46] I see
[10:21:56] so if we are picking only a single metric here, it may make sense to pick one that we can have influence on
[10:22:15] we can obviously not cover everything, or much
[10:22:25] but we can find one that covers at least some important element of the work that you do
[10:22:29] and indicates progress (or not)
[10:22:41] if it's something you can't influence then it's just demotivating
[10:22:52] why bother with it, you have no control over it anyway
[10:23:01] so if we go back to the list we initially created I guess lines: 4,7,8,9,12?
[10:23:11] could be the ones that could potentially applied?
[10:23:15] *apply
[10:23:15] also metrics you can influence with "let's not report this as an incident" I don't like
[10:23:39] marostegui: indeed
[10:23:45] jynus: agreed, we should avoid metrics which can be easily gamed
[10:24:21] I am not a particular fan of line 12, because external requests can be maaaany things, and sometimes they are not for us, or there are IRC/mail conversations and not tickets etc
[10:24:34] yes, any external request is tricky
[10:24:47] "average time since last mariadb start"...
[10:24:51] marostegui: like yours, I was brain dumping :)
[10:24:57] I think i see what you mean, but it's not necessarily the case that making that very short is a good thing, is it ;)
[10:25:07] jynus: sure, just commenting on that, don't take it personally :)
[10:25:16] that means you may be doing (security) upgrades well
[10:25:21] or maybe mariadb is crashing every day
[10:25:25] heh
[10:25:36] neither very short nor very long is good ;)
[10:26:21] well, it is a metric, as you said it doesn't have to be a min_max one :-)
[10:27:01] I would go for uptime as line 10 / line 11
[10:27:07] in pct
[10:27:34] Queries served in which way?
[10:27:50] Maybe average query latency?
[10:28:00] sum(mariadb_global_status) from prometheus
[10:28:06] sum(mariadb_global_status_queries) from prometheus
[10:28:10] :-)
[10:29:19] I don't know, you decide
[10:29:31] I don't know either :)
[10:29:39] They all have good and bad things!
[10:30:06] "average number of hours worked per week"
[10:30:08] see the -foundations channel for some discussion around version upgrades
[10:30:08] What about the version running path? do we want to explore that one?
[10:30:12] there it's about debian versions
[10:30:18] that's tricky too because they release like once a year or less
[10:30:20] and then everything jumps
[10:30:24] but I had a proposal for it
[10:30:29] same for mariadb
[10:30:37] and then you have like now
[10:30:46] a 10.1.40 we are going to skip
[10:30:50] on purpose
[10:31:45] if it is a 1 year metric, we can track upgrades to 10.3/buster
[10:31:52] which is 0 right now
[10:32:30] technically 1 if we count that test host
[10:33:46] yeah maybe buster/10.3.XX is a good one
[10:34:12] so in the -foundations channel I said that you could define a statistical distribution for OS upgrades, in the OS upgrade policy
[10:34:22] "all systems should be upgraded over a period of 4 years after a new release"
[10:34:28] but that distribution is probably not even in their case
[10:34:31] not uniform
[10:34:35] probably more uniform for databases :)
[10:34:45] so you could then track actual upgrades against that distribution in a metric
[10:34:58] so if you go faster than the stats distribution you do well, slower than it then not so well
[10:35:06] but it's tricky and a bit complicated
[10:35:12] and what should the distribution be really
[10:35:28] yeah, and might depend on external things too (ie: OS regression)
[10:35:42] i think there will always be SOME external influences
[10:35:45] and that's ok
[10:35:53] i think we want teams to have a reasonable amount of control over
[10:35:57] the problem is those are good short term metrics
[10:35:59] not necessarily 100% control, and not 0% control
[10:36:09] like ok, no blockers now for X
[10:36:21] but they are limited in scope
[10:37:17] I think errors is ok, there will always be errors, and we can do something about it, and in the end users want less errors
[10:37:49] e.g. we can reduce time for switchover
[10:38:28] the problem is there will be months with the normal baseline of errors
[10:38:41] and others where an incident creates millions of them
[10:38:43] or the average query latency
[10:38:57] but who has most control over that
[10:38:59] you or the developers?
[10:39:00] and they don't necessarily correlate to user impact
[10:39:11] I would say 50%/50%
[10:39:13] they are valid metrics to track
[10:39:23] but not necessarily the best KPI for _this team_
[10:39:36] if you have to pick one
[10:39:36] query latency I don't expect it to change
[10:40:01] and if it changes, it will be because of things like physical location of the top used servers
[10:40:37] yes, there are long running queries, but they are 1 every 10 minutes vs the 300K we have every second
[10:43:28] so far I think I like emergency depools best?
[10:43:35] which also has user impact, it matters
[10:43:40] it's something you have some control over
[10:43:44] (not complete though...)
[10:43:53] it's reasonably easy to measure over time
[10:43:53] sure, but what does that mean?
[10:43:56] Or master's running version?
[10:43:58] and it shows a pain point
[10:44:14] more manual stuff == worse?
[10:44:18] yeah
[10:44:51] i wouldn't say I love that metric, but it seems at least reasonable
[10:44:58] what control do we have over that?
[10:45:08] as in, we do those
[10:45:21] yeah it is a bit limited isn't it
[10:45:28] building more redundancy is not easy just on your end
[10:45:32] but we don't have a plan to work on that, for example
[10:45:34] and you don't control hw failures
[10:45:39] other than not letting things get super old
[10:45:56] in the next year, I mean
[10:45:59] fair
[10:46:03] these are also longer term metrics btw
[10:46:07] tracked up to 3-5 years
[10:46:12] we need a target for next year and over 3-5 years
[10:46:14] ok, then
[10:46:20] so in that sense, we could show progress over a longer time
[10:46:24] then manually handled incidents
[10:46:27] is ok
[10:46:30] yeah
[10:46:35] not only depools
[10:46:41] only problem is, we need to set a target so we need to know what is realistic for next year(s)
[10:46:45] because we may not even have depools at all at some point
[10:46:53] hehe
[10:46:54] well that's ok
[10:46:57] then the metric becomes perfect
[10:47:01] haha
[10:47:09] but I see what you mean
[10:47:32] it is the same metric
[10:47:40] just I am calling it generically
[10:48:17] for example, I guess it would include master switches?
[10:48:23] (planned and unplanned)
[10:48:29] because they are not automated?
[10:48:54] technically it is a "master depool"
[10:49:15] But I thought we said only the "emergencies" one?
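For reference, a hedged sketch of pulling the queries-served figure mentioned above (sum of mariadb_global_status_queries, the roughly 300K queries per second) out of Prometheus via its HTTP API; the Prometheus URL is an assumption.

```python
import requests

PROMETHEUS = "http://prometheus.example.org/ops"  # assumed URL, adjust to the real instance
# The metric name comes from the discussion above; rate() turns the raw
# counter sum into queries per second across the fleet.
QUERY = "sum(rate(mariadb_global_status_queries[5m]))"

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    print(f"~{float(result[0]['value'][1]):,.0f} queries/s served")
```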
[10:49:24] and something we aim to automate
[10:49:36] ok, so incident requiring manual configuration changes
[10:49:40] *incidents
[10:49:46] like a manual depool
[10:49:49] something along those lines I think
[10:49:53] or an emergency master switch
[10:50:00] yeah
[10:50:12] technically, we started automating a master (non emergency) switch
[10:50:17] so it makes sense
[10:51:06] I am ok with that
[10:51:36] if we need another for backups, I am not sure about that one
[10:52:23] I liked line 8, but we are not there yet
[10:52:32] percentage of successful ones doesn't really cover the intentions well
[10:52:32] "successfully tested restoring backups"
[10:52:40] we have partial coverage, for example
[10:53:05] if it is long term, I wouldn't mind that
[10:53:17] but we also don't have that planned for next year
[10:53:21] but in 2 years
[10:54:31] yep
[10:55:58] we can maybe redefine it in 1 year with testing
[10:55:58] yes, let's pick something we make progress on in the upcoming year
[10:56:05] and for now
[10:56:10] we may pick new metrics next year or so
[10:56:12] call it according to monitoring
[10:56:46] we don't have full testing, but we already detect simple errors and problems
[10:56:56] and we do have those from time to time
[10:57:01] yeah that's ok
[10:57:07] i think we had a similar metric planned last year
[10:57:07] and they are a metric of our backup reliability
[10:57:18] that WFM
[10:57:51] and in a year, when we have full testing with the hw we just bought, we got stricter
[10:58:00] *get
[10:58:18] sounds good
[10:58:29] we just need a baseline to set a target for in 1 year
[10:58:32] I reworded lines 2 and 9 below with the discussion, marostegui
[10:58:34] for both metrics
[10:58:54] marostegui: see if they are ok, or if you prefer the original, etc.
[10:59:16] mark: backups or database backups?
[10:59:35] well
[10:59:36] ideally backups
[10:59:40] but if that's not realistic
[10:59:43] then database backups is ok for now?
[10:59:45] I know, but we don't have metrics of those
[10:59:47] now
[10:59:48] yeah
[10:59:55] then let's take db backups for now
[10:59:56] but I intend to change that too :-D
[11:00:00] yes
[11:00:40] jynus: i am ok with the reordering
[11:00:49] I think the 2 are not perfect but are good enough
[11:01:24] I changed "generated correctly" to "generated without errors"
[11:01:30] a bit more specific
[11:01:36] ok if I remove all the other lines which we won't pick?
[11:01:53] yeah move them aside
[11:01:58] they're still nice ideas to keep around
[11:02:17] so
[11:02:24] is one of you able to set a target for number of emergency depools today?
[11:02:30] we need an estimate for next fiscal year
[11:02:37] as a target
[11:02:38] buf
[11:02:44] let me check the historical data
[11:02:56] hmm
[11:02:59] for backups what would be the goal, 99%?
[11:03:01] and "manual configuration changes"
[11:03:05] i don't like that wording either
[11:03:09] if we automate it like the current goal
[11:03:14] then that doesn't fully apply
[11:03:23] I thought that was the point?
[11:03:35] emergency depools on which timeframe? current FY?
[11:03:35] if things are handled automatically == good
[11:03:39] well
[11:03:44] they are not entirely automatic
[11:03:47] you are still manually depooling
[11:03:52] just through an easier process
[11:03:54] that is good
[11:04:00] how about
[11:04:09] "Number of emergency manual replication topology changes"?
[11:04:12] hm no
[11:04:15] that doesn't do it
[11:04:20] depooling is not a topology change
[11:04:25] exactly
[11:04:34] we can call it by its name
[11:04:46] manual replica depools + master failovers
[11:04:56] i guess so
[11:04:57] those are I think most of them
[11:04:57] under emergencies, right?
[11:05:00] yeah
[11:05:01] yeah
[11:05:01] not planned
[11:05:04] ok
[11:05:36] e.g. let's not add planned and slow hardware upgrades
[11:05:40] so they're asking for target numbers "in 1 year" "in 3-5 years"
[11:05:44] agreed
[11:06:20] 90% I am going to say in 1 year
[11:06:26] for backups
[11:06:34] 99% in 5?
[11:06:42] 99.something
[11:06:53] 90 and 98
[11:07:00] for the other
[11:07:12] do we have 10% failing backup runs today?
[11:07:26] we have less than 10 backups (20 on both datacenters)
[11:07:35] 1 failing every time is quite common
[11:07:47] in current heavy development
[11:07:52] ok
[11:07:56] but would that still be the case 1 year from now?
[11:07:56] all failing is quite common
[11:08:04] no the intention
[11:08:06] *not
[11:08:14] assuming staffing etc
[11:08:17] but on a software upgrade
[11:08:24] the most common result is all failing
[11:08:27] and retry
[11:08:33] the question is, what counts as a backup failing?
[11:08:39] "without errors"
[11:08:41] is what it says now
[11:08:52] and failure is common, and then it should just retry
[11:08:53] we could make that "critical errors" (not warnings or something)
[11:09:03] really it should mean "is the backup usable"
[11:09:07] if a retry is not an error
[11:09:08] but yeah we can't actually test that yet
[11:09:15] it should be much closer to 100%
[11:09:20] i think if a retry happens within a reasonable amount of time that's ok
[11:09:22] and is not an error
[11:09:28] agreed
[11:09:32] if it succeeds that is ;)
[11:09:44] yeah, retry is in my backlog
[11:10:00] we do it manually, it has to be automated
[11:10:00] so is, say, 95% and 99% reasonable for 1 year and 3-5 years?
[11:10:11] or even higher than 99% ideally
[11:10:11] but yeah
[11:10:23] i don't know if it's realistic :)
[11:10:31] yeah, lots of unknowns
[11:10:35] yes :(
[11:10:40] because we are building it
[11:10:47] more backups will happen
[11:10:51] more backup types
[11:10:53] etc
[11:11:15] maybe we can change these metrics later still, this is just the first deadline I think
[11:11:16] including ES
[11:11:19] which will be tricky
[11:11:19] but is, say, 95% and 99% ok for now?
[11:11:22] right
[11:11:27] let's say that
[11:11:56] then we should find a definition that could work long term
[11:12:03] ok
[11:12:05] i will put this in for now
[11:12:22] for failovers
[11:12:52] it depends, we had 2 at the same time before the previous switch dc
[11:13:05] and we had 2 in the last month, right?
[11:13:11] yeah
[11:13:32] 10 in a year?
[11:13:39] unplanned?
[11:13:43] yeah
[11:13:46] yeah
[11:13:51] sounds about right
[11:14:11] we haven't had a master failure in a long time
[11:14:12] I recall 4 in the last 5 months
[11:14:27] don't jinx it!!
[11:15:00] oh god
[11:15:03] i fear the upcoming weekend
[11:15:08] hahaha
[11:16:27] I can try to find emergency tickets that required config changes
[11:16:33] 10 and 1
[11:16:36] in the current FY
[11:17:27] we have fast depools at the moment all the time
[11:17:54] fast depools?
[11:18:13] not sure if fast or slow, but "there are errors on X, depool"
[11:18:22] so
[11:18:33] "Number of manually induced database depools + master changes"
[11:18:35] how does that sound?
[11:18:43] sorry
[11:18:44] EMERGENCY
[11:18:47] yeah
[11:19:02] I don't like "induced", but the rest is ok
[11:19:10] i mean to say
[11:19:17] "automated" is not very well defined
[11:19:22] if you have a "depool" command
[11:19:24] which does everything
[11:19:29] that's kinda automated, for the procedure
[11:19:33] but you're still needed to do it
[11:19:38] as opposed to fully automatic failover
[11:19:45] so something that distinguishes between that is maybe needed?
[11:19:48] commanded?
[11:19:53] nah
[11:19:53] chosen?
[11:19:55] instigated?
[11:19:57] triggered?
[11:19:58] lol
[11:20:00] manually triggered
[11:20:03] I like that
[11:20:05] ok
[11:20:38] i will use that for now, putting it in the doc
[11:20:39] "induced" is like you convince the server to go away, "come on please"
[11:20:43] haha
[11:20:45] fair enough
[11:20:52] xddd
[11:21:10] great
[11:21:14] marostegui: if misc, labs and core are accounted for
[11:21:17] if you can get me some targets for these metrics we are done :)
[11:21:27] 10 may even be too unrealistic
[11:21:30] that will be tricky
[11:21:34] e.g. parser cache failure
[11:21:39] I can do checks
[11:21:51] to come up with something approx
[11:22:12] 1/10 95%/99% was the working result
[11:22:17] sorry, 10/1
[11:22:26] mark: I can scan for the current FY and see what I can get
[11:22:29] but yeah, we can refine with better data
[11:23:21] those are yearly results
[11:23:49] but can be reported every month, etc.
[11:24:01] marostegui: great
[11:24:20] i guess we can aggregate over the year and divide by 12 for per month ;)
[11:24:26] we'll have 0.3 depools in a month ;)
[11:24:39] no, but we can honestly have 10 in a day
[11:24:48] and none at other times
[11:25:15] I will do it after lunch and report back here
[11:26:07] understood
[11:27:10] it is all funny, because the main reason we do backups
[11:27:16] is to do recoveries
[11:27:31] but the last thing you want is to have "lots of recoveries"
[11:27:59] but if the goal is to have 0 recoveries, why are you doing backups? XD
[12:00:54] labsdb1011: 86.30% /srv utilization
[12:01:51] I will see if s8 on eqiad can be optimized on a couple of hosts, wikireplicas included
[12:36:13] only 1011? maybe a temporary table?
[13:02:56] mark: I am gathering stats, so far related to emergency master failovers, in the current FY we have done 7 unplanned master failovers (each failover requires a minimum of 4 MW deployments)
[13:03:24] I am going to try to get a number now for emergency config changes not related to master failovers
[13:03:36] ok
[13:46:04] mark: From what I have been able to gather, during the current FY we had 22 unplanned incidents that required depooling replicas in a rush. Being very conservative, each of those would require at the very least 4 commits (depool, repool, weight tackling, and in some cases depooling another host to reclone the dead one), so maybe around 88 MW deployments.
[13:46:10] Do you need more details?
[13:46:38] I think it is way more than 88, but that is the bare minimum
[13:49:36] well
[13:49:39] we don't count commits
[13:49:39] when I said 10, I wasn't counting each individual commit as an action
[13:49:40] we count depools
[13:49:43] however those are implemented
[13:50:19] so 22 is the number we were looking for
[13:50:24] what is a realistic target for next year then?
[13:50:34] probably not really better eh :P
[13:50:36] mark: ok, so then a minimum of 22 depools (possibly a bit more as we had to clone those hosts from other hosts, and early in the FY we didn't have backup sources to use for recloning)
[13:50:51] isn't 29 the number?
[13:50:56] 22 + 7?
[13:51:11] or did you add those already?
[13:51:12] I separated masters from replicas
[13:51:27] for a total of 22 or 29?
[13:51:33] 7 masters depooled + 22 replicas, so a total of 29
[13:51:39] (strictly speaking)
[13:51:42] and the year isn't done yet
[13:51:45] 1.5 months left
[13:51:47] yep
[13:51:52] so 10 would even be unrealistic :-D
[13:51:55] how about we set a target of 30
[13:52:02] is that too optimistic?
[13:52:09] it would show slight improvement
[13:52:10] yeah, as jynus jinxed it, we can expect a master going down this weekend :p
[13:52:28] and if we don't hit it, then we've regressed, which is probably for good reasons we need to investigate
[13:52:29] marostegui: commons, calling it now
[13:52:34] for memory issues
[13:52:36] I am scared of commons indeed
[13:52:47] That is the one I want to replace as soon as possible, even before the end of the FY
[13:53:00] mark: I think 31-32 might be more realistic
[13:55:14] ok
[13:55:22] and 3-5 years?
[13:55:30] let's cut it in half at least
[13:55:31] as an aim
[13:55:33] hopefully much better
[13:55:42] hopefully indeed
[13:55:57] so aiming for maybe 18?
[13:56:02] 16
[13:56:08] haha
[13:56:10] ok
[13:56:53] let's say 10
[13:57:09] 22 were replicas
[13:57:32] less than 1 per month?
[13:57:37] that with etcd + monitoring could be automated "easily"
[13:57:51] unlike masters, which require more work
[14:01:48] well
[14:02:04] manuel ok with that? :)
[14:02:59] I think 10 is too aggressive
[14:03:03] I was saying a guess, best practices normally require discarding the outliers
[14:03:06] Can this metric be re-evaluated in a year maybe?
[14:04:00] (highest and lowest value)
[14:05:27] marostegui: i have no idea :)
[14:05:51] let's go with 16 then if we're more confident, if we do even better, well, great
[14:05:57] +1
[14:06:06] +1
[14:06:42] ok
[14:06:44] i've added this
[14:06:47] and now i'm going off
[14:06:55] bye!
[14:07:02] thanks both :)
[14:07:10] byeee
[14:07:39] I am going off too
[14:07:42] Have a nice weekend!
[14:08:20] bye!
[16:37:07] DBA, Patch-For-Review: Document clearly the mariadb backup and recovery setup - https://phabricator.wikimedia.org/T205626 (jcrespo)
[16:40:43] DBA, Goal: Purchase and setup remaining hosts for database backups - https://phabricator.wikimedia.org/T213406 (jcrespo) Open→Resolved a: jcrespo I'd say, after closing all children, that this is done.
[16:40:45] DBA, Goal, Patch-For-Review: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 (jcrespo)
[16:42:49] DBA, Goal, Patch-For-Review: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (jcrespo) MySQL and Prometheus have been stopped on the above hosts. This is almost ready, only pending waiting some time and seeing if there is something we would like to keep from t...
[16:53:25] DBA, Goal, Patch-For-Review: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 (jcrespo) Things pending I would like to work on: * Proper documentation * Identify failures after X amount of timeout/time passed and easy cleanup of file le...
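One possible way to compute the "generated without errors" percentage discussed above, where a failed run followed by a successful retry within a reasonable window counts as a success rather than an error; the run structure and the 12-hour window are assumptions for illustration.

```python
from datetime import datetime, timedelta

RETRY_WINDOW = timedelta(hours=12)  # "reasonable amount of time" is an assumption

def backup_success_rate(runs):
    """Return the percentage of logical backups that succeeded, where a
    failed attempt followed by a retry started within RETRY_WINDOW that
    eventually finishes still counts as a success.

    `runs` is a hypothetical list of attempts:
      {"section": "s1", "started": datetime(...), "status": "finished" | "failed"}
    """
    by_section = {}
    for run in sorted(runs, key=lambda r: r["started"]):
        by_section.setdefault(run["section"], []).append(run)

    successes = 0
    for attempts in by_section.values():
        for i, attempt in enumerate(attempts):
            if attempt["status"] == "finished":
                successes += 1
                break
            # Failed: keep looking only if a retry started within the window.
            nxt = attempts[i + 1] if i + 1 < len(attempts) else None
            if nxt is None or nxt["started"] - attempt["started"] > RETRY_WINDOW:
                break
    return 100.0 * successes / len(by_section) if by_section else 100.0

# Hypothetical example: s1 fails once but the retry succeeds, s2 never recovers.
runs = [
    {"section": "s1", "started": datetime(2019, 5, 10, 0, 0), "status": "failed"},
    {"section": "s1", "started": datetime(2019, 5, 10, 2, 0), "status": "finished"},
    {"section": "s2", "started": datetime(2019, 5, 10, 0, 0), "status": "failed"},
]
print(backup_success_rate(runs))  # 50.0
```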
[17:26:58] DBA, Goal, Patch-For-Review: Decommission dbstore1001, dbstore2001, dbstore2002 - https://phabricator.wikimedia.org/T220002 (Marostegui) I got green light from Chase via email to decom these hosts
[18:42:09] DBA: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 (Marostegui)
[18:42:24] DBA: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978 (Marostegui) p: Triage→High
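A small sketch of how the compression and defragmentation run from the last ticket could be scripted, generating one rebuild statement per table; the table list and key block size are assumptions, not the actual plan for the labsdb hosts.

```python
# Hypothetical table list; the real set would come from the labsdb hosts.
TABLES = ["revision", "pagelinks", "templatelinks"]

def compression_statements(tables, key_block_size=8):
    """Yield ALTER TABLE statements that compress (InnoDB compressed row
    format) and defragment (full table rebuild) each table."""
    for table in tables:
        yield (f"ALTER TABLE {table} ENGINE=InnoDB "
               f"ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE={key_block_size};")

for statement in compression_statements(TABLES):
    print(statement)
```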