[09:15:12] hey all
[09:15:23] any new thoughts on a metric for this team?
[09:16:15] <_joe_> time spent debugging wiki templates with a new php engine :D
[09:17:37] I looked yesterday briefly but nothing has come to mind yet heh
[09:21:54] i don't think this team spends a lot of time debugging wiki templates
[09:22:07] so they're doing awesome on that front
[09:23:06] anyway, the deadline is today, and the risk is that I'll need to make up a metric for you that may haunt you throughout the next year... ;)
[09:26:49] I'd be in favour of picking "Number of Debian Linux systems running on EOL releases" and "Number of Spicerack cookbooks automating maintenance tasks"; while this skips the whole observability work, I think it still covers the team's scope pretty well?
[09:28:15] observability has a separate KD in the plan anyway
[09:28:20] yeah, I think it does
[09:28:34] the only thing I dislike about the first metric (which I came up with myself) is that it doesn't really indicate any progress
[09:28:38] it's not very inspiring that way
[09:28:49] it shows how we're "sustaining" basic maintenance, i.e. keeping things up to date
[09:29:07] but it would be nice to have an indicator of some core work progress as well
[09:29:26] people already see much of our work as basic "maintenance", and we have trouble showing how we're evolving things, making things better
[09:29:31] this metric would not help with that
[09:29:34] see what I mean?
[09:29:44] the other one does indicate that
[09:30:09] even if we could have a metric indicating we're possibly getting _better_ (faster) at upgrading distribution versions or something
[09:30:13] that would be nice
[09:31:03] aggregated runtime of all debian systems since their debian release? ;)
[09:31:13] I see what you mean. but OTOH some tasks are inherently "uphill battles" with new distros coming in all the time; you mentioned some team has a metric on Node 10, which isn't much different in scope, is it?
[09:31:14] and then need to factor out the number of systems hehe
[09:31:30] i don't know if they will have a metric about it
[09:31:36] I have another (cross-SRE) metric which is this:
[09:31:48] Percentage of Python LOC able to run on Python 3 under operations git repositories
[09:31:51] still in draft
[09:32:15] but yeah
[09:32:23] with debian releases not being frequent enough it is tricky
[09:33:42] there's also the issue that the changes are not visible that often; if we had a metric like "% of systems not running Debian oldstable", that number would be at ~15% now, but would jump to 99% in 1-2 months
[09:34:10] so, that's in fact probably not a good metric either way
[09:34:23] agreed
[09:35:47] the major benefit of tracking OS versions is that it maps reality more precisely, as it touches all our services and reflects a major chunk of our work in SRE. if we were just after chasing metrics we could simply write cookbooks and not actually use them :-)
[09:35:57] yup
[09:36:02] I'll give it some more thought to see if there's something better
[09:38:34] maybe something like
[09:38:56] "we should upgrade all systems to a new debian release within X months of it being available"
[09:39:07] and then tracking how we're doing relative to that target
[09:39:14] so if we upgrade many systems very fast in the first month
[09:39:25] then that metric goes positive
[09:39:31] but yeah still tricky...
[09:39:40] is something like that in the OS upgrade policy?
[09:39:44] * mark still has to read it
[09:41:00] i think something referring to that document is probably a nice idea
[09:41:16] the OS upgrade policy essentially states that we only use two Debian releases at a time and don't max out the five years of lifetime
[09:41:50] as this leads to issues like missing deadlines in the end (labstore1003 is currently still running an EOLed OS)
[09:42:04] yes
[09:42:08] and minimises overhead as we only need to build packages like puppet for two distros instead of three
[09:42:19] so maybe some metric relative to that 4 year lifetime
[09:42:24] but yeah, still difficult to get away from the big jumps
[09:43:35] or you could do something like "over a period of X months after a new debian release, we should be upgrading all servers, and assuming a certain distribution of upgrades, we can tell whether we're ahead or behind"
[09:43:38] but it gets complicated fast ;)
[09:44:14] distribution being statistical distribution there
[09:44:41] assuming it's not a uniform distribution ;)
[09:45:06] yeah, I doubt that's good slide material :-)
[09:45:32] mark: for "observability as a service" I was thinking something like "number of notifications (irc, page, email) per quarter received by sre that were for other teams", to be reworded, but you get the idea. I'm thinking at least wmcs, analytics, performance all have icinga alerts we get notified for but shouldn't really
[09:47:18] godog: why shouldn't we get notified for them?
[09:47:56] at least for alerts that indicate service unavailability
[09:47:59] (maybe some don't)
[09:48:18] moritzm: oh i dunno
[09:48:23] could make a nice graph out of that
[09:48:51] showing an expected distribution of upgrades after a new debian release (a distribution line) and how many we have actually done (bar graph)
[09:49:03] damnit now I'm starting to geek out on metrics
[09:50:04] mark: yeah general service unavailability for sure, I'm thinking of anything from wmcs host-down alerts that currently page, to performance team grafana alerts that show up in icinga
[09:50:24] godog: wmcs host downs mean user visible downtime, no?
[09:50:41] VMs down and what not
[09:50:45] why wouldn't we need to receive those pages?
[09:51:05] yes it does, "we" meaning sre or wmcs?
[09:51:15] SRE
[09:51:59] I was under the impression wmcs should/would receive those, maybe I'm mistaken
[09:52:08] they should, and we should too I think?
[09:52:15] anyway
[09:52:26] if we wanted to change that, that's something we should discuss, e.g. at the summit
[09:52:38] until then maybe not something we should put in a metric
[09:53:00] other than that I don't think it's an awesome metric for indicating progress on the core work of the observability team
[09:53:22] you could simply go in puppet and ensure you receive no pages whatsoever anymore for anything possibly owned by another team
[09:53:27] and you would do awesome on that metric
[09:53:29] but is that real progress?
[09:53:30] idk :)
[09:54:22] I think we're assuming good faith, but I see what you mean with it not being an awesome measure of progress yeah :)
[09:54:36] sure, assuming good faith is needed, you can game most metrics
[09:54:37] but yeah
[09:54:51] one other issue with making a metric for the "quickness of rollouts" is that it's largely out of the hands of SRE IF: per the new policy, within the first year the rollout is at the discretion of the service owners, and after a year we coordinate the stragglers. we make sure we don't run anything unsupported/insecure, but it's not for us to control when e.g. Mediawiki servers are migrated etc.
[09:55:11] moritzm: yeah
[09:55:17] there's always some of that, parts you can't control
[09:55:26] we should be picking metrics we hope to have reasonable control over
[09:55:31] and where we don't we can communicate why
[09:55:54] moritzm: i guess to handle the case of the first year, you could define a statistical distribution that doesn't expect many upgrades in that period
[09:56:04] and where they do happen that just increases Awesomeness on that metric ;)
[09:58:46] also, the metrics are not gonna be really (sub) team specific in the end
[09:59:09] they will be tracked for the entire organization against an outcome ("Sustaining and evolving Wikimedia's essential technical infrastructure")
[09:59:26] so not really just infra foundations being responsible for it
[09:59:35] but it's true, I am now asking each sub team to come up with a metric relevant to their core work ;)
[10:12:04] maybe we should actually settle on only a single metric, the number of cookbooks; the more I think about it, I don't think OS releases make for a good metric: "% of systems running an unsupported OS" has a clear rationale and avoids the issue of spikes caused by release dates, but lacks the "show things improve" factor, while measuring the statistical distribution of updates before they are EOL doesn't seem that
[10:12:06] meaningful either (migrating a single mailman server or things like dbstore1002 (which was a Q goal by itself) can easily be more complex than upgrading e.g. dozens of Kubernetes workers)
[10:12:45] that is true
[10:12:50] or spicerack and the py2/py3 migration
[10:13:01] which is also easy to explain
[10:13:05] then again
[10:13:19] the fact that large clusters become easier is ALSO because that is getting increasingly automated ;)
[10:13:43] and we haven't really thought about OS distributions in the deployment pipeline (kubernetes) either
[10:13:54] which brings us back to the spicerack metric :-)
[10:14:02] yes
[10:14:10] so spicerack is covered under another outcome too
[10:14:22] the "Address Infrastructure Gaps" one
[10:14:26] which isn't really TEC1 like
[10:14:39] it's always hard to know where to draw the lines...
[12:06:04] * paravoid waves
[12:06:15] I just arrived back home
[12:13:34] just in time for metrics faidon! :-)
[12:13:43] what a coincidence
[12:13:57] * paravoid grumbles
[13:08:18] we can make a metric around the number of goal metrics that we feel connect to the reality of our day-to-day work
[13:08:37] and set a goal to improve that 50% year-over-year
[13:13:04] * godog chuckles
[13:15:06] i would honestly love some more constructive proposals
[13:18:58] yeah, it is a difficult one
[13:22:49] so what was the latest on the distro metric?
[13:22:51] moritzm: ^
[13:23:08] I like having something to track the implementation of our newfound policy
[13:27:25] i wonder if/how we could graph data of previous os upgrades over time
[13:27:34] to see what an upgrade curve/distribution has looked like in the past
[13:27:52] I don't think we have that data anymore...
[13:27:53] then you could put that in your doc, and define the metric relative to that
[13:27:55] yeah
[13:28:12] this was one of the puppetdb/grafana ideas
[13:28:17] we might be able to reconstruct it from git history of the netboot files?
[13:28:42] nah, there's the default
[13:31:02] how about "% of servers running oldoldstable"
[13:31:25] with an aim for this to be 0, but we could also have 1% as the target or something
[13:31:39] it won't be linear, it'll jump when new distributions get released
[13:32:55] we discussed that above
[13:33:08] that's the problem, which is why i proposed doing it relative to an expected upgrade curve
[13:33:27] and not just jumping, but jumping what, once a year or less?
[13:33:32] so that's not great for an annual metric ;)
[13:33:38] especially when you track it each quarter potentially
[13:34:29] or maybe rather % of hosts not following our 4 yo lifetime in the OS policy?
[13:34:44] that jumps as well
[13:36:18] yeah, everything based on OS releases will jump, except maybe the original "Number of Debian Linux systems running on EOL releases" proposal
[13:36:28] no, what I proposed does not jump
[13:38:02] I think having a progress measure for upgrades and tracking that is a good idea
[13:38:07] but it's hard to calculate
[13:38:35] and then easy to miss e.g. a scheduled upgrade of 400 mw* :)
[13:39:07] (or kube* :)
[13:39:10] yeah that sort of thing would cause spikes
[13:39:35] could we do something with "distribution age" perhaps?
[13:39:39] how about
[13:39:45] now - (release date of $distro)
[13:39:49] modulo identical puppet configuration
[13:40:01] so everything that has the exact same puppet config counts for 1
[13:40:14] paravoid: i proposed that before too
[13:40:23] see backlog ;)
[13:40:25] but that still jumps
[13:40:26] sorry
[13:40:33] why would it jump?
[13:41:16] do you mean calculating the average time of the new release you're upgrading to?
[13:41:26] average age
[13:41:43] yeah, average or median or maximum or something :)
[13:41:49] i guess maybe that could work
[13:42:01] so when we do an upgrade, we add in the age of the distribution at that point in time
[13:42:33] relative to the time of the release of that distro?
[13:42:40] yeah
[13:42:40] yes, the age of the debian release
[13:42:46] major release
[13:42:47] of the first point release specifically :)
[13:42:54] hehe yeah
[13:43:05] seems relatively easy to implement
[13:43:08] can even be done automatically
[13:43:40] "99p of software distribution age is 4 years"
[13:43:41] the first time $tool sees a new OS release for a server it looks at the age of that release and adds it in
[13:44:34] yeah, that could work
[13:44:37] +1
[13:46:21] the OS policy (while not officially acked, I've asked Joel to add it to the agenda for Dublin) only applies for stretch onwards, so current stats (if collected by 1 July) will not follow that, but that's probably just fine as it will show later how the policy helps fix things
[13:48:05] officially acked by whom? :)
[13:53:49] that's the question :-) but if no one objects during the offsite, I'd consider it officially acked by SRE at large
[14:05:42] *stamp*
[14:06:51] "99p of software distribution age (years)" with a target of < 4 is one option
[14:06:56] the other option is the inverse of that
[14:07:28] "Servers in production running a > 4 year old software distribution (%)", with a target of < 1% or whatever
[14:07:44] that's a bit easier to grasp I think
[14:09:07] software distribution is a little vague, maybe better "Linux distribution", "Debian release" or "operating system"?
[14:09:23] fair
[14:09:39] sorry to disturb, but is a Debian release going to happen soon? (if you have been following the current state)
[14:09:42] and do we include the OSes found in containers?
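A minimal sketch of how the two candidate formulations above ("99p of software distribution age" and "% of servers running a > 4 year old release") could be computed. The release-date table, the hard-coded fleet snapshot and helper names such as pct_older_than are illustrative assumptions for the example; in practice the host-to-release mapping would come from inventory data such as PuppetDB facts, and the actual tooling is not shown here.

#!/usr/bin/env python3
"""Illustrative sketch of the two candidate fleet metrics discussed above."""

import math
from datetime import date

# Release dates of the Debian major versions in use (the first point release
# could be used instead, as suggested in the discussion).
RELEASE_DATES = {
    "jessie": date(2015, 4, 25),
    "stretch": date(2017, 6, 17),
}

# Example fleet snapshot: hostname -> Debian release it runs (illustrative).
FLEET = {
    "mw1001": "stretch",
    "kubernetes1001": "stretch",
    "labstore1003": "jessie",
    "graphite2001": "jessie",
}


def distribution_ages(fleet, today):
    """Age in years of each host's Debian release, as of `today`."""
    return [
        (today - RELEASE_DATES[release]).days / 365.25
        for release in fleet.values()
    ]


def percentile(values, pct):
    """Nearest-rank percentile; good enough for a dashboard number."""
    ordered = sorted(values)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]


def pct_older_than(fleet, years, today):
    """Percentage of hosts whose release is older than `years` years."""
    ages = distribution_ages(fleet, today)
    return 100 * sum(1 for age in ages if age > years) / len(ages)


if __name__ == "__main__":
    as_of = date(2019, 7, 1)
    ages = distribution_ages(FLEET, as_of)
    print(f"99p distribution age: {percentile(ages, 99):.1f} years")
    print(f"servers on a >4 year old release: {pct_older_than(FLEET, 4, as_of):.0f}%")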
[14:09:50] jynus: june or thereabouts
[14:09:57] paravoid: thanks, that is helpful!
[14:09:58] jynus: maybe six weeks I'd guess
[14:10:10] thanks
[14:10:27] some blockers around Secure Boot currently being worked on, but all on track
[14:10:32] moritzm: "servers", so I guess not
[14:10:43] we can track that next year :P
[14:10:48] ack :-)
[14:11:25] I prefer the second variant (with the > 4 yo formulation)
[14:11:37] with all the jessies it's 12% now
[14:12:05] should I aim for a 0.5% medium-term and 2% by next FY?
[14:12:34] that would be <= 27 servers with jessie by end of July 2020
[14:12:41] down from 173
[14:13:09] when is the first target date for these new metrics?
[14:13:25] starting with 1st of July or now?
[14:13:34] the former
[14:14:13] yeah, 2% sounds fine
[14:14:34] alright
[14:15:00] and 0.5% (i.e. ~7 by current counts) for the medium-term
[14:15:23] ack, there'll always be the usual labstore1003 type of things
[14:16:20] yeah
[14:17:25] alright
[14:17:26] done :)
[14:18:04] cool! there was also the other discussion about making the Py 2->3 migration a metric, did you see that?
[14:18:17] I saw a metric, didn't follow the whole discussion
[14:18:22] sorry, it's been a rough few days
[14:18:46] happened in #wikimedia-sre-private, but it's not obvious to me if we still need it at this point or not, as several other metrics were collected in parallel
[14:18:47] I'm jetlagged, tired and have a fever :)
[14:19:15] ouch, every single one of those is incompatible with metrics discussions :-)
[14:22:28] (moving the py3 convo there again)
[14:22:38] moritzm: unrelated -- what's with an-tool1005?
[14:22:45] luca said you were running some tests or something?
[14:24:52] the d-i installer for buster is currently still affected by a bug which makes the addition of an external repo fail (due to changes in gnupg). I submitted an MR to salsa and had been using this host for testing, but there has been radio silence in the BTS. if that host is now needed or if it messes with Netbox states, I can complete the installation with my manual workaround
[14:26:24] moritzm: that reminds me, graphite200[12] are free again now for buster tests (or decom, when done with tests)
[14:27:20] moritzm: it messes with netbox states, but that's because it's set to "up" in Ganeti
[14:27:40] if you shut it down, netbox will stop complaining
[14:28:09] it's half-installed and was stuck in d-i for SSH debug access
[14:28:11] basically the netbox report alerts us that there is a VM that is running but that puppet doesn't recognize
[14:28:25] I'll shut it down in Ganeti and will use graphite2001 for further tests
[14:28:36] heh
[14:28:43] it's ok, it's good to find corner cases
[14:29:03] the point is to make the netbox stuff match our workflows, not the other way around :)
[14:29:27] ack, nice
[14:30:35] moritzm: this is for https://netbox.wikimedia.org/extras/reports/puppetdb.PuppetDB/ btw
[14:31:01] chaomodus: on that note, wmf7622 seems to warn, apparently the report tries to find a Status: Failed host in PuppetDB
[14:39:55] an-tool1005 is shut down now
[15:16:47] paravoid: thanks
[15:46:40] * volans|off late for the metric party
[15:47:42] and what if, instead of the distros, we measure some sort of average time to apply OS security fixes?
it basically implies keeping the OS up-to-date, given that once they're too old they don't get sec updates anymore
[15:47:52] cc moritzm ^^^
[15:57:39] chaomodus: what's the rationale for https://gerrit.wikimedia.org/r/#/c/509431/ ? I can see both pros and cons
[16:02:11] we're good with the current metric I think
[16:17:24] volans|off: Failed state is meant to be offline (eg not in puppet) if I'm understanding correctly. I had literally not encountered it until now though
[16:17:42] I may be inferring too much though
[16:19:12] Oh yah according to the lifecycle it is meant to be removed from production before receiving that state
[16:19:18] so I think my inference is correct
[16:19:59] removed from production means depooled and possibly powered off if needed
[16:20:03] it will not be removed from puppet
[16:20:08] ah
[16:20:09] hm
[16:20:26] it could be one of those states that is just ignored, not alerted on
[16:20:27] although we had an auto-remove from puppetdb for hosts not reporting after X days, I don't recall if we removed it
[16:21:40] maybe it shouldn't imply a disposition at all in these reports
[16:22:31] I think it should be reported as it was, not ignored
[16:22:51] does that make sense though?
[16:24:02] that's implying that a Failed host must be in PuppetDB
[16:24:34] although as it currently is, it implies that a Failed host must not be in PuppetDB
[16:24:51] both seem incorrect to me
[16:26:05] tbh it makes more sense to me if it alerts on failed state if the change to failed is older than N days
[16:26:19] (failed state being in puppetdb) because that's like it got lost in the process
[16:27:01] we had hosts in failed state for months due to issues with the vendor
[16:27:11] Mh
[16:27:27] So should the report alert if they are or are not in puppet?
[16:28:05] since https://gerrit.wikimedia.org/r/c/operations/puppet/+/346110 has been merged as part of https://phabricator.wikimedia.org/T159163
[16:28:17] I think so, all failed hosts should be in puppetdb, I don't see why not
[16:28:22] Okay
[16:34:38] i think the point of question here is that something seems to have reached failed state without being in puppet
[16:35:05] https://phabricator.wikimedia.org/T222922
[16:35:36] although of course there's no Spare->Failed
[16:38:52] pure spares are powered off, so hard to fail when shut down
[16:39:02] if they are commissioned they go through the staged state, etc...
[16:39:23] right, in this case it was marked failed because it would not start up
[16:39:55] not sure if we need a spare->failed, ask dcops :)
[16:40:04] hah okay
[16:40:39] failed has a meaning for things that were active, hence in prod
[16:41:06] if something not in prod fails, it could be considered as a blocker to go into prod
[16:41:18] multiple POVs available
[16:44:20] mmh
[16:44:37] bug in netbox, needs a failedfromspare state
[16:44:52] ;)
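A minimal sketch of the consistency check the discussion converges on: every host that NetBox marks as Failed should still be present in PuppetDB. The URLs, the API token handling and the short-name to FQDN mapping are assumptions for the example; the actual check is the puppetdb.PuppetDB NetBox report linked earlier, whose internals are not shown here.

#!/usr/bin/env python3
"""Illustrative cross-check of NetBox Failed hosts against PuppetDB."""

import requests

NETBOX_API = "https://netbox.example.org/api"       # assumed URL
NETBOX_TOKEN = "..."                                # read from config in reality
PUPPETDB_API = "https://puppetdb.example.org:8080"  # assumed URL
SITE_DOMAIN = "example.org"                         # assumed certname domain


def netbox_failed_devices():
    """Yield names of devices in the Failed status, following API pagination."""
    url = f"{NETBOX_API}/dcim/devices/?status=failed"
    headers = {"Authorization": f"Token {NETBOX_TOKEN}"}
    while url:
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()
        data = resp.json()
        for device in data["results"]:
            yield device["name"]
        url = data["next"]


def puppetdb_certnames():
    """Return all node certnames currently known to PuppetDB."""
    resp = requests.get(f"{PUPPETDB_API}/pdb/query/v4/nodes")
    resp.raise_for_status()
    return {node["certname"] for node in resp.json()}


def main():
    known = puppetdb_certnames()
    for name in netbox_failed_devices():
        # Assumes device names map to certnames by appending the site domain.
        if f"{name}.{SITE_DOMAIN}" not in known:
            print(f"WARNING: {name} is Failed in NetBox but missing from PuppetDB")


if __name__ == "__main__":
    main()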