[12:49:56] hi team
[12:50:08] where are we with goal proposals?
[12:50:23] let's start filling this pad (line 96 onwards) with some draft language
[12:50:26] can I help?
[13:10:23] I'm looking at the metrics goals ATM btw
[13:11:48] thx
[13:14:51] an aspirational outcome I was thinking of for that is "migrate mediawiki stats to prometheus", although Timo's comment in https://gerrit.wikimedia.org/r/c/operations/puppet/+/481110#message-9ebee6cef9d99c6b3b4b1c310d251a7b52d96e54 makes sense, as in it'd be a performance/cpt/us shared goal
[13:16:38] I'm pretty sure the design for that would be non-trivial, right? just because it involves sharing any state at all between PHP worker processes
[13:18:16] yeah that's one consideration for sure
[13:19:02] another one is that even with the current statsd-based design, which sidesteps that issue, there's likely an overhaul needed
[13:19:10] of the mediawiki stats that is
[13:19:34] overhaul in terms of, like... I guess one could say the schema of the metrics themselves?
[13:20:45] yeah I think so
[13:27:36] I want to have some goal for external monitoring, but I'm not at all sure what it should be for this quarter. this Q we made the Catchpoint decision but haven't done the design work we wanted to do
[13:30:24] we also did the basic icinga meta-monitoring (out of goal technically) ;)
[13:31:32] nice use of 'we' volans ;)
[13:32:26] reviews are important too ;)
[13:34:28] btw paravoid I've 'backed up' our Catchpoint configuration (all the defined probes, and the selenium login+edit script) as well as was easy to do
[13:34:36] cool
[13:34:40] I will find a home for that stuff on officewiki today
[13:34:43] so can I tell them we won't renew?
[13:34:47] LGTM
[13:36:41] anyone have any pointers on the work that's been done on the database automation goal so far? the eventual idea is to have that PHP config file that drives database server mappings live in conftool instead, right?
[13:54:00] cdanis: https://phabricator.wikimedia.org/T197126
[13:54:34] including a couple of CRs pending my review
[13:55:10] hah :) thanks!
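A short aside on the database automation goal just discussed: the PHP config file in question holds, roughly, per-section lists of database servers and their load weights, and the goal is for that mapping to live in conftool instead (T197126 above tracks the actual work). Purely as an illustration of the kind of state involved, a minimal sketch, with made-up host names and a hypothetical layout rather than the real conftool schema:

    s1:                # a wiki section (hypothetical)
      master: db1100   # made-up hostname
      replicas:
        db1101: 200    # hostname: load weight
        db1102: 150

Presumably the win is that pooling/depooling and weight changes become runtime state changes in conftool/etcd rather than config deploys.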
[14:19:34] paravoid: FYI, there's now some goal language for the auth* changes in the pad (we've also shared the gdoc with some more fine-grained thoughts/plans with you)
[14:20:08] ack thanks
[14:24:53] paravoid: spoke a bit with Andrew and updated T217359. I think language for the Q4 kafka goal depends on a) some consensus that this approach is sensible and, if so, b) timing associated with budget for those hosts
[14:26:39] re: budget
[14:26:51] elukey is right that our refresh cycle is 5 years, not end of warranty (which is at 3 years)
[14:26:56] so these hosts are not slated for a replacement yet
[14:26:57] *but*
[14:27:14] this year's budget has two entries in Q4 for:
[14:27:18] eqiad: kafka main cluster expansion
[14:27:21] and codfw: kafka main cluster expansion
[14:27:37] so there is budget to *grow*, which usually means add servers
[14:32:24] gotcha, would upgrading the existing server specs (adding ram, etc.) fall into "grow" as well? and does warranty status have any bearing on that?
[14:42:15] what happens to the warranty once the 3 years is up? does it need to be refreshed?
[14:44:17] after three years we no longer get spare parts covered by warranty, so if the hw breaks it gets replaced, e.g. like https://phabricator.wikimedia.org/T215415#5032858
[14:44:40] or if it's something bigger (broken mainboard or so), sometimes the whole server is decommissioned
[14:44:48] ahh I see
[14:45:00] so it basically becomes our problem completely
[14:52:14] so I guess: what are the deliverables for the kafka adoption?
[15:10:21] we typically don't do upgrades of server specs, that's messy
[15:10:28] especially on out-of-warranty hardware
[15:12:43] so it simply gets replaced, that makes sense
[15:39:10] ok, and mixing hardware specs within a kafka cluster seems messy as well. which brings me back to the idea of performing the expansion and refresh at the same time, maybe not in Q4?
[15:40:29] eh
[15:40:37] the first question is: do we need to expand?
[15:40:45] and if so, to what?
[15:55:50] maybe the goal (or at least the first step) is 'capacity planning'...?
[15:55:57] afaict and from talking to otto there isn't urgency to expand in Q4 specifically, but he's expecting traffic to kafka-main to increase over the next year with the modern event platform program. his suggestion as well was a hw refresh/upgrade when possible, either Q4 or next FY
[15:56:25] yeah that makes sense, and documentation/knowledge transfer. clarify the scope of what SRE is supporting?
[15:57:03] the Q4 $ will go away if we don't spend it :)
[15:57:09] we can make a new budget request next FY
[15:57:10] yeah, "handoff to SRE"
[15:57:16] but it'd be a missed opportunity to not use it
[15:57:38] why is a cluster with mixed generations of hardware considered a problem?
[15:57:47] lol thanks to Google for the double-paste
[15:58:02] (unrelated, sorry)
[15:58:05] it's more a cluster of mixed specs: new hosts with 64 or 128G ram and old hosts with 32
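On the statsd-to-Prometheus thread from earlier: one commonly used bridge while the metric schema gets reworked is Prometheus's statsd_exporter, which keeps MediaWiki emitting statsd and translates the keys into Prometheus metrics via a mapping file, so only the exporter holds aggregated state and the shared-state-between-PHP-workers issue raised at 13:16 is sidestepped. A minimal sketch of such a mapping, assuming hypothetical statsd key names rather than the actual MediaWiki metric schema:

    mappings:
      - match: "MediaWiki.edit.failures.*"      # hypothetical statsd key
        name: "mediawiki_edit_failures_total"   # resulting Prometheus metric
        labels:
          failure_type: "$1"                    # glob capture exposed as a label

Whether to bridge like this or instrument MediaWiki natively for Prometheus is part of the non-trivial design work mentioned at 13:16.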