[17:18:14] herron, shdubsh: hey -- was talking to godog intermittently today, about our Q1 goals
[17:18:29] there's some stuff on the pad right now, do you guys think these make sense?
[17:19:00] looked good last time I checked, having another look now
[17:19:28] any other commitments that you guys see for Q1?
[17:19:53] and safe to assume that all three of you would be working on both of these goals?
[17:20:08] and heads-up I'd like to have a chat sometime in the next two weeks to talk about our goals a little bit
[17:20:13] was asking godog some difficult questions today :)
[17:23:51] objective at line 217 seems doable
[17:24:02] yes looks good to me! huge fan in particular of lines 206-210
[17:25:28] aye, old friend statsd is back!
[17:26:43] wow nice
[17:27:09] I'm most interested in the improved alerting offering, but am game to get my hands dirty on graphite sunset as well
[17:27:10] how would y'all feel about also adding "acknowledge alerts from IRC"? :)
[17:28:26] heh, ATM "would be awesome, not sure how to implement sanely"
[17:29:06] I just mention it because I saw "Downtime hosts from IRC", although that is somewhat easier
[17:29:42] reminds me of the bitlbee twitter gateway giving messages short ids so you could @reply
[17:29:51] good times
[17:30:18] objective at 196 has a lot of moving parts and looks the most daunting to me. curious to know how others feel though
[17:30:19] I feel like the ack idea is good, but might be premature since IMO it should happen at an aggregation layer so more than just icinga alerts can be acked
[17:30:31] that's completely fair herron
[17:36:51] ok I am logging off, have a good one!
[17:37:25] yeah I was just wondering about the ack thing myself
[17:37:33] also saw XioNoX's ping on #-operations, felt relevant :)
[17:37:57] shdubsh: are you talking about the entire alerting goal or some sub-objective?
[17:39:31] dunno if it's worth a goal, but making it easy for icinga alerts to open tasks might be a low-hanging fruit to reduce alerting noise
[17:39:36] the entire alerting goal.
[17:40:03] XioNoX: +1
[17:40:31] (although there's still something of an 🐘 there around a possible new aggregation layer)
[17:40:57] shdubsh: what looks daunting to you and why?
[17:41:50] initial gut reaction is the EOQ output being two documents, two pieces of software, and a configuration refactor.
[17:43:31] but that's just my estimation. I'd like to hear from others.
[17:45:18] what do you mean by two documents?
[17:46:00] and two pieces of software?
[17:46:41] documents from lines 199 and 204. software from lines 209 and 210.
[17:47:31] 204 isn't so much a document as a task/list
[17:47:45] 210 I don't know if it's a software fix, could be an alert tweak
[17:47:53] 209... yeah, requires some coding
[17:48:57] 210 might just be my approach then. I'd be looking to some sort of report aggregator and alerting against that rather than on a per-host basis
[17:50:58] I *think* one approach discussed for 210 was to ship puppet run data to prometheus and set up an alert on the overall trend there? paraphrasing, but something like that
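(For context on the 17:50:58 idea, a minimal sketch of what "ship puppet run data to prometheus" could look like: a small script that reads puppet's last_run_summary.yaml and writes metrics for the node_exporter textfile collector. The file paths and metric names below are illustrative assumptions, not the team's actual setup.)

```python
#!/usr/bin/env python3
"""Sketch: expose puppet agent run results to Prometheus via the
node_exporter textfile collector. Paths and metric names are
illustrative assumptions, not the actual production setup."""

import os
import time

import yaml  # PyYAML

SUMMARY_PATH = "/var/lib/puppet/state/last_run_summary.yaml"  # assumed location
TEXTFILE_PATH = "/var/lib/prometheus/node.d/puppet_agent.prom"  # assumed textfile dir


def main() -> None:
    with open(SUMMARY_PATH) as f:
        summary = yaml.safe_load(f) or {}

    failed_resources = summary.get("resources", {}).get("failed", 0)
    last_run = summary.get("time", {}).get("last_run", 0)

    lines = [
        "# HELP puppet_agent_failed_resources Failed resources in the last puppet run.",
        "# TYPE puppet_agent_failed_resources gauge",
        f"puppet_agent_failed_resources {failed_resources}",
        "# HELP puppet_agent_last_run_seconds Unix timestamp of the last puppet run.",
        "# TYPE puppet_agent_last_run_seconds gauge",
        f"puppet_agent_last_run_seconds {last_run}",
    ]

    # Write atomically so node_exporter never reads a half-written file.
    tmp_path = TEXTFILE_PATH + ".tmp"
    with open(tmp_path, "w") as f:
        f.write("\n".join(lines) + "\n")
    os.replace(tmp_path, TEXTFILE_PATH)


if __name__ == "__main__":
    main()
```

(Run from cron or a puppet postrun hook, each host would expose its own run status and Prometheus could then aggregate across the fleet.)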
[17:51:03] it's a valid idea, but I'd also like to understand why we're getting so many puppet alerts
[17:51:20] maybe good to clarify the scope on that, since yeah, it could be solved many diff ways
[17:52:07] jbond.42 had a plausible theory about the alerts that come and go
[17:53:03] there's the ones that come and go (and two different flavors of them; the "zero resources" ones started more recently), and then there's the "oops I pushed something bad that CI/PCC didn't check and now it's erroring on every host" (which happens let's say about 1x/month?)
[17:53:21] but to address the "thundering herd" effect of a broken merge would require some kind of aggregator
[17:55:53] yeah, or a few lines of code to the bot :)
[17:55:53] one near-term option on puppet is to alert once if >N hosts had puppet agent failures in a recent time frame according to prometheus (and turn down the current individual host icinga alerts, but maybe leave them in a dashboard?)
[17:55:59] or that, yeah
[17:56:53] herron: "don't announce to IRC; open ticket if they remain in failed state for more than N hours" (possibly that second part is best handled by another rule)
[17:57:08] I don't think we need to hash it out now, but figuring out a way to cut down on the noise in the next 3 months sounds like a worthy goal
[17:57:34] FWIW I don't think it would be that hard to get the puppet run data into prometheus; there are some simple ways to do it
[17:58:01] should we qualify/scope line 210 by appending something like "using current tooling", or do you want to leave that open-ended as well?
[17:58:14] paravoid: ^
[17:58:27] open-ended in the goal language sounds more flexible, but I'm game either way
[17:58:31] kk
[17:59:41] fyi, "Check systemd state" is the #2 most noisy alert after the "puppet last run" one
[18:01:52] my hunch is that's because it's much more generic, not because there are nearly as many false positives?
[18:02:33] in any case, a bit daunting might not be a bad thing. The worst that could happen is "darn, missed that one"
[18:02:40] well
[18:02:42] yes and no :)
[18:02:55] I'd like to get a bit better at our goals in FY19-20
[18:03:09] as I was telling go.dog earlier
[18:03:21] lines 219-221 about statsd
[18:03:56] are really very similar (almost identical in scope, but not size) to https://phabricator.wikimedia.org/T205870
[18:04:01] which was a Q2 goal
[18:04:15] which is still not complete -- and actually pretty far from complete
[18:04:26] so that is a bit disappointing
[18:04:55] also of note is that we made the varnish bits out of that (a tiny portion) into a Q4 goal, and didn't make that either
[18:34:46] shdubsh: I'll try and update the task with my musings on possible issues and fixes re the puppet errors
[18:35:25] thanks!
[18:36:23] s/try/will/ :)
[22:03:07] I'm late to the goal discussion... but fwiw I agree we should also put some effort into fixing the random puppet errors in the first place. For the random ones, look into them more and try to fix them. I did a quick debug a while ago in https://phabricator.wikimedia.org/T201247#4647722
[22:03:22] we have too many of those to be acceptable IMHO.
[22:04:01] For the bad-patch ones, having an automatic puppet compiler run on CI might prevent a bunch of them (not all, as some fail on the agent side)
[22:04:07] my 2 cents :)
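(And a hedged sketch of the 17:55:53 near-term option: count how many hosts currently report puppet failures by querying the Prometheus HTTP API. It reuses the illustrative metric name from the exporter sketch above; the Prometheus URL, time window, and threshold are all assumptions. In practice this logic would more naturally live in a Prometheus alerting rule, but the query would be the same either way.)

```python
#!/usr/bin/env python3
"""Sketch: report once if more than N hosts had puppet agent failures
recently, based on the illustrative puppet_agent_* metrics above.
Prometheus URL, window, and threshold are assumptions."""

import sys

import requests

PROMETHEUS_URL = "http://prometheus.example.org:9090"  # assumed
THRESHOLD = 10  # assumed: the "N" in "alert once if >N hosts"

# Hosts reporting at least one failed resource at any point in the last 30 minutes.
QUERY = "count(max_over_time(puppet_agent_failed_resources[30m]) > 0)"


def failing_hosts() -> int:
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # count() yields a single sample, or an empty result if nothing matches.
    return int(float(result[0]["value"][1])) if result else 0


def main() -> None:
    n = failing_hosts()
    if n > THRESHOLD:
        print(f"CRITICAL: puppet agent failing on {n} hosts (threshold {THRESHOLD})")
        sys.exit(2)
    print(f"OK: puppet agent failing on {n} hosts")
    sys.exit(0)


if __name__ == "__main__":
    main()
```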