[17:18:14] herron, shdubsh: hey -- was talking to godog intermittently today, about our Q1 goals
[17:18:29] there's some stuff on the pad right now, do you guys think these make sense?
[17:19:00] looked good last time I checked, having another look now
[17:19:28] any other commitments that you guys see for Q1?
[17:19:53] and safe to assume that all three of you would be working on both of these goals?
[17:20:08] and heads-up I'd like to have a chat sometime in the next two weeks to talk about our goals a little bit
[17:20:13] was asking godog some difficult questions today :)
[17:23:51] objective at line 217 seems doable
[17:24:02] yes looks good to me! huge fan in particular of lines 206-210
[17:25:28] aye, old friend statsd is back!
[17:26:43] wow nice
[17:27:09] I'm most interested in the improved alerting offering, but am game to get my hands dirty on graphite sunset as well
[17:27:10] how would y'all feel about also adding "acknowledge alerts from IRC"? :)
[17:28:26] heh, ATM "would be awesome, not sure how to implement sanely"
[17:29:06] I just mention it because I saw "Downtime hosts from IRC", although that is somewhat easier
[17:29:42] reminds me of the bitlbee twitter gateway giving messages short ids so you could @reply
[17:29:51] good times
[17:30:18] objective at 196 has a lot of moving parts and looks the most daunting to me. curious to know how others feel though
[17:30:19] I feel like the ack idea is good, but might be premature since IMO it should happen at an aggregation layer so more than just icinga alerts can be acked
[17:30:31] that's completely fair herron
[17:36:51] ok I am logging off, have a good one!
[17:37:25] yeah I was just wondering about the ack thing myself
[17:37:33] also saw XioNoX's ping on #-operations, felt relevant :)
[17:37:57] shdubsh: are you talking about the entire alerting goal or some sub-objective?
[17:39:31] dunno if it's worth a goal, but making it easy for icinga alerts to open tasks might be a low-hanging fruit to reduce alerting noise
[17:39:36] the entire alerting goal.
[17:40:03] XioNoX: +1
[17:40:31] (although there's still something of an 🐘 there around a possible new aggregation layer)
[17:40:57] shdubsh: what looks daunting to you and why?
[17:41:50] initial gut reaction is the EOQ output being two documents, two pieces of software, and a configuration refactor.
[17:43:31] but that's just my estimation. I'd like to hear from others.
[17:45:18] what do you mean by two documents?
[17:46:00] and two pieces of software?
[17:46:41] documents from lines 199 and 204. software from lines 209 and 210.
[17:47:31] 204 isn't so much a document as a task/list
[17:47:45] 210 I don't know if it's a software fix, could be an alert tweak
[17:47:53] 209... yeah, requires some coding
[17:48:57] 210 might just be my approach then. I'd be looking to some sort of report aggregator and alerting against that rather than on a per-host basis
[17:50:58] I *think* one approach discussed for 210 was to ship puppet run data to prometheus and set up an alert on the overall trend there? paraphrasing, but something like that
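(For context on the 17:50:58 idea, a minimal sketch of what "ship puppet run data to prometheus" could look like: a small script that reads puppet's last_run_summary.yaml and writes metrics for the node_exporter textfile collector. The file paths and metric names below are illustrative assumptions, not the team's actual setup.)

```python
#!/usr/bin/env python3
"""Sketch: expose puppet agent run results to Prometheus via the
node_exporter textfile collector. Paths and metric names are
illustrative assumptions, not the actual production setup."""

import os
import time

import yaml  # PyYAML

SUMMARY_PATH = "/var/lib/puppet/state/last_run_summary.yaml"  # assumed location
TEXTFILE_PATH = "/var/lib/prometheus/node.d/puppet_agent.prom"  # assumed textfile dir


def main() -> None:
    with open(SUMMARY_PATH) as f:
        summary = yaml.safe_load(f) or {}

    failed_resources = summary.get("resources", {}).get("failed", 0)
    last_run = summary.get("time", {}).get("last_run", 0)

    lines = [
        "# HELP puppet_agent_failed_resources Failed resources in the last puppet run.",
        "# TYPE puppet_agent_failed_resources gauge",
        f"puppet_agent_failed_resources {failed_resources}",
        "# HELP puppet_agent_last_run_seconds Unix timestamp of the last puppet run.",
        "# TYPE puppet_agent_last_run_seconds gauge",
        f"puppet_agent_last_run_seconds {last_run}",
    ]

    # Write atomically so node_exporter never reads a half-written file.
    tmp_path = TEXTFILE_PATH + ".tmp"
    with open(tmp_path, "w") as f:
        f.write("\n".join(lines) + "\n")
    os.replace(tmp_path, TEXTFILE_PATH)


if __name__ == "__main__":
    main()
```

(Run from cron or a puppet postrun hook, each host would expose its own run status and Prometheus could then aggregate across the fleet.)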
[17:51:03] it's a valid idea, but I'd also like to understand why we're getting so many puppet alerts
[17:51:20] maybe good to clarify the scope on that, since yeah, it could be solved many diff ways
[17:52:07] jbond.42 had a plausible theory about the alerts that come and go
[17:53:03] there's the ones that come and go (and two different flavors of them; the "zero resources" ones started more recently), and then there's the "oops I pushed something bad that CI/PCC didn't check and now it's erroring on every host" (which happens let's say about 1x/month?)
[17:53:21] but to address the "thundering herd" effect of a broken merge would require some kind of aggregator
[17:55:53] yeah, or a few lines of code to the bot :)
[17:55:53] one near-term option on puppet is to alert once if >N hosts had puppet agent failures in a recent time frame according to prometheus (and turn down the current individual host icinga alerts, but maybe leave them in a dashboard?)
[17:55:59] or that, yeah
[17:56:53] herron: "don't announce to IRC; open ticket if they remain in failed state for more than N hours" (possibly that second part is best handled by another rule)
[17:57:08] I don't think we need to hash it out now, but figuring out a way to cut down on the noise in the next 3 months sounds like a worthy goal
[17:57:34] FWIW I don't think it would be that hard to get the puppet run data into prometheus; there are some simple ways to do it
[17:58:01] should we qualify/scope line 210 by appending something like "using current tooling", or do you want to leave that open-ended as well?
[17:58:14] paravoid: ^
[17:58:27] open-ended in the goal language sounds more flexible, but I'm game either way
[17:58:31] kk
[17:59:41] fyi, "Check systemd state" is the #2 most noisy alert after the "puppet last run" one
[18:01:52] my hunch is that's because it's much more generic, not because there are nearly as many false positives?
[18:02:33] in any case, a bit daunting might not be a bad thing. The worst that could happen is "darn, missed that one"
[18:02:40] well
[18:02:42] yes and no :)
[18:02:55] I'd like to get a bit better at our goals in FY19-20
[18:03:09] as I was telling go.dog earlier
[18:03:21] lines 219-221 about statsd
[18:03:56] are really very similar (almost identical in scope, but not size) to https://phabricator.wikimedia.org/T205870
[18:04:01] which was a Q2 goal
[18:04:15] which is still not complete -- and actually pretty far from complete
[18:04:26] so that is a bit disappointing
[18:04:55] also of note is that we made the varnish bits out of that (a tiny portion) into a Q4 goal, and didn't make that either
[18:34:46] shdubsh: I'll try and update the task with my musings on possible issues and fixes re the puppet errors
[18:35:25] thanks!
[18:36:23] s/try/will/ :)
[22:03:07] I'm late to the goal discussion... but fwiw I agree we should also put some effort into fixing the random puppet errors in the first place. For the random ones, look into them more and try to fix them. I did a quick debug a while ago in https://phabricator.wikimedia.org/T201247#4647722
[22:03:22] we have too many of those to be acceptable IMHO.
[22:04:01] For the bad-patch ones, having an automatic puppet compiler run on CI might prevent a bunch of them (not all, as some fail on the agent side)
[22:04:07] my 2 cents :)
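(And a hedged sketch of the 17:55:53 near-term option: count how many hosts currently report puppet failures by querying the Prometheus HTTP API. It reuses the illustrative metric name from the exporter sketch above; the Prometheus URL, time window, and threshold are all assumptions. In practice this logic would more naturally live in a Prometheus alerting rule, but the query would be the same either way.)

```python
#!/usr/bin/env python3
"""Sketch: report once if more than N hosts had puppet agent failures
recently, based on the illustrative puppet_agent_* metrics above.
Prometheus URL, window, and threshold are assumptions."""

import sys

import requests

PROMETHEUS_URL = "http://prometheus.example.org:9090"  # assumed
THRESHOLD = 10  # assumed: the "N" in "alert once if >N hosts"

# Hosts reporting at least one failed resource at any point in the last 30 minutes.
QUERY = "count(max_over_time(puppet_agent_failed_resources[30m]) > 0)"


def failing_hosts() -> int:
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # count() yields a single sample, or an empty result if nothing matches.
    return int(float(result[0]["value"][1])) if result else 0


def main() -> None:
    n = failing_hosts()
    if n > THRESHOLD:
        print(f"CRITICAL: puppet agent failing on {n} hosts (threshold {THRESHOLD})")
        sys.exit(2)
    print(f"OK: puppet agent failing on {n} hosts")
    sys.exit(0)


if __name__ == "__main__":
    main()
```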