[12:30:29] anyone know how to deal with the icinga error "Stale template error files present for '/srv/config-master/pybal/eqiad/thanos-query'"?
[12:30:33] https://cas-icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=puppetmaster1001&service=Confd+template+for+%2Fsrv%2Fconfig-master%2Fpybal%2Feqiad%2Fthanos-query
[12:32:47] I think I have an idea, I'll take a look since that's likely from my earlier setup of those services in eqiad
[12:33:11] thanks godog
[12:36:03] yeah the linked wikitech page seems accurate to me jbond42
[12:36:56] I'll look into the other confd error about compilation broken
[12:37:00] godog: oh sorry, i just hovered over it and thought it went to a generic https://wikitech.wikimedia.org/wiki/Monitoring page
[12:38:31] ah, I got to https://wikitech.wikimedia.org/wiki/Confd#Monitoring that's how I found the reference
[12:38:44] yes was user error, reading now
[12:39:07] possibly the ssh can be replaced with a cumin query I think
[12:41:52] which ssh?
[12:43:54] that error is generated when a confd template generates an invalid configuration (like pooling both sites for an active/passive service)
[12:44:06] and stays there after the issue that caused it gets fixed
[12:44:17] they can be cleaned up if the underlying issue has been cleared
[12:44:47] jbond42: the ssh mentioned in the wikitech docs above
[12:44:59] oh i see :D thanks
[12:46:25] jbond42: similar context: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/08-restore-ttl.py#25
[12:46:58] those are for the discovery dns, but the logic is similar IIRC
[12:47:36] ack thanks
[12:47:44] actually no I'm mistaken, the error can happen anywhere we're using confd::file, but most commonly on puppetmaster/pybal
[13:23:08] godog: picking on you as the EU representative of observability :) i want to add an alert based on prometheus metrics that only notifies myself and marostegui. is there a recommended way of doing this?
[13:23:33] it looks like grafana might be able to do this, but we'd need to have a graph that shows all db hosts, which isn't great
[13:23:57] (the alert relates to replication lag)
[13:26:26] kormat: notifies as in sends pages via VO?
[13:27:04] godog: no, nothing so severe. i guess emails
[13:27:19] this is going to be noisy until we fine-tune it, and i don't want to be spamming anyone else
[13:29:01] kormat: ah ok got it, yeah check out the monitoring::check_prometheus define in puppet, that should do it although it might require a new icinga contactgroup
[13:30:34] kormat: another thing to consider is that the define will want a prometheus server to query, so you'll need to iterate on codfw/eqiad with .each in puppet, there are a few examples in puppet
[13:30:49] i'll have a look, cheers
[13:31:37] np! LMK if a rabbit hole shows up
[13:46:50] kormat: FYI at the moment we don't have anything from icinga that sends email and doesn't page. It's either UI/IRC only or UI/IRC/PAGE/EMAIL.
[13:49:31] volans: and i'm guessing "IRC" there means "send all alerts to #-operations"?
[13:50:28] because if there was an option for UI/IRC(#-databases) that might work
[13:50:59] we have that option, yes, but it needs to be configured
[13:51:05] hmm, ok
[13:51:08] we have different irc logs on icinga that go to different chans
[13:51:20] but it's just a matter of creating the new "channel"
[13:51:37] ok cool
[13:52:00] it looks like the hard bit of all this is going to be puppet. weeee
[13:52:10] try git grep irc-releng
[13:52:13] for example
[13:52:39] it's a bit painful because it requires quite a few hardcoded bits
[13:52:59] so if not too spammy you could use the normal IRC in -operations
[13:53:07] alternative is to set it with disabled notifications at the start
[13:53:14] in that case you'll see it only in the icinga UI
[13:53:21] (and if you use things like aNag)
[13:54:02] depends on how long the experimentation part will be vs the effort to add the new dedicated chan vs whether this chan will be used for other things too
[13:54:46] (what's aNag?)
[13:55:24] <_joe_> kormat: a browser extension to ensure you're properly distracted every 5 minutes
[13:55:34] android app to have nagios alerts on your phone based on custom rules :D
[13:55:49] _joe_: that's _just_ what my concentration has been missing! ;)
[13:55:50] <_joe_> oh that's the android app, right, not the browser extension
[13:56:04] <_joe_> so this is an android app doing the same
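
Note on the replication-lag alert discussed above: this is a rough Python sketch of what a threshold check against one Prometheus server per site might look like. It is not the monitoring::check_prometheus define and does not reflect its parameters. The `/api/v1/query` endpoint and response shape are the standard Prometheus HTTP API; the base URLs, the metric name, the thresholds and the function names (`build_query_url`, `classify`, `check_site`) are all assumptions made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical per-site Prometheus base URLs (the real ones are not in the log).
PROMETHEUS_BY_SITE = {
    "eqiad": "http://prometheus.eqiad.example/ops",
    "codfw": "http://prometheus.codfw.example/ops",
}

# Assumed metric name and thresholds, in seconds of replication lag.
LAG_QUERY = "mysql_slave_status_seconds_behind_master"
WARNING_S = 5.0
CRITICAL_S = 10.0


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def classify(payload, warning, critical):
    """Map an instant-query response to (state, offenders).

    offenders is a list of (instance, value) pairs at or above `warning`.
    """
    if payload.get("status") != "success":
        return "UNKNOWN", []
    state = "OK"
    offenders = []
    for sample in payload["data"]["result"]:
        # Each vector sample is {"metric": {...}, "value": [timestamp, "string"]}.
        value = float(sample["value"][1])
        instance = sample["metric"].get("instance", "?")
        if value >= critical:
            state = "CRITICAL"
            offenders.append((instance, value))
        elif value >= warning:
            if state == "OK":
                state = "WARNING"
            offenders.append((instance, value))
    return state, offenders


def check_site(site):
    """Query one site's Prometheus and classify the result (needs network access)."""
    url = build_query_url(PROMETHEUS_BY_SITE[site], LAG_QUERY)
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return classify(payload, WARNING_S, CRITICAL_S)
```

Looping over `PROMETHEUS_BY_SITE` and calling `check_site(site)` for each mirrors the "iterate on codfw/eqiad" point above. Where the resulting state gets sent (contactgroup, IRC channel, UI only) stays an Icinga-side decision and is not covered by this sketch.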