[07:29:25] FIRING: SystemdUnitFailed: grafana-loki.service on grafana2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:59:25] RESOLVED: SystemdUnitFailed: grafana-loki.service on grafana2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:00:55] FIRING: SystemdUnitFailed: grafana-loki.service on grafana2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:55:55] RESOLVED: SystemdUnitFailed: grafana-loki.service on grafana2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:04:48] FIRING: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:14:48] RESOLVED: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:01:41] Hi, is there a way to automatically vary alerts based on active DC?
[15:01:41] Since the switchover, this (newish) alert is firing and flapping:
[15:01:41] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=alertname%3DEventgateProduceRateStop
[15:01:52] I'd like to disable this alert for the inactive DC
[19:05:57] ottomata: I don't know of any existing mechanisms to do that, but, I think it's actually quite simple to create one: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1191483
[19:17:59] (reviews welcome!)
[20:00:15] wow
[20:00:15] and
[20:00:49] so, just by dropping a file in /var/lib/prometheus/node.d/mediawiki-conftool-state.prom, prometheus node exporter will export it?
[20:00:52] that is pretty sweet
[20:00:57] is that true on all nodes?
[20:13:53] ottomata: yep!
[20:14:22] we make pretty heavy use of it for all sorts of ad-hoc monitoring https://codesearch.wmcloud.org/search/?q=prometheus%2Fnode%5C.d&files=&excludeFiles=&repos=
[20:18:19] cool!
[20:21:46] 💙root@config-master1001.eqiad.wmnet /etc/confd 🕟🙃 cat /var/lib/prometheus/node.d/mediawiki-conftool-state.prom
[20:21:47] # TYPE mediawiki_wmf_master_datacenter gauge
[20:21:49] # HELP mediawiki_wmf_master_datacenter Constant 1 value with a site label for the primary datacenter
[20:21:51] mediawiki_wmf_master_datacenter{site="codfw"} 1
[20:22:06] .... oh no
[20:22:21] I should not have used the `site` label for this 😅
[20:23:04] mediawiki_wmf_master_datacenter{cluster="misc", exported_site="codfw", instance="config-master1001:9100", job="node", site="eqiad"}
[20:23:20] I'll make it `datacenter` I think
[20:34:01] {{done}}
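
For readers unfamiliar with the mechanism discussed above: the node exporter's textfile collector exposes any `*.prom` file found in its configured directory (here `/var/lib/prometheus/node.d/`). The `exported_site` label in the scraped series is what Prometheus produces when a scraped label collides with a target label and `honor_labels` is left at its default of false, which is why a non-colliding name such as `datacenter` avoids the problem. The following is only a rough sketch of producing such a file, not the mechanism in the linked Gerrit change (the prompt suggests that one is driven from `/etc/confd`). The function names `render_master_dc_metric` and `write_prom_file` are hypothetical, the reworded HELP text is an assumption, and the write-then-rename step is a common precaution against the exporter reading a half-written file rather than something stated in the log.

```python
import os
import tempfile


def render_master_dc_metric(datacenter: str) -> str:
    """Render the textfile payload marking the primary datacenter.

    The metric name comes from the log; the `datacenter` label is the
    rename proposed at 20:23:20, and the HELP wording is an assumption.
    """
    name = "mediawiki_wmf_master_datacenter"
    return (
        f"# HELP {name} Constant 1 value with a datacenter label "
        "for the primary datacenter\n"
        f"# TYPE {name} gauge\n"
        f'{name}{{datacenter="{datacenter}"}} 1\n'
    )


def write_prom_file(directory: str, filename: str, payload: str) -> str:
    """Write payload to directory/filename via a temp file and rename.

    The temp file ends in .tmp, so a collector that only reads *.prom
    files skips it until os.replace() moves it into place.
    """
    final_path = os.path.join(directory, filename)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(payload)
        os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; make it readable
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path


if __name__ == "__main__":
    # Demo against a throwaway directory instead of the real node.d path.
    demo_dir = tempfile.mkdtemp()
    path = write_prom_file(
        demo_dir,
        "mediawiki-conftool-state.prom",
        render_master_dc_metric("codfw"),
    )
    with open(path) as handle:
        print(handle.read(), end="")
```

On a real host the directory would be whatever the exporter's `--collector.textfile.directory` flag points at, and the datacenter value would come from the source of truth for the active DC rather than a hard-coded string.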