[03:53:34] FIRING: DiskSpace: Disk space titan2001:9100:/srv 1.996% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=titan2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[04:28:34] RESOLVED: DiskSpace: Disk space titan2001:9100:/srv 0.6775% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=titan2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[10:08:54] gehel, I'm taking a look at your patch now..
[10:16:23] tappof: thanks! let me know if you need more context
[10:24:09] gehel: So, a couple of things:
[10:24:16] I don’t think you need the max() by (instance) function in the rule expression, since you currently have only one series per instance, so it will always return a single value anyway.
[10:24:31] I’d suggest always specifying evaluation_interval and interval in your rule tests. It makes reasoning easier and, in certain versions, relying on defaults can lead to unexpected behaviour.
[10:24:56] values: '0+5x1000' ⇒ values: '0+60100x5'. The rate() function computes values per second, whereas the series values are per interval (1m). You need to manually compute the correct value so that rate() matches the intended threshold.
[10:26:49] Last one: you swapped the step with the length of the series (0+5x1000 vs 0+60100x5).
[10:28:38] I’m getting an alert firing in the test with the adjustments I just mentioned. The tests still aren’t passing because the label/annotation matching doesn’t align yet.
[10:29:03] tappof: thanks! Comments applied to the CR, but the tests are still not passing. I've probably misunderstood something
[10:29:19] I'm confused by all those different intervals, not sure what each is.
[10:31:08] Oh, I think I'm getting there...
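The arithmetic behind the '0+5x1000' vs '0+60100x5' remark can be sketched in a few lines of Python. This is only an illustration, not the patch under review: the helper names are hypothetical, the 1000/s threshold is an assumption (the log never states the rule's threshold), and per_second_rate() is a simplified stand-in for rate() that ignores Prometheus' window extrapolation.

```python
# Sketch of the promtool expanding notation 'a+bxn' and the per-second
# rate it produces when samples are spaced one test interval apart.
# All names here are hypothetical helpers, not part of any Prometheus API.

def expand(notation: str) -> list[float]:
    """Expand 'a+bxn' into n+1 samples: a, a+b, ..., a+n*b."""
    start_and_step, count = notation.split("x")
    start, step = start_and_step.split("+")
    return [float(start) + i * float(step) for i in range(int(count) + 1)]


def per_second_rate(samples: list[float], interval_seconds: float) -> float:
    """Average per-second increase from the first to the last sample.

    Simplified stand-in for rate(): no extrapolation, no counter resets.
    """
    elapsed = (len(samples) - 1) * interval_seconds
    return (samples[-1] - samples[0]) / elapsed


INTERVAL_SECONDS = 60.0      # interval: 1m, as mentioned in the log
ASSUMED_THRESHOLD = 1000.0   # assumption: not stated anywhere in the log

original = expand("0+5x1000")    # step 5, 1000 repetitions
adjusted = expand("0+60100x5")   # step 60100, 5 repetitions

original_rate = per_second_rate(original, INTERVAL_SECONDS)
adjusted_rate = per_second_rate(adjusted, INTERVAL_SECONDS)

print(f"original: {len(original)} samples, {original_rate:.4f}/s")
print(f"adjusted: {len(adjusted)} samples, {adjusted_rate:.2f}/s")
print("adjusted exceeds assumed threshold:", adjusted_rate > ASSUMED_THRESHOLD)
```

With a step of 5 per minute the per-second rate is only 5/60, so a rule comparing rate() against a large threshold never fires. A step of 60100 per minute gives slightly more than 1000 per second, which is why the step has to be scaled by the interval length in seconds.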
[10:33:11] I think that https://gerrit.wikimedia.org/r/c/operations/alerts/+/1207790 might be working. tappof if you have 5 more minutes to review, that would be super welcomed!
[10:33:16] gehel: you can run the test locally following the instructions in the README file ...
[10:34:17] yep, I have a passing test locally. But that doesn't mean that the CR entirely makes sense :)
[11:35:13] gehel: I'm not really into the DB domain, so maybe you could involve someone from Data Persistence. Anyway, I'd suggest using rate() instead of irate() with a longer range [5m] and a for: parameter of at least 2 minutes. This helps reduce noise, but it really depends on what you're looking to monitor. You can always tune the rule further in a second iteration.
[20:33:23] hi o11y - who's a good reviewer for https://gerrit.wikimedia.org/r/1208437 adding some alertmanager config?
[20:34:13] rzl: I'm taking a look.
[20:34:16] thanks!
[20:43:31] ah thanks denisse for the quick turnaround, much appreciated :)
[21:26:41] FIRING: [13x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus2005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[21:31:41] FIRING: [16x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus2005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures