[01:04:48] FIRING: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:14:48] RESOLVED: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:40:28] Would there be a good soul to help me write a prometheus unit test? I've been at it for hours, the use case seems simple, and yet I feel like maybe adding that monitor isn't worth *that* amount of pain
[09:40:31] thank you!
[09:50:14] Sure brouberol, could you share the patch on Gerrit? I'll take a look. PromQL unit tests aren’t (typically) too complex once you get used to them, but they do require a bit of confidence, especially in how they handle timing.
[09:51:51] Thanks! The patch is https://gerrit.wikimedia.org/r/c/operations/alerts/+/1271612 and my current issue is with the `heartbeating_scheduler_no_alert_expected` test case, for which no alert should fire, but `promtool test rules` disagrees
[09:52:23] the alert is for a gauge called airflow_scheduler_heartbeat, that should be ever-increasing, and I'm alerting for when it stays flat for too long
[09:56:49] seems like I was missing an `interval: 1m`
[10:03:13] is there a way to get debug information out of `promtool test rules`? The evaluated metrics, decision tree, anything more than success/failure?
[10:06:29] my main frustration with these unit tests is that they are very (IMO) finicky to get right and promtool gives us 0 introspection into what it's doing
[10:06:34] at least that I could find
[10:11:35] Are you running the tests using the instructions provided in the README? There are more tests executed via tox, other than those performed by `promtool test rules`. One of them checks the interval field.
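For readers following along, here is a minimal sketch of what a promtool unit test for a "heartbeat gauge stays flat" alert might look like, including the `interval: 1m` field mentioned above. It is not the content of the Gerrit patch: the rule file name, the alert name `AirflowSchedulerHeartbeatFlat` and the `instance` label value are hypothetical placeholders. The `promql_expr_test` section is one way to get a little visibility into what promtool evaluates, since a failing expression test reports the expected and actual samples.

```yaml
# Hypothetical promtool unit test file. The rule file name, alert name and
# label values are placeholders for illustration, not taken from the patch.
rule_files:
  - heartbeat_alerts.yaml   # hypothetical rule file

evaluation_interval: 1m

tests:
  - name: heartbeating_scheduler_no_alert_expected
    interval: 1m            # spacing between the samples in input_series
    input_series:
      # '0+1x30' expands to 0, 1, 2, ..., 30: one sample per minute for 30 minutes.
      - series: 'airflow_scheduler_heartbeat{instance="example-host"}'
        values: '0+1x30'

    # Sanity-check what promtool sees at a given time before testing the alert.
    promql_expr_test:
      - expr: airflow_scheduler_heartbeat
        eval_time: 30m
        exp_samples:
          - labels: 'airflow_scheduler_heartbeat{instance="example-host"}'
            value: 30

    alert_rule_test:
      - eval_time: 30m
        alertname: AirflowSchedulerHeartbeatFlat   # hypothetical alert name
        exp_alerts: []      # the gauge keeps increasing, so nothing should fire
```

Assuming the file is saved as `heartbeat_test.yaml` (also a placeholder name), it would be run with `promtool test rules heartbeat_test.yaml`. Adding a few `promql_expr_test` entries at different `eval_time` values is a cheap way to confirm that the input series and timing match what the alert expression expects.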
[11:47:29] yes, I'm running them within the docker image. I was running them via `promtool test rules` because `tox` takes quite some time, so I wanted a faster feedback loop to understand what was going on
[11:55:10] but that's a good point, I'll run the whole tox suite to see if it sheds some light on the failure
[12:19:23] Ack, if you'd like to run `promtool test rules` directly (which is totally fine for faster iteration, especially after a first full test run through Docker to rule out issues not covered by promtool), I'd suggest running it inside the container. This way, you're sure to use the same version as in production.
[12:33:40] yep, this is indeed what I'm doing
[12:34:38] but my main complaint here is that `promtool test rules` has no introspection, no way of checking what data was evaluated, no debug logs, etc. It's either pass/fail, which makes it hard to investigate *why* things aren't behaving as expected
[12:35:32] again, I realize I'm complaining about an upstream project, aka not much you can do here, but AFAIK I'm not the only one who's daunted at the prospect of adding a new monitor because of these kinds of papercuts, and it's a real, non-negligible time drain for us :/
[13:25:24] cccccclllnencgekkbnrvttcrhbuhhiecivurujrbiuh
[13:25:29] lol
[13:25:38] cccccdcgbkijgcnibugbiefrvhnedtenhclbltujilbf
[13:25:52] I wouldn't have said it better myself
[13:40:19] rotfl
[13:52:46] cccccdbcihrfeenuirjefjjkecjnhkjbhvgkgihgdlfh
[14:09:39] usual reminder that the yubi otp app can be turned off if you don't use it
[15:20:17] cccccdbcljjlevrgtlgnikrjfvetdfuubfineejrdctj
[15:33:17] hey o11y folks, do we have any API access to OCO schedule data?
[19:49:12] FIRING: ThanosCompactHalted: Thanos Compact on titan2001 has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted
[21:04:12] RESOLVED: ThanosCompactHalted: Thanos Compact on titan2001 has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted
[21:06:25] FIRING: SystemdUnitFailed: thanos-compact.service on titan2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:03:12] ^^ The `thanos tools bucket cleanup` command is currently running to fix overlapping blocks. It will take a while. I’ll restart Thanos Compact tomorrow morning once the cleanup is finished. Acknowledging the alert.