[10:23:34] FIRING: DiskSpace: Disk space titan2001:9100:/srv 0.3483% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=titan2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[10:38:34] RESOLVED: DiskSpace: Disk space titan2001:9100:/srv 0.004537% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=titan2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:25:12] FIRING: ThanosCompactHalted: Thanos Compact has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted
[12:15:38] ^^ related to T410152
[12:15:38] T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152
[14:04:13] Hey! I need some help to write an alert. I have my attempt here: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1207790. The tests are passing, but it's the wrong check at the moment. I'd like to alert when the rate of query is exceeding 1000. I can't understand how to write that. Or how to make the test pass.
[15:14:46] Heu gehel, I'll try to take a look at it tomorrow.
[15:14:57] s/Heu/Hey
[15:25:27] FIRING: ThanosCompactHalted: Thanos Compact has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted
[16:01:18] tappof: thanks!
[16:01:30] tappof: ping me if you want more context.
[16:02:15] I'm somewhat frustrated that what looks like a very simple alert (send an alert if this line goes over this threshold) is so difficult for me to implement :/
[16:21:22] gehel: the `values` in the `input_series`, which is interval associated with that?
[16:21:38] if you are only getting values every 5 minutes, the 5-minute rate will be NaN :)
[16:22:25] FIRING: SystemdUnitFailed: stunnel4.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:27:25] RESOLVED: SystemdUnitFailed: stunnel4.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:32:40] cdanis: I'm not at all clear in what the different interval means. So maybe I should play with that?
[16:45:56] cdanis: do you have a pointer to documentation of those different intervals?
[16:56:46] gehel: it looks like the interval in the test_group defaults to the evaluation_interval, which, is what you want
[16:56:49] https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/
[16:57:31] there's also docs there on a lot of syntactic sugar for writing the values data
[17:27:49] cdanis: thanks for the link. I'll give this another try tomorrow. I might need some pairing with someone who understands this better. A lot of the words have no meaning to me...
[17:39:49] commented very directly on the patch :)
[19:25:27] FIRING: ThanosCompactHalted: Thanos Compact has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted
[22:00:12] RESOLVED: ThanosCompactHalted: Thanos Compact has failed to run and is now halted. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactHalted
[22:00:25] FIRING: SystemdUnitFailed: thanos-compact.service on titan2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:01:01] FIRING: ThanosCompactIsDown: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[23:29:16] RESOLVED: SystemdUnitFailed: thanos-compact.service on titan2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:29:17] RESOLVED: ThanosCompactIsDown: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
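
Editor's note on the 16:21 exchange: the point about sample spacing versus the rate window may be easier to see with numbers. The Python below is a rough sketch, not Prometheus code and not the rule from the Gerrit patch. The names `simple_rate` and `alert_fires`, the 1200/s test counter, and the left-open window are all assumptions made for illustration. Real `rate()` also handles counter resets and extrapolates to the window edges, and where the chat says "NaN" this sketch returns `None` to mean "no usable result".

```python
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: Sequence[Sample], eval_time: float,
                window: float) -> Optional[float]:
    """Per-second increase between the first and last sample in the window.

    The window is modeled as (eval_time - window, eval_time], which is an
    assumption of this sketch. Returns None when fewer than two samples
    fall inside it, since a slope needs two points.
    """
    in_window = [(t, v) for t, v in samples
                 if eval_time - window < t <= eval_time]
    if len(in_window) < 2:
        return None
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


def alert_fires(rate: Optional[float], threshold: float = 1000.0) -> bool:
    """A threshold alert can only fire when a rate could be computed."""
    return rate is not None and rate > threshold


# Hypothetical counter growing by 1200 per second, sampled two ways.
dense: List[Sample] = [(float(t), 1200.0 * t) for t in range(0, 601, 60)]
sparse: List[Sample] = [(float(t), 1200.0 * t) for t in range(0, 601, 300)]

dense_rate = simple_rate(dense, eval_time=600.0, window=300.0)
sparse_rate = simple_rate(sparse, eval_time=600.0, window=300.0)

print("1m samples:", dense_rate, alert_fires(dense_rate))
print("5m samples:", sparse_rate, alert_fires(sparse_rate))
```

With one sample per minute, the 5-minute window holds five samples and the sketch yields 1200 per second, so the threshold check passes. With one sample every 5 minutes, only a single sample lands in the window, no rate can be computed, and the alert never fires however fast the counter grows. In a rule test this suggests keeping the test `interval` well below the range used inside `rate()`.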