[02:11:35] FIRING: ThanosCompactIsDown: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[06:11:35] FIRING: ThanosCompactIsDown: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[07:50:50] RESOLVED: ThanosCompactIsDown: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[08:26:19] I've restarted thanos compact and thus the alert recovered, my bad though as I should have investigated the alert not showing up in karma before the alert resolved
[08:38:46] Anyone willing to +1 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1075027 please? I _think_ it's a pretty simple improvement :)
[08:53:39] +1ed Emperor
[08:54:07] <3
[08:54:20] enjoy your metrics aggregated!
[13:38:56] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:33:56] FIRING: [2x] SystemdUnitFailed: opensearch-dashboards.service on logstash1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:35:52] now that's interesting ^^
[15:36:16] Yes.
[15:36:27] I wonder if it could be related to the failovers.
[15:36:43] opensearch-dashboards.service was fine, however run-dashboards-backup.service was not. The alert appears to have kept the description on resend
[15:38:28] Both are SystemdUnitFailed, but I guess alertmanager has to decide what description to send and prefers to keep the last one even if a different alert under the same name is firing?
[15:39:28] * cwhite watches to see how the recoveries come through
[15:48:56] FIRING: [2x] SystemdUnitFailed: run-dashboards-backup.service on logstash1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:51:07] very interesting. that alert is for SystemdUnitFailed curator_actions_cluster_wide.service on logstash2026
[15:51:46] * cwhite corrects it and continues watching
[15:53:44] also, there are two alerts in karma and the difference is the @receiver label: (olly-irc|default)
[15:53:56] RESOLVED: [2x] SystemdUnitFailed: run-dashboards-backup.service on logstash1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:54:35] there we go - this ^^ was in response to fixing curator_actions_cluster_wide.service on logstash2026
[15:54:43] * cwhite scratches head