[01:28:25] FIRING: SystemdUnitFailed: sync_check_icinga_contacts.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:28:25] FIRING: [2x] SystemdUnitFailed: sync_check_icinga_contacts.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:31:33] ^ I've silenced it, I'll continue working on it tomorrow.
[08:38:28] inflatador: thank you for reaching out, from reading the task it seems to me you are interested in looking at resource usage per-unit ? we do have those stats, I'll update the task
[08:53:57] some cloud prometheus servers have started crashing with OOM errors, did you ever see something similar in prod? T370143
[08:53:58] T370143: toolforge: prometheus server died - https://phabricator.wikimedia.org/T370143
[09:05:27] dhinus: we did at the beginning of the year when there was increased load from k8s scraping, I'd guess it is either more metrics being ingested and/or heavy queries from users
[09:05:54] dhinus: upgrading prometheus and bumping the ram did help and we haven't seen similar issues since then
[09:09:35] thanks, I will try to upgrade prometheus and see if it helps
[09:09:42] it's surprising that it goes down very quickly
[09:09:52] so most of the time there are like 8GB free
[09:17:16] yeah I suspect it might be an expensive query if all of a sudden there's an increase in memory
[09:58:41] FIRING: [39x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[09:59:52] FIRING: [2x] ThanosRuleHighRuleEvaluationFailures: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[10:03:41] RESOLVED: [248x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[10:03:55] that was me ^ prometheus2006 upgrade
[10:04:52] RESOLVED: [2x] ThanosRuleHighRuleEvaluationFailures: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[13:15:10] godog just saw your update, thanks for following up!
[13:15:36] inflatador: sure no problem!