[09:05:25] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:25:25] FIRING: [2x] SystemdUnitFailed: curator_actions_cluster_wide.service on logstash1026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:52:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:53:05] ^ Looking.
[15:53:47] It looks like it was just a spike, the graph looks healthy now. I think it'll self resolve.
[16:05:25] RESOLVED: [2x] SystemdUnitFailed: curator_actions_cluster_wide.service on logstash1026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:57:25] FIRING: [2x] SystemdUnitFailed: pyrra-filesystem-notify-thanos.path on titan1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:57:43] ^ Taking a look.
[19:11:13] I've reset the failed units and restarted them again. I was unable to find anything in the logs that would cause the failure.
[19:23:55] RESOLVED: [2x] SystemdUnitFailed: pyrra-filesystem-notify-thanos.path on titan1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:48:48] for icinga scripts that return WARNING, CRITICAL etc, is it necessary to output to stdout, or is stderr acceptable?
[19:52:13] just wondering if this line is enough for icinga https://gitlab.wikimedia.org/repos/search-platform/sre/lvs_l2_checker/-/blob/pools.json/lvs_l2_checker.py?ref_type=heads#L132
[19:55:57] inflatador: nagios plugin guidelines says yes: https://nagios-plugins.org/doc/guidelines.html#PLUGOUTPUT
[19:56:46] cwhite thanks, extremely clear answer on those docs ;)
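
For reference, a minimal sketch of the convention discussed above: the status line goes to stdout, and the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) carries the state. The helper names `format_status` and `report` and the sample messages are hypothetical and are not taken from lvs_l2_checker.py.

```python
import sys

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def format_status(code, message):
    """Build the single status line the monitoring system displays."""
    return f"{STATUS_NAMES[code]}: {message}"


def report(code, message, stream=None):
    """Print the status line (stdout by default) and return the exit code.

    Hypothetical helper: `stream` is only a parameter so the function can be tested.
    """
    stream = sys.stdout if stream is None else stream
    print(format_status(code, message), file=stream)
    return code


# Typical use at the end of a check script (message is illustrative only):
#     sys.exit(report(CRITICAL, "pool unreachable"))
```

Anything written to stderr can still be useful for manual debugging, but per the guidelines linked above the status text itself should be on stdout.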