[17:54:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: pyrra-filesystem-notify-thanos.path on titan1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:54:54] <denisse>	 ^This happened yesterday. I think we'll need to modify the service.
[17:55:16] <herron>	 yeah it's hitting the start limit
[17:56:58] <denisse>	 herron: I'm thinking of increasing the start limit to something like "StartLimitBurst=10" and "StartLimitIntervalSec=60". What do you think?
[17:57:43] <denisse>	 Another option would be to throttle the service restart but I think that we could lose information by doing that.
[17:58:24] <denisse>	 I'm creating a task to track the issue and discuss possible solutions further.
[17:58:41] <herron>	 I'm thinking something like TriggerLimitIntervalSec too
[17:59:15] <herron>	 the idea is for thanos to be reloaded when pyrra outputs a new rule file, which happens several times quickly when onboarding a new slo
[18:00:20] <denisse>	 I've created a task to track it, adding more info to it: https://phabricator.wikimedia.org/T364645
[18:00:24] <herron>	 but we don't really want thanos to be sent a reload however many times in rapid succession
[18:00:57] <herron>	 denisse: thanks
[18:01:03] <denisse>	 Thanks for sharing context on the purpose of the service, I'm adding that info to the task.
[18:04:47] <denisse>	 Looking at the `systemd` docs, `TriggerLimitIntervalSec` seems like a feasible solution too!
[18:05:35] <denisse>	 However, I'm worried that it may require manual intervention because of how it behaves: "If the limit is hit, the socket unit is placed into a failure mode, and will not be connectible anymore until restarted."
[18:06:52] <denisse>	 I think that `PollLimitIntervalSec` may be a better option for this as it would apply a temporary slowdown as compared to the permanent failure state that `TriggerLimitIntervalSec` would put the unit in. https://www.freedesktop.org/software/systemd/man/latest/systemd.socket.html#PollLimitIntervalSec=
[18:06:56] <denisse>	 What do you think?
[18:10:31] <denisse>	 While we decide on the appropriate solution, I'll run reset-failed on the units and restart them.
[18:10:55] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: pyrra-filesystem-notify-thanos.path on titan1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:11:41] <herron>	 that seems ideal but I don't think it's available for a path
[18:12:51] <denisse>	 BTW, I didn't get to reset-failed or restart the units, as the issue self-resolved.
[18:13:10] <herron>	 that was me :)
[18:13:19] <denisse>	 herron: Ah, that makes sense!