[17:54:25] FIRING: [2x] SystemdUnitFailed: pyrra-filesystem-notify-thanos.path on titan1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:54:54] ^This happened yesterday. I think we'll need to modify the service.
[17:55:16] yeah it's hitting the start limit
[17:56:58] herron: I'm thinking of increasing the start limit, to something like "StartLimitBurst=10" and "StartLimitIntervalSec=60". What do you think?
[17:57:43] Another option would be to throttle the service restart but I think that we could lose information by doing that.
[17:58:24] I'm creating a task to track the issue and discuss possible solutions further.
[17:58:41] I'm thinking something like TriggerLimitIntervalSec too
[17:59:15] the idea is for thanos to be reloaded when pyrra outputs a new rule file, which happens several times quickly when onboarding a new slo
[18:00:20] I've created a task to track it, adding more info to it: https://phabricator.wikimedia.org/T364645
[18:00:24] but we don't really want thanos to be sent a reload however many times in rapid succession
[18:00:57] denisse: thanks
[18:01:03] Thanks for sharing context on the purpose of the service, I'm adding that info to the task.
[18:04:47] Looking at the `systemd` docs, `TriggerLimitIntervalSec` seems like a feasible solution too!
[18:05:35] However, I'm worried that it may require manual intervention because of how it behaves: "If the limit is hit, the socket unit is placed into a failure mode, and will not be connectible anymore until restarted."
[18:06:52] I think that `PollLimitIntervalSec` may be a better option for this, as it would apply a temporary slowdown rather than the permanent failure state that `TriggerLimitIntervalSec` would put the unit in. https://www.freedesktop.org/software/systemd/man/latest/systemd.socket.html#PollLimitIntervalSec=
[18:06:56] What do you think?
[18:10:31] While we decide on the appropriate solution I'll reset-failed the units and restart them.
[18:10:55] RESOLVED: [2x] SystemdUnitFailed: pyrra-filesystem-notify-thanos.path on titan1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:11:41] that seems ideal but I don't think it's available for a path
[18:12:51] BTW, I didn't manage to reset-failed or restart the units, as the issue self-resolved.
[18:13:10] that was me :)
[18:13:19] herron: Ah, that makes sense!
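
For reference, the two options discussed above could look roughly like the drop-ins below. This is only a sketch, not the deployed config. The service unit name and the drop-in file paths are assumptions inferred from the `.path` unit name. The `StartLimit*` values are the ones proposed at 17:56:58. The `TriggerLimit*` values are purely illustrative, since none were proposed in the conversation. `PollLimitIntervalSec` is left out because it is documented for socket units only.

```ini
# Hypothetical drop-in for the triggered service (unit name assumed):
# /etc/systemd/system/pyrra-filesystem-notify-thanos.service.d/start-limit.conf
[Unit]
# Allow up to 10 starts within 60 seconds before the start limit is hit.
StartLimitIntervalSec=60
StartLimitBurst=10

# Hypothetical drop-in for the path unit (illustrative values only):
# /etc/systemd/system/pyrra-filesystem-notify-thanos.path.d/trigger-limit.conf
[Path]
# Limit how often the path unit may activate its service per interval.
# If this limit is hit, the path unit enters a failure mode until restarted.
TriggerLimitIntervalSec=10
TriggerLimitBurst=20
```

After adding a drop-in, `systemctl daemon-reload` is needed for it to take effect. Note that neither option coalesces rapid successive rule-file writes into a single thanos reload; they only change when the limit is reached.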