[08:02:19] <volans>	 jbond42, moritzm: so the new systemd timer for debmonitor failed on gerrit1001 because of "502 Proxy Error" (it can happen). But because the timer is once a day, it leaves the systemd failed unit for a whole day, with Icinga alerting. I was wondering whether we could add some retry logic to the unit so that it retries 2~3 times with 1~5 minutes of sleep in between?
[08:02:25] <volans>	 thoughts?
[08:05:34] <jbond42>	 volans: not too familiar with the job, I think adding retry logic seems fine, however I wonder why it times out at all, I'm guessing it would use TCP to localhost?
[08:07:43] <volans>	 jbond42: it failed at 00:00:03
[08:08:34] <volans>	 apache reload triggered by logrotate?
[08:08:49] <volans>	 shouldn't be, but maybe something similar
[08:08:56] <jbond42>	 did it also have similar failures (which went unnoticed) when it was in cron?
[08:09:30] <volans>	 it's possible, it's a daily job that is there just to reconcile data in case an update on apt-get actions went missing because debmonitor was down
[08:10:00] <jbond42>	 ack, I'll take a look
[08:10:02] <volans>	 so, being a  failsafe, hadly noticeable if one day for one or few host failed
[08:10:06] <volans>	 *hardly
[08:16:58] <moritzm>	 yeah, I'm pretty sure since this job runs fleet-wide we had similar failures before with the cron which simply went unnoticed
[08:17:47] <jbond42>	 wait, which job are we talking about? I was looking at debmonitor-maintenance-gc which only runs on the server
[08:18:39] <moritzm>	 I think Riccardo meant the daily ingestion run
[08:18:52] <moritzm>	 it basically commits the dpkg state of a server to debmonitor
[08:19:07] <moritzm>	 all updates via apt are submitted with an apt hook
[08:19:21] <moritzm>	 but there's no such hook for "dpkg -i"
[08:19:51] <volans>	 jbond42: the new systemd timer for debmonitor client, it failed on gerrit1001  tonight
[08:20:02] <moritzm>	 so daily the full state is ingested to catch up with such edge cases (or e.g. if the apt hook failed for some reason)
[08:21:34] <jbond42>	 ahh ok thanks, that makes sense, that's why I couldn't see it on debmonitor1002 :), also the debmon-client job got changed in Feb so at least it's not super flaky
[08:22:17] <jbond42>	 further, it looks like logrotate did run at 00:00:03 on debmon1002 so that seems like the most likely cause
[08:24:48] <volans>	 the other option is to add some retry logic in the client for the requests POST, that's sensible too. The retry at unit level would cover also cases of issues with apt
[08:24:56] <volans>	 so dunno what's best here :)
[08:25:24] <volans>	 too bad we can't use wmflib.requests.http_session here
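The client-side option mentioned above (retrying the POST itself) could look roughly like the following. This is only a minimal sketch using the standard library: `post_with_retry` and its `send` callable are hypothetical names, not part of the debmonitor client, and a real client would catch narrower exceptions than `Exception`.

```python
import time


def post_with_retry(send, attempts=3, delay=60.0, sleep=time.sleep):
    """Call send() until it succeeds or the attempts run out.

    send is a hypothetical zero-argument callable that performs the POST
    and raises an exception on failure (for example on a 502 response).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except Exception as exc:  # assumption: any exception is retryable
            last_error = exc
            if attempt < attempts:
                # Wait before the next try; no sleep after the final failure.
                sleep(delay)
    raise last_error
```

Passing `sleep` as a parameter keeps the helper easy to exercise without actually waiting between attempts.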
[08:29:02] <jbond42>	 for now I have sent a band-aid (https://gerrit.wikimedia.org/r/c/operations/puppet/+/679263) but I think adding retry logic makes sense, I'll also look at why the logrotate caused this, could be that either envoy or apache doesn't reload as cleanly as nginx
[08:32:51] <volans>	 if only gerrit was loading for me...
[08:34:02] <moritzm>	 works for me
[08:35:32] <volans>	 works now, took a minute though
[08:42:39] <volans>	 jbond42: FYI the GC cron was not removed, I think because you put it absent without the command, which is actually needed to absent the resource IIRC
[08:43:16] <jbond42>	 volans: it shouldn't be but will take a look
[08:43:43] <volans>	 sorry for bothering you :)
[08:45:08] <jbond42>	 volans: it does need the user to remain though :), I'll remove them manually, thanks
[08:45:28] <volans>	 no prob, thank you
[08:47:06] <jbond42>	 volans: in relation to adding retry, one issue here is that the data is sent as a POST, which strictly speaking shouldn't get retried, however I think in this case it wouldn't really matter if the same data got submitted twice, is that right?
[08:48:22] <volans>	 yes, sending it twice would just ensure the same "state" for that host is present in debmonitor, we don't keep track of the past, just the "current" status
[08:48:24] <moritzm>	 yes, it's idempotent
[08:48:31] <volans>	 so technically it's idempotent
[08:48:32] <jbond42>	 ack thanks
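To illustrate why resubmitting is harmless here: if the server only keeps the current state per host, applying the same submission twice leaves the stored data exactly as it was after the first one. A toy sketch of that property, where the in-memory `store`, the `submit_state` function, the host name and the package data are all made up for illustration and say nothing about how debmonitor stores data:

```python
def submit_state(store, host, packages):
    """Replace the recorded package state for host; no history is kept."""
    store[host] = dict(packages)
    return store


# Hypothetical host and package data, for illustration only.
state = {"examplepkg": "1.0"}

once = submit_state({}, "host1001", state)
twice = submit_state(submit_state({}, "host1001", state), "host1001", state)

# Submitting the same state a second time changes nothing.
assert once == twice
```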
[08:48:36] <moritzm>	 one other alternative:
[08:48:53] <volans>	 I think we have 2 approaches here:
[08:48:55] <moritzm>	 we could also extend systemd::timer::job with a flag to make it not fail
[08:49:13] <moritzm>	 and then set it for the debmonitor reconciliation run
[08:49:21] <volans>	 - a general retry logic for timers, to reduce systemd icinga spam
[08:49:31] <moritzm>	 and some other systemd timer jobs probably have the same lax requirements
[08:49:42] <volans>	 - what moritz said, basically making timers behave like cron, not failing but maybe sending an email
[08:49:52] <moritzm>	 lax enough not to warrant failing the entire systemd state
[08:49:56] <volans>	 - add the retry in the debmonitor client, easier but covers just this one case
[08:51:39] <jbond42>	 ack, I can check that as well, I think we almost have that with the email flag
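The general "retry the whole job" idea from the list above, which would also cover failures outside the HTTP request (such as issues with apt), can be approximated by a small wrapper around the command the timer runs. This is a sketch under assumptions: `run_with_retry` is a hypothetical helper, the defaults simply mirror the "2~3 times with 1~5 minutes of sleep" suggestion, and it is not how systemd::timer::job works.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=120.0, sleep=time.sleep):
    """Run cmd (a list of arguments) until it exits 0 or attempts run out.

    Returns the exit code of the last run, so the calling unit still
    fails if every attempt failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    returncode = 1
    for attempt in range(1, attempts + 1):
        returncode = subprocess.run(cmd).returncode
        if returncode == 0:
            break
        if attempt < attempts:
            # Pause between attempts, but not after the last one.
            sleep(delay)
    return returncode
```

Returning the last exit code keeps the alerting behaviour intact: the unit is only marked failed once all attempts have been exhausted.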
[17:34:01] <volans>	 FYI I've replied to the PXE task
[21:43:41] <legoktm>	 fyi debmonitor-client.service failed on ml-serve2004 also with a 502 Bad gateway error at ~21:21
[21:45:12] <volans>	 legoktm: can you just restart it please? j.bond is looking at it
[21:46:06] <volans>	 we changed the server side and have some ideas on what might cause it
[21:46:08] <legoktm>	 already did :)
[21:51:45] <volans>	 thanks a lot