[13:00:02] cdanis: morning, do you have any thoughts on the icinga2001 failures during the weekend? (see my email too)
[13:00:29] I saw your email but did not take a look yet myself
[13:00:36] limiting the bw sounds good ofc
[13:05:44] hm, are you sure it was that though? there were lots of "too many open files" that day on those times
[13:06:11] not 100% sure, no, also when I had a chance to ssh it was already recovered
[13:06:14] so nothing to check live
[13:06:59] https://phabricator.wikimedia.org/P8478
[13:07:46] so the first problem message (10:34 UTC) is the usual "the checker script ran during a sync cronjob"
[13:08:04] maybe that was aggravated by the mdadm stuff
[13:08:12] yeah, it didn't take that much longer than usual
[13:08:18] the later ones, with the lagging external command checks, correlate very well with the too many open files messages
[13:08:19] that slowing things down was allowing for a quicker too many open files
[13:08:47] let's increase the open files today
[13:10:14] * volans having a look at it
[13:10:31] if you have a minute, can you offer opinions on what is horrible about https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508011/
[13:21:17] surely the commit message :-P
[14:00:19] cdanis: RE the above, it seems ok to me as a bandaid. A self-clearing alert ofc is not generically ideal, but I understand the purpose of it, to show something on IRC and Icinga that tells people that some data might be missing around that time
[14:01:09] I'm not sure if the scalar() is needed as the current dashboard doesn't have it, but I'm no prometheus expert ;)
[16:29:17] I have the right idea that to update check_prometheus rules -- which are exported resources generated on the machines with role::prometheus -- I need to run puppet there, wait, and then run puppet on the icinga hosts, right?
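
On the "too many open files" thread and the plan to raise the limit: the real fix on a host like icinga2001 would normally be puppetized (e.g. via a systemd LimitNOFILE override), but as a minimal sketch of how one might check how close a process is to its RLIMIT_NOFILE, the following Python snippet could be used. The pidfile path, PID handling, and threshold are hypothetical, not taken from the discussion above.

```python
#!/usr/bin/env python3
"""Rough sketch: compare a process's open file descriptors to its RLIMIT_NOFILE.

Assumptions (illustrative only): the daemon's PID is read from a hypothetical
pidfile path, and the 80% warning threshold is arbitrary.
"""
import os
import resource

PIDFILE = "/var/run/icinga2/icinga2.pid"  # hypothetical path


def fd_usage(pid: int) -> int:
    """Count currently open file descriptors for a process via /proc."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def nofile_limits(pid: int) -> tuple:
    """Return (soft, hard) RLIMIT_NOFILE for another process (Linux only)."""
    return resource.prlimit(pid, resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    with open(PIDFILE) as f:
        pid = int(f.read().strip())
    soft, hard = nofile_limits(pid)
    used = fd_usage(pid)
    print(f"pid={pid} open_fds={used} soft_limit={soft} hard_limit={hard}")
    if used > 0.8 * soft:
        print("warning: within 20% of the soft limit -- consider raising LimitNOFILE")
```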
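
On the scalar() question at 14:01: the practical difference is that scalar(...) collapses a single-element instant vector into a plain number, which matters if the consumer (such as a check script) expects exactly one unlabelled value. A minimal sketch of how to see this against Prometheus's HTTP query API follows; the Prometheus URL and the example expression are hypothetical, and this is not the query from the Gerrit change itself.

```python
#!/usr/bin/env python3
"""Rough sketch: the same expression queried with and without scalar(),
showing that the resultType changes from "vector" to "scalar".
"""
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus.example.org:9090"  # hypothetical endpoint


def query(expr: str) -> dict:
    """Run an instant query and return the 'data' section of the response."""
    url = f"{PROM}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]


if __name__ == "__main__":
    expr = "sum(rate(some_metric_total[5m]))"  # hypothetical expression
    for e in (expr, f"scalar({expr})"):
        data = query(e)
        # resultType is "vector" for the bare sum(), "scalar" once wrapped
        print(e, "->", data["resultType"], data["result"])
```

Note that scalar() returns NaN when the inner vector does not have exactly one element, so whether it is needed depends on what the check expects to receive.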