[13:00:02] cdanis: morning, do you have any thoughts on the icinga2001 failures during the weekend? (see my email too)
[13:00:29] I saw your email but did not take a look yet myself
[13:00:36] limiting the bw sounds good ofc
[13:05:44] hm, are you sure it was that though? there were lots of "too many open files" that day on those times
[13:06:11] not 100% sure, no, also when I had a chance to ssh it was already recovered
[13:06:14] so nothing to check live
[13:06:59] https://phabricator.wikimedia.org/P8478
[13:07:46] so the first problem message (10:34 UTC) is the usual "the checker script ran during a sync cronjob"
[13:08:04] maybe that was aggravated by the mdadm stuff
[13:08:12] yeah, it didn't take that much longer than usual
[13:08:18] the later ones, with the lagging external command checks, correlate very well with the too many open files messages
[13:08:19] that slowing things down was allowing for a quicker too many open files
[13:08:47] let's increase the open files today
[13:10:14] * volans having a look at it
[13:10:31] if you have a minute, can you offer opinions on what is horrible about https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508011/
[13:21:17] surely the commit message :-P
[14:00:19] cdanis: RE the above, it seems ok to me as a bandaid. A self-clearing alert ofc is not generically ideal, but I understand the purpose of it, to show something on IRC and Icinga that tells people that some data might be missing around that time
[14:01:09] I'm not sure if the scalar() is needed as the current dashboard doesn't have it, but I'm no prometheus expert ;)
[16:29:17] I have the right idea that to update check_prometheus rules -- which are exported resources generated on the machines with role::prometheus -- I need to run puppet there, wait, and then run puppet on the icinga hosts, right?
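
On the "too many open files" thread and the plan to raise the limit: the real fix on a host like icinga2001 would normally be puppetized (e.g. via a systemd LimitNOFILE override), but as a minimal sketch of how one might check how close a process is to its RLIMIT_NOFILE, the following Python snippet could be used. The pidfile path, PID handling, and threshold are hypothetical, not taken from the discussion above.

```python
#!/usr/bin/env python3
"""Rough sketch: compare a process's open file descriptors to its RLIMIT_NOFILE.

Assumptions (illustrative only): the daemon's PID is read from a hypothetical
pidfile path, and the 80% warning threshold is arbitrary.
"""
import os
import resource

PIDFILE = "/var/run/icinga2/icinga2.pid"  # hypothetical path


def fd_usage(pid: int) -> int:
    """Count currently open file descriptors for a process via /proc."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def nofile_limits(pid: int) -> tuple:
    """Return (soft, hard) RLIMIT_NOFILE for another process (Linux only)."""
    return resource.prlimit(pid, resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    with open(PIDFILE) as f:
        pid = int(f.read().strip())
    soft, hard = nofile_limits(pid)
    used = fd_usage(pid)
    print(f"pid={pid} open_fds={used} soft_limit={soft} hard_limit={hard}")
    if used > 0.8 * soft:
        print("warning: within 20% of the soft limit -- consider raising LimitNOFILE")
```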
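
On the scalar() question at 14:01: the practical difference is that scalar(...) collapses a single-element instant vector into a plain number, which matters if the consumer (such as a check script) expects exactly one unlabelled value. A minimal sketch of how to see this against Prometheus's HTTP query API follows; the Prometheus URL and the example expression are hypothetical, and this is not the query from the Gerrit change itself.

```python
#!/usr/bin/env python3
"""Rough sketch: the same expression queried with and without scalar(),
showing that the resultType changes from "vector" to "scalar".
"""
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus.example.org:9090"  # hypothetical endpoint


def query(expr: str) -> dict:
    """Run an instant query and return the 'data' section of the response."""
    url = f"{PROM}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]


if __name__ == "__main__":
    expr = "sum(rate(some_metric_total[5m]))"  # hypothetical expression
    for e in (expr, f"scalar({expr})"):
        data = query(e)
        # resultType is "vector" for the bare sum(), "scalar" once wrapped
        print(e, "->", data["resultType"], data["result"])
```

Note that scalar() returns NaN when the inner vector does not have exactly one element, so whether it is needed depends on what the check expects to receive.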