[14:43:25] volans: I noticed there was an issue where wmf_auto_reimage failed to pick up the agent fingerprint, as the puppet service was not correctly masked. It reminded me of an old CR for wmf_auto_reimage and I wondered if you had further comments; I'm tempted to abandon it if not
[14:43:46] old CR: https://gerrit.wikimedia.org/r/c/operations/puppet/+/515051
[14:46:15] jbond42: let me see
[14:46:38] I totally forgot about it
[14:47:24] looking at the comments, neither of us agreed on whether it was a good idea or not, so I think we both just forgot about it :)
[14:48:26] yeah, making it more resilient might hide other issues like this one
[14:48:32] we discovered the other day
[14:48:56] I'm tempted to say abandon, not because it's a bad patch, but because it's "too good"™ :=)
[14:49:00] what are your thoughts?
[14:51:39] lol :) I'm still on the fence. Arguably the issues from the other day wouldn't have been an issue, as the script would have worked and puppet (I think) would come along and mask the puppet service on the first run. That said, we don't know what other issues it may mask in the future; also, the patch has been there ~a year and wouldn't have helped that much, so I'll abandon
[14:55:19] ack, fair enough
[14:55:27] and it will also still be available in case we need it ;)
[14:55:58] yes, exactly, it just moves it out of my gerrit queue :)
[14:56:54] +1
[15:03:53] spent some time earlier today with (free)ipmi and T253810, seems doable to me, although it'd mean requiring a SEL clear to clear the icinga alert
[15:04:23] the comma confused wm-bot2, didn't it; I meant T253810
[15:04:35] :(
[15:10:19] I think there was an issue with the bot, IIRC from -operations earlier today, godog
[15:10:57] ah, that explains it, thanks volans
[15:11:23] the script doesn't have a way to check the datetime of the SEL messages and discard those older than X?
[15:12:14] sometimes it's useful to see that 4 months ago we got the same error, and if we lose the mgmt logs and go past the log retention period we lose them completely, right?
[15:14:06] yes, that's correct
[15:15:17] if the retention is a problem we can certainly extend centrallog's retention; we have the current policy of 90d because it is easy, although I doubt it would be a problem to extend
[15:16:35] ipmi-sel has --date-range
[15:22:44] interesting, looks like the minimum is one day; how would the alerting/check_ipmi_sensor work in that case?
[15:22:56] IMO the things to do here are 1) deploy ipmiseld to enable syslogging, 2) enable ipmi SEL memory checks (should be able to just add '-T Memory' to the existing nrpe check_ipmi_sensor), 3) keep the icinga alert open until the SEL is cleared, 4) update the dcops process to include clearing the ipmi SEL when physical memory replacement is complete
[15:25:21] yeah, I was thinking something along those lines too
[15:25:31] godog: if we limit to the last day, there is no need to clear it (a sketch of such a date-filtered check follows below)
[15:26:27] paravoid: should we still put our updates in the pad?
[15:26:53] volans: the alert would auto-clear, say over a weekend, though (?)
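A minimal sketch of the date-filtered idea referenced above: run ipmi-sel with --date-range limited to the last day, so no SEL clear is needed for the alert to recover. This is an illustration, not the real check_ipmi_sensor: the nagios-style exit codes, the "Memory" substring match (standing in for '-T Memory' record-type filtering), and the MM/DD/YYYY date format for --date-range are assumptions that should be verified against the ipmi-sel(8) man page.

```python
#!/usr/bin/env python3
"""Sketch: alert only on recent SEL memory events via ipmi-sel --date-range."""
import subprocess
import sys
from datetime import date, timedelta

# One day is the minimum granularity of --date-range, per the discussion above.
WINDOW_DAYS = 1


def recent_memory_events(window_days: int = WINDOW_DAYS) -> list:
    """Return SEL lines mentioning Memory from the last window_days."""
    start = (date.today() - timedelta(days=window_days)).strftime("%m/%d/%Y")
    end = date.today().strftime("%m/%d/%Y")
    out = subprocess.run(
        ["ipmi-sel", f"--date-range={start}-{end}"],  # date format is an assumption
        capture_output=True, text=True, check=True,
    )
    # First line is the column header; a plain substring match stands in
    # for proper record-type filtering.
    return [l for l in out.stdout.splitlines()[1:] if "Memory" in l]


if __name__ == "__main__":
    events = recent_memory_events()
    if events:
        print(f"CRITICAL: {len(events)} recent SEL memory event(s)")
        print("\n".join(events))
        sys.exit(2)  # nagios/nrpe CRITICAL
    print("OK: no recent SEL memory events")
    sys.exit(0)  # nagios/nrpe OK
```

The trade-off raised in the discussion applies directly to this sketch: once an event ages out of the window, e.g. over a weekend, the alert auto-clears whether or not anyone acted on it.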
[15:28:02] it seems we need some side action anyway; if we get alerted during the weekend but it's non-paging, we would not have prevented the same issue
[15:28:44] if we treat it as a hardware failure like the disks, we can hook up the same handler that opens the broken-disks task
[15:28:56] but again, that would not prevent the issue over a weekend
[15:29:07] IMHO it's hard to say we wouldn't have prevented the issue
[15:30:30] if we're confident this alert should be acted upon within a few hours, it seems a candidate for paging to me :)
[15:31:38] still, it seems a bit awkward for a system in a broken state and needing manual attention to have the related alert automatically recover
[15:33:18] herron: agreed, and we could play with the dates or with the mechanism to clear the alert
[15:33:40] what I find to be a step back is clearing the SEL, which often has the history of the device since it was bought
[15:33:58] those are my 2 cents
[15:36:20] yeah, that's fair; I think if we get SEL to syslog then it is likely good enough, vs not alerting on SEL at all
[15:38:30] that will also be good for some edge cases where the SEL becomes full and must be cleared
[15:38:48] maybe we could log this to a long-lived file on the host as well; the volume should be very, very low
[15:40:08] that too; FWIW, to me the biggest downside is to actually have to clear the SEL, but I don't think there's a way around it
[15:41:04] using the current sensor script, probably not; if we wrap it, sure there are (see the wrapper sketch at the end of this log)
[16:27:15] cdanis: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cr2-eqord&service=BGP+status ;)
[16:28:23] Zayo?
[16:28:27] in eqord, really
[16:33:04] email sent
[19:35:25] https://labs.ripe.net/Members/cteusche/from-rex-to-ripestat "120 million requests per day" 😮
[21:10:11] 1.4k rps is pretty decent
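For the wrapper idea at 15:41:04 (clearing the icinga alert without clearing the SEL), one possible shape is to persist the highest SEL record ID that has been acknowledged and alert only on newer records. Everything here is an assumption for illustration: the state-file path, the ack workflow, and the parsing of ipmi-sel's default pipe-delimited 'ID | Date | Time | ...' output.

```python
#!/usr/bin/env python3
"""Sketch: alert on SEL records newer than the last acknowledged one,
so the SEL itself never has to be cleared."""
import pathlib
import subprocess
import sys

# Hypothetical state file, updated by the operator (or dcops tooling) when
# the hardware issue is resolved, instead of running a SEL clear.
STATE_FILE = pathlib.Path("/var/lib/ipmi-sel-check/last_acked_id")


def current_max_record_id() -> int:
    """Highest record ID in the SEL, or 0 if empty (assumes the default
    pipe-delimited ipmi-sel output with the record ID as the first field)."""
    out = subprocess.run(["ipmi-sel"], capture_output=True, text=True, check=True)
    ids = []
    for line in out.stdout.splitlines():
        first = line.split("|")[0].strip()
        if first.isdigit():  # skips the header row and blank lines
            ids.append(int(first))
    return max(ids, default=0)


def main() -> int:
    acked = int(STATE_FILE.read_text()) if STATE_FILE.exists() else 0
    newest = current_max_record_id()
    if newest > acked:
        print(f"CRITICAL: SEL records newer than last ack (id {newest} > {acked})")
        return 2  # nagios/nrpe CRITICAL
    print("OK: no SEL records since the last ack")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Acknowledging then means writing the current highest record ID into the state file: the alert clears, while the device's full SEL history stays intact, which addresses the 15:33:40 concern about losing the history since the device was bought.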