[08:59:27] godog: is there a sane way to deploy a change that renames an existing alert which has a ton of downtimes active?
[09:01:30] kormat: good question, not afaik, as in "be ready to re-add the downtimes" isn't sane
[09:01:46] that's what i was afraid of :)
[09:02:04] in this case i ended up waiting 10 days until there was only one active downtime
[09:02:12] small note - also alert disablings - while there should not be a lot of those that are not automated
[09:02:36] there could be some, like for very old hw that is about to be decommissioned or something we don't want alerts for
[09:03:58] godog: i really really miss the prom alertmanager approach, where you could just duplicate the existing silence(s) with the new name, and then deploy.
[09:05:39] kormat: https://i.imgflip.com/4bprwg.jpg
[09:06:12] that is to say, "same here"
[09:06:25] godog: :D i look forward to that. is there a rough timeline yet?
[09:07:12] that would fix I think way more than that, I want to believe!
[09:07:17] kormat: if you still need to do it though I have a workaround that would allow you to do it
[09:07:27] needs a bit of work, but we have most pieces handy
[09:08:35] kormat: yes, this quarter to have all the bits in place and show all icinga alerts, in "read only" essentially, then likely next quarter start moving alerts over
[09:09:12] I'd like to minimize the transition where we're like "oh wait, should I silence that alert in X or Y?"
[09:12:19] godog: 🎉
[09:12:34] volans: i fear what kind of "elegant solution" you may have ;)
[09:14:09] super elegant, we have a parser for status.dat, from which it should be trivial to match all the alerts disabled and downtimed that match your name pattern, collect the names, do the rename, and then run the downtime/disable commands for those as soon as you merge the new check and run puppet on the icinga host
[09:15:16] volans: there's a command to downtime a _service_ rather than a host?
[09:15:44] not a CLI, just echo a string line to the icinga command file
[09:15:54] oh lord
[09:16:04] with a specific format, but spicerack can do that formatting for you
[09:16:19] basically you could have a cookbook do that if you're interested :D
[09:16:40] not sure it's time well spent :-P
[09:17:05] yeah agreed
[09:17:24] i think i'll just put my faith in hoping that we have alertmanager before i need to do something like this again ;)
[09:31:32] morning all, back from vacation, just going through the backlog but feel free to ping if i can help with anything
[09:32:01] welcome back jbond42
[09:33:12] thx
[11:31:54] jbond42: welcome back!
[11:32:19] I have a task that could use your input; I will add you to it but it's *not urgent*, so whenever you find time
[11:33:36] thanks sukhe and yes sounds good, ping me later in the week if you don't see any feedback :)
[11:33:47] thank you!
[12:06:36] kormat: I've done what you're asking with shell one-liners...
[12:15:30] cdanis: you know those threads on twitter where people post a picture and ask "What is this? (wrong answers only)". i feel that's your approach to everything scriptable
[12:16:26] it actually seemed quite safe as the output formatting is done by a very dumb piece of C code, so you can actually rely on line number offsets within the block
[12:19:26] let me consult my notes. ah, here we go: ಠ_ಠ
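(Editor's note: a minimal sketch of the Alertmanager approach mentioned at [09:03:58], i.e. copying the active silences for the old alert name and re-creating them under the new name before deploying the rename. It uses the standard Alertmanager v2 HTTP API; the Alertmanager URL and the alert names are placeholders, not real production values.)

```python
#!/usr/bin/env python3
"""Duplicate active Alertmanager silences under a renamed alert.

Sketch only: the URL and alert names below are assumptions for illustration.
"""
import requests

ALERTMANAGER = "http://alertmanager.example.org:9093"  # assumption: local AM instance
OLD_NAME = "OldAlertName"   # hypothetical current alertname
NEW_NAME = "NewAlertName"   # hypothetical renamed alertname

# Fetch all silences and keep only the active ones matching the old alertname.
silences = requests.get(f"{ALERTMANAGER}/api/v2/silences").json()
for s in silences:
    if s["status"]["state"] != "active":
        continue
    if not any(m["name"] == "alertname" and m["value"] == OLD_NAME for m in s["matchers"]):
        continue
    # Copy the silence, pointing the alertname matcher at the new name.
    matchers = [
        dict(m, value=NEW_NAME) if m["name"] == "alertname" else m
        for m in s["matchers"]
    ]
    new_silence = {
        "matchers": matchers,
        "startsAt": s["startsAt"],
        "endsAt": s["endsAt"],
        "createdBy": s["createdBy"],
        "comment": f"copy of silence {s['id']} for renamed alert",
    }
    resp = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=new_silence)
    resp.raise_for_status()
    print(f"duplicated silence {s['id']} -> {resp.json()['silenceID']}")
```

The same thing can be done interactively with `amtool silence query` and `amtool silence add`, but scripting it keeps the start/end times and comments of the original silences.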
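(Editor's note: a sketch of the Icinga workaround volans describes at [09:14:09]-[09:16:04]: after collecting the downtimed/disabled services from status.dat, re-apply the downtimes on the renamed checks by writing external-command lines to the Icinga command file. The command-file path, host, and service names are illustrative assumptions; in practice spicerack's icinga module does this formatting.)

```python
#!/usr/bin/env python3
"""Re-apply a service downtime via the Icinga external command file.

Sketch only: paths and names are placeholders, not the spicerack implementation.
"""
import time

COMMAND_FILE = "/var/lib/icinga/rw/icinga.cmd"  # assumption: default command-file path


def schedule_service_downtime(host, service, duration_s, author, comment):
    """Format and submit a SCHEDULE_SVC_DOWNTIME external command."""
    now = int(time.time())
    # Standard Icinga/Nagios external command format:
    # [time] SCHEDULE_SVC_DOWNTIME;host;service;start;end;fixed;trigger_id;duration;author;comment
    line = (
        f"[{now}] SCHEDULE_SVC_DOWNTIME;{host};{service};"
        f"{now};{now + duration_s};1;0;{duration_s};{author};{comment}\n"
    )
    with open(COMMAND_FILE, "w") as cmd_file:  # the command file is a named pipe
        cmd_file.write(line)


if __name__ == "__main__":
    # Hypothetical renamed check being re-downtimed after the rename is deployed.
    schedule_service_downtime(
        "db1001", "MariaDB replica lag (new name)", 86400, "volans", "carry over downtime"
    )
```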
[16:21:00] can someone think of a reason why I can have spikes of TCP errors (retransmits) when bandwidth usage is low? (but no errors when network usage is high)?
[16:21:25] what's happening on the other end?
[16:21:32] let me see
[16:22:05] well, it had stopped transmitting
[16:34:36] buff, lots of C buffer errors
[16:34:46] going to do a clean restart
[16:34:56] that can happen due to cpu saturation
[19:42:41] Failed to call 'spicerack.remote.wait_reboot_since' [1/25, retrying in 10.00s]: Cumin execution failed (exit_code=2)
[19:43:28] "Successful Puppet run too old"
[19:45:05] mutante: might be that way if the host didn't reboot
[19:45:09] IIRC
[19:45:38] it also says "Found reboot since 2020-08-17 19:39:53.060768" though
[19:46:12] mutante: but did it end there or was that just the first attempt?
[19:46:17] because if it's polling it's ok
[19:46:49] it keeps going but the screen is more or less full of "failed to call"
[19:47:09] it's polling
[19:47:12] 1/25 retries
[19:47:26] until it finds a puppet run after the reboot
[19:47:36] * volans doesn't recall if it has to be successful
[19:52:40] ok
[19:52:46] i think i should just manually reboot it
[19:53:09] when it's just a single host that is
[19:55:17] mutante: can you login on the host? is it rebooted? has puppet run? was it successful?
[19:56:52] volans: yes, i can. it was successful and already has 14 min uptime.
[19:58:22] the cookbook said "not all services are recovered" so i am looking at icinga and all is green there as well
[19:59:19] icinga has a high latency
[19:59:24] might have not recovered in time
[20:00:53] ack
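(Editor's note: the "[1/25, retrying in 10.00s]" messages at [19:42:41] are the cookbook polling, as volans says at [19:47:09]: it keeps retrying until the host reports a Puppet run newer than the reboot time, so the intermediate failures are expected. Below is a simplified stand-in for that polling pattern, not the actual spicerack implementation; the SSH command and the Puppet state-file path are assumptions.)

```python
#!/usr/bin/env python3
"""Poll a host until its last Puppet run is newer than the reboot time.

Sketch only: a stand-in for the cookbook's retry loop, not spicerack code.
"""
import subprocess
import time
from datetime import datetime, timezone

RETRIES = 25        # matches the 1/25 counter in the cookbook output
BASE_DELAY = 10.0   # matches "retrying in 10.00s"


def last_puppet_run(host):
    """Read the last Puppet run time from the agent's run summary (path is an assumption)."""
    out = subprocess.check_output(
        ["ssh", host,
         "awk '/last_run:/ {print $2}' /var/lib/puppet/state/last_run_summary.yaml"],
        text=True,
    ).strip()
    return datetime.fromtimestamp(int(out), tz=timezone.utc)


def wait_puppet_since(host, reboot_time):
    """Retry until a Puppet run newer than reboot_time is found, like the cookbook does."""
    for attempt in range(1, RETRIES + 1):
        try:
            run_time = last_puppet_run(host)
            if run_time >= reboot_time:
                return run_time
            raise RuntimeError("Puppet run too old")
        except (subprocess.CalledProcessError, RuntimeError) as exc:
            print(f"Failed check [{attempt}/{RETRIES}, retrying in {BASE_DELAY:.2f}s]: {exc}")
            time.sleep(BASE_DELAY)
    raise RuntimeError(f"no Puppet run after {reboot_time} on {host}")
```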