[14:09:26] volans: XioNoX: fyi i have submitted the rotat-pdu-snmp changeset
[14:11:50] jbond42: ack, feel free to test them and send any follow-up fix that might be needed
[14:12:23] not sure which target pdu might be a good candidate, check netbox for an empty/unimportant rack
[14:14:19] i have tested with a v3 and a v4 host already. v3 seems to work pretty fine. v4 sometimes fails to restart the pdu, which needs exploring. however it's probably in a good state to deploy; it would mean we can at least reset all the passwords, then just manually restart the v4's which failed
[14:15:10] is there a way to detect if they fail and maybe retry?
[14:15:19] or the reason why and avoid it ofc :D
[14:16:42] not that i have seen, the response comes back all good
[14:17:29] i'll check a different pdu as well, could be just the one i'm trying is temperamental
[14:18:38] any kind of uptime we could check?
[14:19:00] * volans has no idea what the PDU interface looks like
[14:19:18] you're lucky
[14:20:49] jbond42: yeah that's something that won't run often so if there is a handful of pdus that need a restart it's fine to do manually
[14:24:15] volans: there is actually a text file which has the uptime https://localhost:8443/CDU/summary.txt however the problem is then we have to wait around for the box to restart before moving on to the next object. I could perhaps create a separate small cookbook which just gets the uptime.
this could be run after the upgrade to see which ones haven't been restarted
[14:24:26] XioNoX: ack thanks
[14:27:00] i think we also need to keep the balance of getting the script out there so we can reset the current passwords vs having a perfect script
[14:29:01] jbond42: ack, I was not meaning that as a blocker
[14:29:19] but as a way to know where the rotation has not been done, because that's a liability if we're not aware
[14:29:45] volans: ack sure i'll follow up with sre.pdus.uptime script shortly
[14:30:08] as for waiting for the reboot, is there any risk of doing the next pdu while the previous is already rebooting?
[14:30:28] I'm thinking if we do the ones in the same rack sequentially one after the other
[14:30:39] XioNoX: or maybe someone from dc-ops is probably best to answer that
[14:30:48] but good point
[14:53:27] jbond42: you can check the uptime from librenms
[14:54:03] volans: no issues, it only restarts the onboard computer, not the power ports
[14:54:30] might be worth doing it sequentially just in case if it's no extra work
[14:58:31] ack i'll take a look at updating with some tests and checks. i also managed to get an uptime cookbook pretty quickly https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/604411. it's probably best testing directly to avoid any race conditions.
[14:58:37] I think it's already sequential, the only thing is that we issue the reboot and pass to the next one IIRC
[14:58:46] so we might reboot the next one while the prev is still rebooting
[15:02:58] if we're certain it can't affect the power ports in any way that's ok
[15:29:59] is there a doc/pad about the team's Q goals?
[15:31:56] also any idea what's up with https://phabricator.wikimedia.org/T250792 ?
[15:37:34] XioNoX: likely a result of the recent upgrade
[16:10:23] XioNoX: it looks like allow_embedding defaults to false and we have it unset. Is this recent breakage or a new dashboard in librenms?
[17:07:08] shdubsh: yeah it used to work and I opened the task when I noticed it "broke"
[17:11:45] I bolded some items that felt solidly in the "these are happening realm"
[17:11:56] so prioritize OKR wording for those
[17:12:40] (if it's not bold it doesn't mean that it's not happening necessarily though, just that it's on more shaky grounds :)
[17:13:33] k, thx
[17:14:55] if you disagree, lmk :)
[17:18:17] XioNoX: that's a bit surprising. this header should have been present since grafana 6.2.0 (May 2019)
[17:33:47] dunno :)
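
Note on the uptime check discussed above: the idea of running an uptime pass after the rotation, to spot PDUs that never restarted, can be sketched roughly as below. This is only an illustration, not the actual sre.pdus.uptime cookbook. The function name, the host names, and the assumption that uptime is already available as seconds per host (however it was fetched, for example from the summary text file or from librenms) are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def not_restarted_since(uptimes, rotated_at, now=None):
    """Return the hosts whose last boot predates the password rotation.

    uptimes    -- dict of host name -> uptime in seconds (assumed input shape)
    rotated_at -- timezone-aware datetime at which the rotation was run
    now        -- timezone-aware reference time, defaults to the current UTC time
    """
    now = now or datetime.now(timezone.utc)
    stale = []
    for host, uptime_seconds in uptimes.items():
        # Work out when the PDU's onboard computer last booted.
        booted_at = now - timedelta(seconds=uptime_seconds)
        # A boot time before the rotation means the restart did not happen.
        if booted_at < rotated_at:
            stale.append(host)
    return sorted(stale)


if __name__ == "__main__":
    # Hypothetical host names and uptimes, for illustration only.
    now = datetime(2020, 6, 10, 15, 0, tzinfo=timezone.utc)
    rotated_at = datetime(2020, 6, 10, 14, 0, tzinfo=timezone.utc)
    uptimes = {"pdu-a": 600, "pdu-b": 86400, "pdu-c": 3600}
    print(not_restarted_since(uptimes, rotated_at, now))
```

Running the check as a separate pass keeps the rotation itself from having to wait on each reboot, which matches the trade-off raised in the discussion; the hosts it returns are the ones to restart manually.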