[09:03:54] fyi, I'm going to depool magru at 13:00 UTC for a Junos OS upgrade (cc jelto, vgutierrez)
[09:04:12] that's for the next shift then :)
[09:04:20] thanks for the heads up :D
[09:04:20] true!
[09:04:35] :)
[09:05:03] I can add it to corto
[10:51:15] I am restarting the various redis-servers dear oncallers
[10:51:39] I do not think anything will happen
[10:52:09] good luck, thank you 👍
[10:52:42] moritz made me do it, whatever comes, is his problem not mine
[13:01:16] arnoldokoth, jhathaway, starting the magru depool/maintenance
[13:03:15] Thanks
[13:14:05] XioNoX: ack. Ty.
[13:18:48] going to wait for the :20 mark to reboot cr1, that one shouldn't be impactful as the routers are redundant
[13:23:56] it's rebooting
[13:47:50] all good with cr1, rebooting cr2
[14:03:08] routers are both online and healthy, going to reboot asw1-b3 soon, that's going to be more impactful as 50% of the servers are connected there
[14:03:15] but they're all downtimed
[14:17:11] switch is back up
[14:37:44] and 2nd switch back up,
[14:39:07] moritzm: could you have a quick look at the Ganeti cluster to check if it's healthy? but afaik it's all good
[14:39:47] looking
[14:40:37] moritzm: as it's routed ganeti we could actually have moved all the VMs from one switch to the other before the work, I'll try to remember to do that for next time
[14:41:03] indeed
[14:41:10] looks all good
[14:41:27] thx
[14:57:26] magru work is done, just need to repool the site now (cc sukhe)
[14:57:54] XioNoX: thanks. checking stuff once.
[14:58:14] pooling DNS for magru first
[15:01:25] removing downtime for A:magru
[15:03:21] XioNoX: ready to pool if you are
[15:03:30] (go for it or let me know if you want to :)
[15:05:24] sukhe: go for it
[15:06:00] verified BGP on the DNS hosts as well, liberica looks happy too, proceeding
[15:07:06] done
[15:08:12] hmm
[15:08:37] ok picking up
[15:11:23] XioNoX: looks good, nice and clean <3 to celebrate this, I will have some bird patches your way this week to support IPv6-only-configs :P
[15:14:42] nice!
[16:27:46] Hi folks. I tried to decomm some nodes with sudo cookbook sre.hosts.decommission -t T404771 ms-be20[57-61].codfw.wmnet and it failed (after doing some initial work) with 'No hosts provided'. cf T404771
[16:29:31] like maybe it's done away with the puppet certs and now a subsequent check is failing?
[16:38:04] Emperor: o/ from the logs it seems it is trying to get info from puppetmaster1001, so probably some leftover, lemme check
[16:40:30] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1239986
[16:42:28] Emperor: ok if I run `test-cookbook -c 1239986 sre.hosts.decommission ms-be[2057-2061].codfw.wmnet -t T404771` to test it?
[16:44:44] (stepping afk in a bit, you can run it if you want later on and report back!)
[16:46:45] elukey: sorry, I went away to open T417670 to document the problem. Err, yes, please do go ahead (or I will tomorrow morning if not)
[16:46:45] T417670: sre.hosts.decommission fails with >1 host, leaves hosts impossible to decommission - https://phabricator.wikimedia.org/T417670
[16:49:27] yep it is working
[17:06:44] Emperor: done!
[17:07:14] (thanks volan*s for the quick review <3)
[17:08:26] anytime
[17:09:28] thanks :)