[03:09:40] If anyone wants to start their day with some reboots, there are a few hosts left to go on T426563
[03:10:47] ...and more than a handful on T426560
[03:25:39] thank you andrewbogott
[07:36:10] given that there isn't anyone around yet I'll start with the cumin hosts if no one has objections
[07:39:01] greetings
[07:39:04] volans: +1
[07:39:19] ok starting with 2001, will do 1001 in a few and then we can continue with the rest
[07:44:27] 2001 done, going for 1001
[07:52:23] cloudcumin1001 done too, ofc keyholder rearmed on both and tested that cumin works
[08:08:13] I'll take care of the rabbit clusters
[08:14:54] ack, for a graceful shutdown (i.e. rabbit server gets shut down) there should be nothing special to do
[08:15:20] I'm following https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Maintenance#cloudrabbitXXXX
[08:16:00] * godog nods
[08:41:57] I'll do cloudnet, codfw first then eqiad
[08:52:34] k
[09:11:45] cloudrabbit clusters done
[09:12:47] do we have a procedure for cloudcephmon? I can't find one on wikitech
[09:13:23] dhinus: the clouddb* in T426563 is something you would normally take care of or we (infra) should?
[09:13:45] I can do it, there are also some pending reboots from before
[09:14:10] pending reboots: T422527
[09:14:11] T422527: [wikireplicas] Upgrade clouddbs to 10.11.16 - https://phabricator.wikimedia.org/T422527
[09:15:53] volans: not as far as I'm aware, though sth like "one at a time, wait for ceph to be happy, move on" I'd guess
[09:17:11] I have the same guess, but not sure if there is anything specific to check/do before/after each one
[09:18:14] mmh maybe 'ceph health' is enough
[09:19:56] dhinus: ack, thx
[09:20:02] maybe wmcs.ceph.reboot_node does the right thing already
[11:13:57] * volans lunch, tasks updated with the reboots I've done, so anyone feel free to pick it up from the task
[11:39:50] i couldn't unsee this after first noticing: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1289319
[11:53:36] heheheh
[12:19:40] lol
[12:32:38] * volans resuming some cloudvirt reboots (from 1044)
[12:33:14] are we not using the cookbook option to reboot all of them in sequence?
[12:33:54] there is a warning on wikitech that says it will fail if a VM cannot be migrated and then is a bit of a mess
[12:34:07] also doesn't have any flexibility, it's all or 1, no fine-grained selection
[12:35:03] it should now prompt you what to do if migration fails, iirc
[12:35:47] is there a way to start it skipping the hosts already done? IMHO it should have either a check of last reboot before X OR checking the kernel version
[12:41:50] T419967 heh
[12:41:50] T419967: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967
[12:42:02] surprise, we've been here not even a month ago
[12:42:31] https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1289335
[12:45:46] taavi: validate_version is in wmcs_libs/k8s/kubernetes.py afaics
[12:46:48] we don't need that, we can just use packaging.version.Version
[12:47:14] validate_version should be ditched entirely imho
[12:48:33] godog: see the parent patch
[12:53:28] got it, ok!
[12:54:24] I don't feel strongly one way or another tbh, I do feel the less code we have the better though
[12:55:00] volans: sure but validate_version() is something we're already using everywhere else, I'd rather not block this on that
[12:58:44] as you want
[13:15:46] 15:11:55 alerts: commands[0]> pytest -vvvv modules/alerts/files
[13:15:46] 15:11:55 alerts: exit 2 (0.01 seconds) /srv/workspace/puppet> pytest -vvvv modules/alerts/files
[13:15:49] 15:11:55 alerts: FAIL ✖ in 2.46 seconds
[13:15:49] ... and ?
[13:17:41] https://docs.pytest.org/en/stable/reference/exit-codes.html "Exit code 2: Test execution was interrupted by the user"?
[13:19:06] heh yeah I think I got it, when I added 'deps' the parent env deps stopped being inherited
[13:19:35] ahhhh
[13:19:50] so adding it to the container directly would have worked as well then?
[13:20:23] I think so yeah, tox/venv is probably so old in bullseye that it still defaults to showing system packages
[13:20:30] your solution via tox.ini is the right one tho
[13:28:36] Good morning! I'm around for a bit but then need to go renew my driver's license.
[13:29:03] I'm happy to take on another round of reboots if there are any left by the end of the euro workday
[13:31:22] ack thank you andrewbogott, I'm sure there will be some left
[14:44:37] * andrewbogott -> DMV
[17:14:14] I'm back! In case anyone else is still around... is there a background cookbook running someplace that's cycling cloudvirts or should I take the list on T426560 as my starting point?
[17:21:47] andrewbogott: we added a flag to the reboot all cloudvirts cookbook to exclude the upgraded kernel version, so in theory now we can just launch that cookbook and let it do its thing in the background instead of doing them one by one
[17:24:40] that's great! Is that change merged already? And is that cookbook now doing that, in the background someplace? Or shall I start it now?
[17:26:08] merged but not yet tested. https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Maintenance#cloudvirtXXXX should have the correct command, 6.12.88 is the fixed kernel
[17:26:26] cool, I will give it a go!
[17:29:39] the cookbook does some string matching so I can just say "--exclude-kernel 6.12.88" and not "--exclude-kernel 6.12.88+deb13-amd64" ?
[17:30:37] '6.12.88' is the syntax of the puppet fact it checks against
[17:31:50] facter! ok.
[17:45:22] taavi: I'm having a hard time finding docs for argparse taking 'validate' as an arg to add_argument -- maybe you're expecting a newer version of argparse than we have on the cumin nodes?
[17:45:37] "TypeError: _AppendAction.__init__() got an unexpected keyword argument 'validate'"
[17:46:29] bah, should be type= not validate=
[17:46:33] not sure how I missed that
[17:49:00] https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1289400
[17:49:38] that should do it
[18:00:32] cookbook launches now, it's starting with 1044 which seems like a good sign
[22:19:52] https://disabled-tools.toolforge.org/ is showing a lot of tools that have been disabled for over 40 days and still not archived. I haven't poked around to figure out what broke.
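Note on the argparse error at 17:45: add_argument() has no validate= keyword, so unknown keywords are passed on to the action class, which is why the TypeError names _AppendAction. Validation normally goes through type=, which takes a callable that converts or rejects each value. Below is a minimal sketch of that pattern, not the actual cookbook code. Only the --exclude-kernel flag name and the append action come from the log; the kernel_version() helper, its X.Y.Z regex and the parser name are hypothetical and only illustrate the idea.

```python
import argparse
import re

# Assumption: the value is a bare upstream version such as "6.12.88".
# The real cookbook may accept other formats; this regex is illustrative only.
_KERNEL_RE = re.compile(r"\d+\.\d+\.\d+")


def kernel_version(value: str) -> str:
    """Hypothetical validator: return value unchanged if it looks like X.Y.Z."""
    if not _KERNEL_RE.fullmatch(value):
        raise argparse.ArgumentTypeError(
            f"expected a version like 6.12.88, got {value!r}"
        )
    return value


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a repeatable, validated --exclude-kernel option."""
    parser = argparse.ArgumentParser(prog="reboot-sketch")  # hypothetical name
    parser.add_argument(
        "--exclude-kernel",
        action="append",      # repeatable flag, values collected in a list
        default=[],
        type=kernel_version,  # validation hook; validate= is not accepted
        help="skip hosts already running this kernel version",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--exclude-kernel", "6.12.88"])
    print(args.exclude_kernel)
```

With type= in place, a rejected value makes argparse print a usage error and exit with status 2 instead of raising a TypeError when the parser is built.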