[07:54:46] <godog>	 greetings
[07:59:10] <dcaro>	 morning
[08:01:12] <godog>	 good luck re: k8s upgrade
[08:03:35] <dcaro>	 :), should be ok, thanks anyhow though!
[08:09:53] <dhinus>	 morning
[08:15:04] <dcaro>	 \o, first issue, the functional tests are not running on the newer bastion
[08:15:23] <dcaro>	 it seems to have issues with recreating the venv for them
[08:15:25] <dcaro>	 https://www.irccloud.com/pastebin/0XBttrah/
[08:18:23] <dcaro>	 I think I found the issue
[08:18:33] <godog>	 hehe a beta environment already pays off to spot problems
[08:24:20] <dcaro>	 yep :), the issue is that trixie uses python 3.13, while the venv is created with python 3.11 (using `toolforge webservice shell`, as there's no venv module installed in the bastions)
[08:25:03] <dcaro>	 this makes me think that the deployment script is not using the newer bastion either (or it would have failed before)
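Note: the mismatch described above (a venv built with python 3.11 being reused on a trixie host running python 3.13) could be detected before the tests run. A minimal sketch, assuming the venv's `pyvenv.cfg` carries a `version = X.Y.Z` line as the stdlib `venv` module writes it; the helper names `venv_python_version` and `needs_rebuild` are hypothetical and not part of the functional tests or cookbooks.

```python
import sys


def venv_python_version(cfg_text):
    """Return (major, minor) from the 'version' key of a pyvenv.cfg, or None."""
    for line in cfg_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "version":
            parts = value.strip().split(".")
            return int(parts[0]), int(parts[1])
    return None


def needs_rebuild(cfg_text, running=None):
    """True when the venv was created by a different major.minor interpreter."""
    # Assumption: compare against the interpreter running this check
    # unless an explicit (major, minor) pair is passed in.
    running = tuple(running or sys.version_info[:2])
    return venv_python_version(cfg_text) != running
```

With the contents of the venv's `pyvenv.cfg` read into a string, `needs_rebuild(text)` would say whether the venv should be recreated rather than reused.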
[08:36:48] <dcaro>	 I ended up using the cookbooks, will handle the trixie bastion move after
[08:39:37] <dcaro>	 the cookbook upgrading the first control node failed
[08:39:39] <dcaro>	 looking
[08:40:27] <dhinus>	 dcaro: can you run the tests on the old bastion that is still on bookworm?
[08:41:03] <dcaro>	 yep, that's what the cookbook does, that's what I'm running (`wmcs.toolforge.run_tests`)
[08:42:42] <dcaro>	 hmm, the cookbook failed trying to fetch the nodes using `sudo -i kubectl get nodes --output=json --selector='kubernetes.io/hostname=toolsbeta-test-k8s-control-10'`, exit code 99
[08:42:46] <dcaro>	 running it now works
[08:43:27] <dhinus>	 ah I see, I thought you meant the upgrade cookbook, forgot about the run_tests cookbook
[08:44:05] <dcaro>	 I'll manually uncordon
[08:47:31] <dcaro>	 hmm... I suspect that the apiserver might have had a small blip and that might have made the kubectl get nodes fail once :/
[08:51:49] <dcaro>	 I added a note on the task, not sure what happened though
[08:54:06] <dhinus>	 yeah it didn't happen during the previous upgrade
[08:56:07] <dcaro>	 the api-server was unhealthy for a second while coming up with the new version, which might have made the kubectl command fail. I'll upgrade the rest and see if it happens again; if so, it might be worth adding an extra wait/check for it to finish coming up
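Note: the "extra wait/check" idea above could take the form of a small retry around the node lookup. A minimal sketch, not the cookbook's actual code: `retry` and `kubectl_get_nodes` are hypothetical helpers, the attempt count and delay are arbitrary, and the kubectl arguments are the ones quoted earlier in the log (without the `sudo -i` prefix).

```python
import subprocess
import time


def retry(func, attempts=3, delay=5.0, sleep=time.sleep):
    """Call func() until it succeeds or attempts run out, then re-raise."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
            if attempt < attempts:
                sleep(delay)  # give the api-server time to finish coming up
    raise last_error


def kubectl_get_nodes(hostname):
    """Hypothetical wrapper; raises CalledProcessError on a non-zero exit."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "--output=json",
         f"--selector=kubernetes.io/hostname={hostname}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Usage would look like `retry(lambda: kubectl_get_nodes("toolsbeta-test-k8s-control-10"))`, so a single api-server blip no longer aborts the run.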
[09:07:40] <dcaro>	 second controller worked ok, going for the third
[09:14:25] <dhinus>	 FYI there was a toolsbeta alert firing for 1 minute, not sure if it was supposed to be silenced
[09:15:38] <dcaro>	 oh, I added a silence
[09:15:43] <dcaro>	 (and the cookbook added another one too)
[09:16:03] <dcaro>	 we don't have history of those right?
[09:16:09] <dcaro>	 do you remember which alert it was?
[09:17:07] <dcaro>	 hmm... this is the one I created efe40a3f-2ce4-4528-a0d7-192dd08b995a, but it's not there anymore :/
[09:17:35] <dcaro>	 oh, I think it might have expired, as the `!ACK` thingie only renews the silence if any alert matches xd
[09:20:09] <dcaro>	 recreated it
[09:29:31] <dcaro>	 everything looks ok, I'll start with the workers
[09:30:53] <dcaro>	 do we have anything that can create tasks from the cli?
[09:31:10] <dcaro>	 (/me thinking about adding a 'create upgrade tasks' cookbook)
[09:38:08] <dcaro>	 non-nfs workers done, going with nfs
[09:40:40] <dhinus>	 the alert was "Toolforge HAproxy server down", we don't have history but we have #-cloud-feed :)
[09:42:19] <dcaro>	 👍
[09:42:41] <dhinus>	 creating tasks with a cookbook could be a nice idea, especially to automate the list of nodes
[09:43:17] <dhinus>	 the next step is a cookbook that does the upgrade and resolves the task :D
[09:49:03] <dcaro>	 once we have phabricator integration that should not be too hard either :)
[09:52:47] <dcaro>	 workers upgraded, will do a full run of the functional tests just in case
[10:09:32] <dcaro>	 where did test-cookbooks leave the spicerack logs?
[10:09:53] <dcaro>	 oh, it's in the --help, nice
[10:10:17] <dcaro>	 hmm... that dir does not exist
[10:10:55] <dcaro>	 oh, there's a typo
[10:11:58] <volans>	 your home
[10:12:11] <volans>	 cookbooks_testing/
[10:12:21] <dcaro>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1183625
[10:12:36] <volans>	 doh
[10:12:47] <volans>	 thanks
[10:12:48] <dcaro>	 volans: now that you are around :), is there any phabricator integration for cookbooks?
[10:13:18] <volans>	 https://doc.wikimedia.org/wmflib/master/api/wmflib.phabricator.html
[10:13:34] <volans>	 that is accessible via https://doc.wikimedia.org/spicerack/master/api/index.html#spicerack.Spicerack.phabricator
[10:13:56] <dcaro>	 nice, thanks!
[10:14:02] <volans>	 it's rather simple, and so far only what was needed has been added, which was mostly commenting on tasks
[10:14:21] <volans>	 not create/search/modify
[10:14:41] <volans>	 but can be expanded at will
[10:15:28] <dcaro>	 ack, might be a nice side-project
[10:18:25] <dcaro>	 okok, I'll do the ingress nodes after lunch. I also wrote some improvements for the worker upgrade cookbooks that I'm testing, no rush
[11:33:36] * dcaro trying the new cookbook changes for the upgrade
[11:47:14] <godog>	 I'm looking at the big spikes up in ceph dashboards, not sure if T403390 rings a bell?
[11:47:21] <stashbot>	 T403390: Investigate big spikes up in wmcs ceph dashboards - https://phabricator.wikimedia.org/T403390
[12:06:15] <dcaro>	 that seems to match the decommissions that happened that day
[12:13:07] <dcaro>	 nah
[12:13:40] <dcaro>	 it matches an undrain though
[12:21:48] <godog>	 case 2 right ?
[12:25:31] <dcaro>	 case 1, `14:05 	<andrew@cloudcumin1001> 	START - Cookbook wmcs.ceph.osd.undrain_node`
[12:26:02] <dcaro>	 from sal https://sal.toolforge.org/log/BIMm9pgB8tZ8Ohr0F9o5
[12:26:23] <godog>	 hah! indeed
[12:28:00] <dcaro>	 case 2 might match when the cookbook tweaked the weights or something during that run (red herring caution xd), not sure
[12:28:30] <godog>	 hehe yeah case 2 lines up with an mgr restart, haven't looked into what triggered it tho
[12:29:08] <godog>	 I'll update the description with your finding
[12:36:06] <dcaro>	 hmm, mgr restart, that's not part of the drain/undrain/bootstrap/destroy cookbooks
[12:37:47] <godog>	 mmhh ok then must have been manual
[12:38:12] <godog>	 anyways not a huge deal per-se, I'll keep looking
[12:40:10] <dcaro>	 it might have been a bug or crash or something
[12:40:25] <dcaro>	 it will restart itself
[12:59:56] <godog>	 true, I'll add that info as well
[13:00:56] <dcaro>	 added a note there, I vaguely remember tweaking the scrape interval, and the mgr being overwhelmed by the prometheus scraping, but it was a long time ago
[13:01:24] <godog>	 ack, thank you !
[13:52:56] <dcaro>	 got some patches improving the tools upgrade process (batching workers/ingresses for now) https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1183682
[14:25:47] <dcaro>	 upgrade finished in toolsbeta \o/
[14:26:20] <dcaro>	 (I took my time writing the cookbooks and double checking things xd)
[14:27:34] <godog>	 neat! I'm taking a look at the new cookbook
[14:29:02] <dcaro>	 the last patch is failing ci, will fix
[16:56:08] <dcaro>	 quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/disable-tool/-/merge_requests/23
[16:56:40] <dcaro>	 I'm moving things out of bullseye CI images so we can stop updating those (I have to update them now for the monthly pre-commit updates)
[17:16:32] <dhinus>	 dcaro: +1d
[17:16:39] <dcaro>	 thanks!
[17:16:51] * dhinus off
[17:17:17] * dcaro off