[08:41:01] <dcaro>	 morning
[08:41:36] <taavi>	 o/
[08:45:52] <arturo>	 o/
[09:24:55] <dcaro>	 quick review, currently pushing images through cookbooks is broken (the auth in image-builder is set to .svc.t.o only, not urgent but should not be left like that for long) https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1148809/1
[09:26:30] <arturo>	 dcaro: I wonder if the many k8s manifests we have will need to update to point to the new registry FQDN ?
[09:26:51] <dcaro>	 yep, same for ci and puppet
[09:26:52] <arturo>	 +1'd
[09:26:59] <dcaro>	 ex. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148808
[09:27:27] <arturo>	 cool
[09:27:29] <arturo>	 +1'd
[09:27:57] <dcaro>	 thanks, there's a bit of an order issue with puppet/ci as there's an allowlist of repositories and such
[09:28:21] <arturo>	 yeah
[09:30:00] <dcaro>	 hm... having multiple documents in a yaml breaks pre-commit, I guess we don't run it in the ci of the repo? https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/blob/main/tofu-provisioning/tofu-provisioning.yaml?ref_type=heads
[09:30:53] <arturo>	 unfortunatelly, that's how gitlab requires inputs to be defined. Not sure if they can be decoupled on a different file
[09:31:50] <arturo>	 we don't seem to have any linting for that repo
[09:31:55] <taavi>	 please review: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/38
[09:32:53] <arturo>	 taavi: +1
[09:33:14] <arturo>	 *linting via gitlab CI anyway
[09:33:52] <dcaro>	 what is the scheduler hint group `f535f4b8-8450-4ffd-82c8-e4eb8c9d4924` ?
[09:34:14] <arturo>	 is that an openstack uuid=
[09:34:15] <arturo>	 ?
[09:34:23] <taavi>	 the anti-affinity server group most likely
[09:34:43] <dcaro>	 is that in tofu?
[09:34:51] <taavi>	 yes
[09:36:31] <dcaro>	 ack
[09:37:01] <dcaro>	 defined inside the service module 👍
[09:38:04] <taavi>	 minor problem: the volume https://horizon.wikimedia.org/project/volumes/7bd1ad05-63a9-453e-81a5-b8adf444ce08/ is stuck in "detaching" state
[09:38:43] <arturo>	 see if that's one of the few that andrew detected when draining cloudvirts?
[09:40:09] <taavi>	 that volume was assigned to cloudcontrol1006, and cinder-volume.service was stopped on cloudcontrol1006
[09:40:14] <taavi>	 i started it and that fixed the problem
[09:41:19] <arturo>	 umm. I have some questions now
[09:41:37] <arturo>	 how can a volume be "assigned" to an individual cloudcontrol?
[09:41:56] <arturo>	 if puppet was not disabled, how was the service stopped?
[09:43:29] <dcaro>	 quick review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148808 , will need some time to propagate to all the gitlab workers
[09:43:48] <arturo>	 dcaro: already +1d
[09:43:54] <dcaro>	 thanks
[09:44:13] <taavi>	 dcaro: would that restart kubelet globally?
[09:44:27] <taavi>	 arturo: don't fully understand why, but there's a host field in the cinder database for each volume
[09:44:29] <taavi>	 >                        host: cloudcontrol1006@rbd#RBD
[09:44:45] <dcaro>	 taavi: I thought it wouldn't, but let me double check
[09:44:51] <taavi>	 for the second, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148815
[09:46:34] <arturo>	 taavi: +1d
[09:47:01] <dcaro>	 taavi: it would restart containerd yep
[09:47:44] <dcaro>	 should be harmless, but let me do it controlledly to avoid hiccups
[09:48:13] <taavi>	 maybe split the gitlab part to a separate patch so that we can roll those out separately?
[09:56:10] <dcaro>	 went seamless in toolsbeta
[09:56:13] <dcaro>	 https://www.irccloud.com/pastebin/As0641ok/
[09:58:39] <dcaro>	 I think it's ok to rollout on tools, I'll do
[09:59:23] <taavi>	 i guess the restart was short enough that kubernetes didn't see the node going down and started trying to restart / reschedule things?
[10:00:23] <dcaro>	 containerd restart is really quick, and does not stop any containers
[10:00:55] <taavi>	 oh cool
[10:01:50] <taavi>	 please review: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148819
[10:03:31] <dcaro>	 +1d
[10:19:12] <dcaro>	 removed docker::registry_url from the hiera settings too, it was not used since 2016 xd
[10:40:17] <dcaro>	 created T394902 to follow up on the many other places that need changes
[10:40:17] <stashbot>	 T394902: [k8s,infra] use the new docker-registry.svc.toolforge.org host everywhere - https://phabricator.wikimedia.org/T394902
[10:43:26] <moritzm>	 did the disabling of host-based auth in tools-beta cause any issues? are we still good to also disable it for toolforge at large next?
[10:43:59] <arturo>	 moritzm: I don't think we have detected any issues
[10:46:35] <taavi>	 moritzm: I disabled it there as well yesterday
[10:47:20] <moritzm>	 ah, great! I'll prep a followup to remove the function from puppet later
[10:47:46] <taavi>	 thank you!
[10:50:36] <dcaro>	 taavi: the probe alerts for tools legacy beta is expected right? (from your patches to use ip6 too)
[10:50:52] <taavi>	 toolsbeta? huh, give me a second
[10:55:28] <dcaro>	 looks fixed  :)
[10:55:30] * dcaro lunch
[12:54:50] <taavi>	 yeah, false positivehttps://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/40
[12:59:44] <taavi>	 oops
[12:59:47] <taavi>	 https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/40
[13:07:51] <arturo>	 taavi: LGTM.
[13:23:11] <dcaro>	 I just tried to redeploy kyverno (with the new imaga path), and it timed out in the post-install hooks, I did not find anything weird, but I'm looking, let me know if you find any issues
[13:29:14] <dcaro>	 as in using the newer repo url svc.t.o, not using a new version
[13:29:50] <arturo>	 I can help you debug when I finish lunch time
[13:45:07] <andrewbogott>	 Has someone already investigated the clouddumps 'puppet constant change' thing? Last time I checked it went away as soon as I looked but it seems to be back for real.
[13:45:50] <andrewbogott>	 > that volume was assigned to cloudcontrol1006, and cinder-volume.service was stopped on cloudcontrol1006
[13:45:57] <andrewbogott>	 taavi, was puppet also stopped there? ^
[13:46:13] <taavi>	 andrewbogott: no
[13:46:25] <andrewbogott>	 Hm, but puppet didn't start cinder-volume
[13:46:29] <taavi>	 yes, fixed already
[13:46:29] <andrewbogott>	 that needs fixing
[13:46:35] <andrewbogott>	 ok :)
[13:46:46] <taavi>	 for the clouddumps issue, I believe b.tullis was looking at that. not sure if there's a task for it already, ping dhinus 
[13:47:24] <taavi>	 basically it was a missing ensure => running from the service definition https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148815
[13:47:32] <andrewbogott>	 arturo: I think 'assigned' is an artifact from non-ceph setups where volumes are actually local on particular servers. It's silly but there are some poorly-documented cli options to reassign things which I haven't used much.
[13:47:58] <andrewbogott>	 taavi: did you already check if the other cinder services were wrong in the same way?
[13:48:34] <taavi>	 i checked the other cinder services, but not any other openstack services
[13:48:50] <andrewbogott>	 ok. thank you for fixing!
[14:00:35] <taavi>	 next up: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/41
[14:06:50] <dhinus>	 for the clouddumps alerts, I pinged btullis and created T394921
[14:06:50] <stashbot>	 T394921: PuppetConstantChange on clouddumps100[12] - https://phabricator.wikimedia.org/T394921
[14:07:03] <taavi>	 thanks!
[14:07:47] <andrewbogott>	 thanks dhinus, I will continue to ignore
[14:10:39] <arturo>	 dcaro: is kyverno still struggling?
[14:11:34] <dcaro>	 arturo: it actually did not fail anything else, just timed out when deploying, but it deployed all the images correctly, tests pass, I see no errors in the logs, so not sure
[14:11:51] <dcaro>	 do you know if the post hooks of the helm chart do something special?
[14:11:52] <arturo>	 ok, I'm seeing it healthy too
[14:12:03] <dcaro>	 (I pasted the output of the deploy in the MR)
[14:12:28] <arturo>	 I can read the hook
[14:12:31] <arturo>	 do you have a link?
[14:13:41] <dcaro>	 https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/793#260628ffacc8d788d8a6d1ce95647860d7d67ffb
[14:14:46] <taavi>	 that was locally, not in toolsbeta, right?
[14:15:15] <taavi>	 anything interesting in `kubectl get event -n kyverno`?
[14:16:08] <arturo>	 this is in tools
[14:16:13] <taavi>	 oh
[14:16:14] <arturo>	 nothing interesting in the events
[14:16:23] <arturo>	 just normal things
[14:16:29] <dcaro>	 in toolsbeta it did not have issues
[14:59:19] <taavi>	 still looking for a review on https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/41
[15:02:29] <arturo>	 👀
[15:03:49] <dhinus>	 taavi: why is the CI failing with ERROR: Job failed: exit code 99?
[15:04:30] <taavi>	 that's tofu for "this will change something"
[15:04:31] <arturo>	 dhinus: that's the special exit code that indicates the tofu diff is not a NOOP, so gitlab can print the job as warning
[15:05:02] <arturo>	 logic is here: https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/blob/main/tofu-provisioning/tofu-provisioning.yaml?ref_type=heads#L75
[15:05:18] <dhinus>	 ah I see, thanks
[15:10:46] <arturo>	 taavi: +1d
[15:11:36] <arturo>	 I think mixing imports/moved block cleanups with other code changes may cause problems at some point, for example a revert that is no longer clean, or similar
[15:11:55] <arturo>	 I would suggest doing import/move block cleanups in separate patches
[15:12:15] <taavi>	 yeah, good point
[15:15:01] <arturo>	 looking at https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/commits/main?ref_type=heads the number and complexity of changes happening on the repo, in my opinion is a testament to how deep we were missing tofu for tools/toolsbeta.
[15:15:41] <taavi>	 final thing from me today: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/42
[15:15:46] <arturo>	 in my opinion, having this repo is a collective success for the team that will only enable smoother operations, and make your life easier
[15:17:51] <arturo>	 also, the repo stands on the shoulder of giants... public openstack APIs, TLS for the openstack APIs, etc. Projects that also took a significant engineering time to introduce
[15:19:51] <dhinus>	 arturo: yep I was also very positively impressed by all the MRs created in the last few days!
[15:20:19] <dhinus>	 thanks arturo for pushing for this, and thanks chuckonwu for writing all of the import code!
[15:20:30] <arturo>	 <3
[15:20:42] <chuckonwu>	 <3
[15:34:18] <dcaro>	 +1 the deployment I did yesterday also went pretty smoothly!
[15:48:20] <dcaro>	 chuckonwu: just updated the task of the warning message with some "tips", but feel free to ask for more guidance, I have ~10min before the next meeting but we can chat after or tomorrow too (or async in the task/irc/...)
[15:48:44] <chuckonwu>	 👍
[17:14:22] * dcaro off
[17:14:25] <dcaro>	 cya tomorrow!
[17:15:25] <dcaro>	 taavi: random question, libup only works for gerrit? (as in, can it be used for giltab? might be better than our silly poetry upgrade script)
[18:06:51] * dhinus offline