[08:41:01] morning
[08:41:36] o/
[08:45:52] o/
[09:24:55] quick review, currently pushing images through cookbooks is broken (the auth in image-builder is set to .svc.t.o only; not urgent, but it should not be left like that for long) https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1148809/1
[09:26:30] dcaro: I wonder if the many k8s manifests we have will need to be updated to point to the new registry FQDN?
[09:26:51] yep, same for CI and puppet
[09:26:52] +1'd
[09:26:59] ex. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148808
[09:27:27] cool
[09:27:29] +1'd
[09:27:57] thanks, there's a bit of an ordering issue with puppet/CI, as there's an allowlist of repositories and such
[09:28:21] yeah
[09:30:00] hm... having multiple documents in a yaml file breaks pre-commit, I guess we don't run it in the CI of the repo? https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/blob/main/tofu-provisioning/tofu-provisioning.yaml?ref_type=heads
[09:30:53] unfortunately, that's how gitlab requires inputs to be defined. Not sure if they can be decoupled into a different file
[09:31:50] we don't seem to have any linting for that repo
[09:31:55] please review: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/38
[09:32:53] taavi: +1
[09:33:14] *linting via gitlab CI anyway
[09:33:52] what is the scheduler hint group `f535f4b8-8450-4ffd-82c8-e4eb8c9d4924`?
[09:34:14] is that an openstack uuid?
[09:34:23] the anti-affinity server group, most likely
[09:34:43] is that in tofu?
[09:34:51] yes
[09:36:31] ack
[09:37:01] defined inside the service module 👍
[09:38:04] minor problem: the volume https://horizon.wikimedia.org/project/volumes/7bd1ad05-63a9-453e-81a5-b8adf444ce08/ is stuck in the "detaching" state
[09:38:43] see if that's one of the few that andrew detected when draining cloudvirts?
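[Editor's note: the scheduler hint group asked about above is, per the answers, an anti-affinity server group defined in tofu. A minimal sketch of that pattern with the standard terraform-provider-openstack resources is below; all names and placeholder IDs are illustrative, not the actual service module contents.]

```hcl
# Anti-affinity server group: the scheduler spreads members across hypervisors.
resource "openstack_compute_servergroup_v2" "anti_affinity" {
  name     = "example-anti-affinity"
  policies = ["anti-affinity"]
}

resource "openstack_compute_instance_v2" "worker" {
  name      = "example-worker"
  image_id  = "placeholder-image-uuid"
  flavor_id = "placeholder-flavor-id"

  # A UUID like the one asked about above shows up here in the plan output:
  # it is the server group's id passed as a scheduler hint.
  scheduler_hints {
    group = openstack_compute_servergroup_v2.anti_affinity.id
  }
}
```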
[09:40:09] that volume was assigned to cloudcontrol1006, and cinder-volume.service was stopped on cloudcontrol1006
[09:40:14] i started it and that fixed the problem
[09:41:19] umm. I have some questions now
[09:41:37] how can a volume be "assigned" to an individual cloudcontrol?
[09:41:56] if puppet was not disabled, how was the service stopped?
[09:43:29] quick review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148808 , it will need some time to propagate to all the gitlab workers
[09:43:48] dcaro: already +1'd
[09:43:54] thanks
[09:44:13] dcaro: would that restart kubelet globally?
[09:44:27] arturo: I don't fully understand why, but there's a host field in the cinder database for each volume
[09:44:29] > host: cloudcontrol1006@rbd#RBD
[09:44:45] taavi: I thought it wouldn't, but let me double-check
[09:44:51] for the second question, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148815
[09:46:34] taavi: +1'd
[09:47:01] taavi: it would restart containerd, yep
[09:47:44] should be harmless, but let me do it in a controlled way to avoid hiccups
[09:48:13] maybe split the gitlab part into a separate patch so that we can roll those out separately?
[09:56:10] it went seamlessly in toolsbeta
[09:56:13] https://www.irccloud.com/pastebin/As0641ok/
[09:58:39] I think it's ok to roll it out on tools, I'll do that
[09:59:23] i guess the restart was short enough that kubernetes didn't see the node going down and start trying to restart / reschedule things?
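[Editor's note: a hedged sketch of how a volume stuck in "detaching" like the one above can be diagnosed. The `host` field shown in the chat identifies which cinder-volume service manages the volume, so checking that service is the first step. These are standard OpenStack/systemd commands, not a documented WMCS runbook; the UUID is the one from the chat, and the state reset is a last resort.]

```shell
# Which backend host manages this volume? (shown as os-vol-host-attr:host)
openstack volume show 7bd1ad05-63a9-453e-81a5-b8adf444ce08 \
    -c status -c os-vol-host-attr:host

# On the host reported there (cloudcontrol1006 in this case),
# check whether the managing service is actually running:
systemctl status cinder-volume.service
sudo systemctl start cinder-volume.service

# Last resort, only if the volume stays stuck after the service is healthy:
# force the state back (use with care, this bypasses cinder's state machine)
openstack volume set --state available 7bd1ad05-63a9-453e-81a5-b8adf444ce08
```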
[10:00:23] a containerd restart is really quick, and it does not stop any containers
[10:00:55] oh cool
[10:01:50] please review: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148819
[10:03:31] +1'd
[10:19:12] removed docker::registry_url from the hiera settings too, it had not been used since 2016 xd
[10:40:17] created T394902 to follow up on the many other places that need changes
[10:40:17] T394902: [k8s,infra] use the new docker-registry.svc.toolforge.org host everywhere - https://phabricator.wikimedia.org/T394902
[10:43:26] did the disabling of host-based auth in toolsbeta cause any issues? are we still good to also disable it for toolforge at large next?
[10:43:59] moritzm: I don't think we have detected any issues
[10:46:35] moritzm: I disabled it there as well yesterday
[10:47:20] ah, great! I'll prep a followup to remove the function from puppet later
[10:47:46] thank you!
[10:50:36] taavi: the probe alerts for tools legacy beta are expected, right? (from your patches to use IPv6 too)
[10:50:52] toolsbeta? huh, give me a second
[10:55:28] looks fixed :)
[10:55:30] * dcaro lunch
[12:54:50] yeah, false positive https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/40
[12:59:44] oops
[12:59:47] https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/40
[13:07:51] taavi: LGTM.
[13:23:11] I just tried to redeploy kyverno (with the new image path), and it timed out in the post-install hooks. I did not find anything weird, but I'm still looking, let me know if you find any issues
[13:29:14] as in using the newer repo url svc.t.o, not using a new version
[13:29:50] I can help you debug when I finish lunch
[13:45:07] Has someone already investigated the clouddumps 'puppet constant change' thing? Last time I checked, it went away as soon as I looked, but it seems to be back for real.
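[Editor's note: the claim above that a containerd restart does not stop containers can be checked directly on a worker node — container shims keep the workloads running while the daemon restarts, and kubelet reconnects to the CRI socket. A quick verification with standard containerd/cri-tools commands (nothing Toolforge-specific, assumes crictl is pointed at the containerd socket):]

```shell
# Note the running containers (and their creation times) before the restart
sudo crictl ps

sudo systemctl restart containerd

# The same containers are still listed with their original creation times:
# only the containerd daemon was restarted, not the workloads it manages.
sudo crictl ps
```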
[13:45:50] > that volume was assigned to cloudcontrol1006, and cinder-volume.service was stopped on cloudcontrol1006
[13:45:57] taavi, was puppet also stopped there? ^
[13:46:13] andrewbogott: no
[13:46:25] Hm, but puppet didn't start cinder-volume
[13:46:29] yes, fixed already
[13:46:29] that needs fixing
[13:46:35] ok :)
[13:46:46] for the clouddumps issue, I believe b.tullis was looking at that. not sure if there's a task for it already, ping dhinus
[13:47:24] basically it was a missing ensure => running in the service definition https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148815
[13:47:32] arturo: I think 'assigned' is an artifact from non-ceph setups where volumes are actually local to particular servers. It's silly, but there are some poorly-documented cli options to reassign things, which I haven't used much.
[13:47:58] taavi: did you already check if the other cinder services were wrong in the same way?
[13:48:34] i checked the other cinder services, but not any other openstack services
[13:48:50] ok. thank you for fixing!
[14:00:35] next up: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/41
[14:06:50] for the clouddumps alerts, I pinged btullis and created T394921
[14:06:50] T394921: PuppetConstantChange on clouddumps100[12] - https://phabricator.wikimedia.org/T394921
[14:07:03] thanks!
[14:07:47] thanks dhinus, I will continue to ignore it
[14:10:39] dcaro: is kyverno still struggling?
[14:11:34] arturo: it actually did not fail anything else, it just timed out when deploying, but it deployed all the images correctly, tests pass, and I see no errors in the logs, so I'm not sure
[14:11:51] do you know if the post hooks of the helm chart do something special?
[14:11:52] ok, I'm seeing it healthy too
[14:12:03] (I pasted the output of the deploy in the MR)
[14:12:28] I can read the hook
[14:12:31] do you have a link?
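[Editor's note: the root cause mentioned above — puppet not restarting a stopped cinder-volume — is a classic Puppet pitfall: a `service` resource without `ensure => running` does not manage whether the service is actually up. A minimal illustrative sketch of the kind of fix involved (not the actual operations/puppet code):]

```puppet
# Without "ensure => running", puppet will not start the service if an
# operator (or a crash) leaves it stopped; it only manages enablement.
service { 'cinder-volume':
  ensure => running,
  enable => true,
}
```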
[14:13:41] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/793#260628ffacc8d788d8a6d1ce95647860d7d67ffb
[14:14:46] that was locally, not in toolsbeta, right?
[14:15:15] anything interesting in `kubectl get event -n kyverno`?
[14:16:08] this is in tools
[14:16:13] oh
[14:16:14] nothing interesting in the events
[14:16:23] just normal things
[14:16:29] in toolsbeta it did not have issues
[14:59:19] still looking for a review on https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/41
[15:02:29] 👀
[15:03:49] taavi: why is the CI failing with ERROR: Job failed: exit code 99?
[15:04:30] that's tofu for "this will change something"
[15:04:31] dhinus: that's the special exit code that indicates the tofu diff is not a no-op, so gitlab can mark the job as a warning
[15:05:02] the logic is here: https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/blob/main/tofu-provisioning/tofu-provisioning.yaml?ref_type=heads#L75
[15:05:18] ah I see, thanks
[15:10:46] taavi: +1'd
[15:11:36] I think mixing import/moved block cleanups with other code changes may cause problems at some point, for example a revert that is no longer clean, or similar
[15:11:55] I would suggest doing import/moved block cleanups in separate patches
[15:12:15] yeah, good point
[15:15:01] looking at https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/commits/main?ref_type=heads the number and complexity of changes happening in the repo is, in my opinion, a testament to how much we were missing tofu for tools/toolsbeta.
[15:15:41] final thing from me today: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/42
[15:15:46] in my opinion, having this repo is a collective success for the team that will only enable smoother operations, and make your life easier
[15:17:51] also, the repo stands on the shoulders of giants... public openstack APIs, TLS for the openstack APIs, etc.
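[Editor's note: the exit-code-99 convention explained above can be sketched roughly as follows. `tofu plan -detailed-exitcode` exits 2 when the plan is not a no-op, and the CI job can remap that to a dedicated code that GitLab's `allow_failure: exit_codes` turns into a warning instead of a failure. This is a hedged reconstruction of the pattern, not the contents of the actual tofu-provisioning.yaml:]

```yaml
tofu-plan:
  script:
    # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present
    - tofu plan -detailed-exitcode -out=plan.out || ec=$?
    # remap "changes present" to 99 so GitLab shows the job as a warning
    - if [ "${ec:-0}" -eq 2 ]; then exit 99; fi
    - exit "${ec:-0}"
  allow_failure:
    exit_codes: 99
```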
Projects that also took significant engineering time to introduce
[15:19:51] arturo: yep, I was also very positively impressed by all the MRs created in the last few days!
[15:20:19] thanks arturo for pushing for this, and thanks chuckonwu for writing all of the import code!
[15:20:30] <3
[15:20:42] <3
[15:34:18] +1, the deployment I did yesterday also went pretty smoothly!
[15:48:20] chuckonwu: just updated the task about the warning message with some "tips", but feel free to ask for more guidance. I have ~10 min before the next meeting, but we can chat after or tomorrow too (or async in the task/irc/...)
[15:48:44] 👍
[17:14:22] * dcaro off
[17:14:25] cya tomorrow!
[17:15:25] taavi: random question, does libup only work with gerrit? (as in, can it be used for gitlab? it might be better than our silly poetry upgrade script)
[18:06:51] * dhinus offline
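[Editor's note: for reference, the "import/moved block cleanups" discussed earlier are a standard OpenTofu feature — declarative blocks that adopt existing resources into state or record renames, and which become dead weight after they are applied, hence the suggestion to clean them up in separate patches. A minimal illustrative sketch (hypothetical resource names and UUID, not the tofu-provisioning code):]

```hcl
# One-time adoption of a pre-existing OpenStack resource into tofu state;
# safe to delete once the apply has run.
import {
  to = openstack_compute_instance_v2.worker
  id = "00000000-0000-0000-0000-000000000000"  # hypothetical instance UUID
}

# Records a rename so tofu moves the state entry instead of
# destroying and recreating the resource.
moved {
  from = openstack_compute_instance_v2.old_worker
  to   = openstack_compute_instance_v2.worker
}
```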