[07:20:14] morning
[07:26:44] morning
[07:35:22] morning
[07:41:40] for some reason tekton stopped working on lima-kilo for me yesterday (started from scratch twice already), with that AppArmor message above, looking
[07:55:31] dcaro: it could be PSP, it was recently enabled in lima-kilo
[08:02:14] I think it's a combination of that + apparmor inside the kind running inside lima-kilo
[08:40:58] Good morning. Would someone possibly be able to have a quick look at this Ceph-related patch for me please, if it's not inconvenient? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1025428
[08:41:23] 👀
[08:41:42] It should be a no-op on your clusters until they are reimaged, at which point the ceph user and group will be assigned a static uid/gid value. Thanks.
[08:41:48] have you tested it? I suspect that there might be other dependencies on that order outside that module
[08:43:07] I have not tested it, but I'm currently reimaging cephosd100[1-5] several times so I should be able to test it on our cluster pretty quickly and revert if anything crops up.
[08:43:23] ack
[08:45:32] gerrit says there's no blame info for the origin of that require xd https://gerrit.wikimedia.org/r/c/operations/puppet/+/583964/5/modules/ceph/manifests/config.pp
[08:46:16] found it https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/0b1f1a27f621a681e740c02bea5db09f9551d95b%5E%21/#F0
[08:46:49] sounds like the intention was to not use the user created by the package, and probably not relevant anymore
[08:48:06] i have vague memories of something having issues with user ids in the 60000-64999 range (https://wikitech.wikimedia.org/wiki/UID#UID_ranges), but iirc we fixed that?
[08:48:26] (that issue is what the commit message is likely referring to)
[08:54:47] Great, many thanks both. The main thing is that I didn't want to add a `uid => 64045` to the existing user block, otherwise that would have changed the ownership on your clusters, which would probably have been bad. This way it should only take effect on reimages.
[08:55:04] I mean, we have some nodes running with the high UID without issues (so far), so I think it might be fixed
[08:56:13] btullis: yep, that makes sense, appreciated, I have a fleet-wide reimage pending that will sort that out eventually
[08:59:20] Great! I also have one niggly bug to sort out in https://phabricator.wikimedia.org/T332987 which causes our mon servers to fail on first puppet run. I may send some patches to try to address this.
[08:59:39] However, I'll put conditionals in so that they should again be a noop on your clusters, at least until we have tested them on ours.
[09:02:18] ack thanks
[09:28:43] quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/43
[10:24:18] dcaro: LGTM
[10:28:14] hmmm, on the generated merged api, the jobs api seems to mess with the top-level organization of the shown docs: https://usercontent.irccloud-cdn.com/file/naDY22bq/image.png
[10:29:01] those are the tags I think
[10:29:04] that would be the tags for operation grouping
[10:29:50] you may want to filter out tags in the merged document
[10:29:52] I'm not sure it helps much using that grouping at the top level
[10:30:18] if I remove them from the merged document, they make no sense in the non-merged one either, no?
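A minimal sketch of the tag filtering discussed just above, assuming the merged OpenAPI document is available as a plain Python dict; the function name and handling are illustrative, not the actual builds-builder code:

```python
def strip_tags(merged_spec: dict) -> dict:
    """Remove operation tags so they don't drive the top-level grouping of the merged docs."""
    merged_spec.pop("tags", None)  # top-level tag definitions, if any
    for path_item in merged_spec.get("paths", {}).values():
        for operation in path_item.values():
            # path items can also hold non-operation keys (e.g. "parameters"), hence the check
            if isinstance(operation, dict):
                operation.pop("tags", None)
    return merged_spec
```

Stripping them only from the merged document would leave the individual per-service specs grouped as before.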
[10:31:04] correct, but you may want to filter them anyway, just in case they are reintroduced in the future
[10:31:40] sure
[10:39:52] dokploy, another "alternative" to heroku https://github.com/Dokploy/dokploy
[10:44:04] cool, the repo is two weeks old xd
[10:50:16] I would like someone to not only review https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/19 but run it on their laptop. A user with Linux + a user with MacOSX would be ideal, cc dhinus
[10:59:27] * dcaro lunch
[10:59:32] I can take a look after lunch
[10:59:38] thanks
[11:34:26] As promised, here is a patch for ceph mon bootstrapping that should be a noop for your clusters, but I'd appreciate a review to be on the safe side: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1025724
[11:45:04] btullis: LGTM. Please also collect +1 from dcaro
[11:51:35] Ack, will do. Thanks.
[12:07:54] thanks!
[12:14:55] arturo: I'm testing https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/19
[12:15:07] dhinus: thanks!
[12:18:32] I'm getting some ansible errors updating my existing VM, if you know of a quick fix, otherwise I'll just destroy and recreate
[12:18:37] I think it's related to the hostname change
[12:18:56] "provided hosts list is empty, only localhost is available. Note that the implicit localhost does not match 'all'"
[12:19:03] "ERROR! the role 'basic_system' was not found"
[12:19:37] yes, recreate from scratch
[12:19:46] ok!
[12:21:38] dhinus: for completeness, I left the test steps here:
[12:21:39] https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/19#note_80400
[12:21:45] thanks
[12:31:34] btullis: gave it a look, I think there's one issue (it's missing changing the relationships), but I also suggested a different approach, let me know if you like it. It will require some changes also on the cloud manifests to adapt, though they should have no changes
[12:31:51] in any case, no need to take me up on the refactor :), fixing the relationship might be enough
[12:42:18] dcaro: Thanks, I like the look of it. I will have a go at that now.
[12:57:00] arturo: I got `FATA[0000] unknown flag: --tty `
[12:57:08] I think my lima-kilo version might not be the same
[12:57:15] *limactl
[13:00:05] downloading a newer one...
[13:23:59] I think that the ceph auth refactor is too tricky for today after all, so I'm just going to focus on the small test, if that's OK.
[14:35:59] btullis: sure, I'm out tomorrow though, so if you want my review it will come on Thursday
[14:36:06] (/me just out of a meeting)
[14:36:25] arturo: I get this
[14:36:28] https://www.irccloud.com/pastebin/oL0ox4T8/
[14:36:31] looking
[14:36:49] the venv should be created by the Dockerfile for the test environment
[14:37:13] https://www.irccloud.com/pastebin/FQmcCiVA/
[14:37:19] it's not there, let me see the docker build logs
[14:38:10] I see it ran ok, though maybe it's in a different directory?
[14:38:12] https://www.irccloud.com/pastebin/WJtK3Wbb/
[14:38:36] can you check the image being used by the deployment?
[14:39:43] it's using the image
[14:39:44] image: toolsbeta-harbor.wmcloud.org/toolforge/maintain-kubeusers:image-0.0.125-20240425144954-bd3c5d5c
[14:39:53] yeah, so that's the problem
[14:40:01] somehow the local deploy.sh script is not updating it
[14:43:28] I mean, it should be using a local image, not the one from harbor
[14:43:34] oh, I ran it again and now it worked :/
[14:43:42] (running pytest)
[14:43:52] well, that's something!
[14:44:36] hmmm.... maybe it needs two runs? xd
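For reference, one way to check which image a deployment is actually running, sketched with the official kubernetes Python client; the namespace and deployment name used here are assumptions for illustration, not taken from the chat:

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster
apps = client.AppsV1Api()

# Using "maintain-kubeusers" as both deployment name and namespace is an assumption.
deploy = apps.read_namespaced_deployment("maintain-kubeusers", "maintain-kubeusers")
for container in deploy.spec.template.spec.containers:
    print(container.name, container.image, container.image_pull_policy)
```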
[14:45:07] some tests fail, is that expected?
[14:45:17] https://www.irccloud.com/pastebin/ZHwTBHEo/
[14:45:21] tests/test_api.py:442: AssertionError
[14:45:46] unexpected, that sounds like old code to me
[14:46:39] it has the script, I updated after lunch
[14:46:45] let me re-fetch
[14:46:56] unrelated: metricsinfra-puppetserver seems to be lagging a few weeks behind, puppet-git-sync is failing (cc taavi)
[14:47:13] I'm on latest for your branch
[14:51:47] btullis: added a note on how to fix the error you found, for whenever the refactor comes
[14:51:58] dcaro: thanks, I'll follow up in a bit
[14:53:00] (just in case, the hash is `commit a335ff50d9931e2638db3a0af681cc2685f39c59 (HEAD -> arturo-25-readme-refresh-test, origin/arturo-25-readme-refresh-test)`)
[14:56:22] dhinus: ok, do you need help fixing that?
[14:56:29] (and why did that not alert?)
[14:58:15] yeah I was also expecting an alet
[14:58:17] arturo: I suspect it might be related to setsid
[14:58:18] *alert
[14:59:33] maybe not xd, it did deploy ok now again
[14:59:52] taavi: it's not just a rebase conflict, I'm looking at the logs but I'm not sure what the problem is
[15:00:17] I have a meeting now, if you want to have a look, otherwise it's not urgent
[15:03:44] Apr 30 15:03:19 metricsinfra-puppetserver-1 git-sync-upstream[1276045]: stderr: 'fatal: Unable to create temporary file '/srv/git/operations/puppet/.git/objects/pack/tmp_pack_XXXXXX': No space left on device
[15:03:48] from the logs of the service
[15:04:41] seems like a cinder/ceph issue
[15:04:42] https://www.irccloud.com/pastebin/fBah6brm/
[15:05:11] https://www.irccloud.com/pastebin/PDlGHhsy/
[15:08:36] hmm... horizon says that the volume is mounted on /dev/sdb
[15:08:38] https://usercontent.irccloud-cdn.com/file/suS0Dsef/image.png
[15:10:31] I'll reboot, I tried detaching and reattaching but that just borked the mount
[15:10:58] did it run out of inodes or something?
[15:11:52] yep
[15:11:53] https://www.irccloud.com/pastebin/B5GoCDOV/
[15:11:54] xd
[15:15:49] any chance anyone knows how to change the inode allocation on the fly?
[15:16:37] (I guess that reformatting would be doable though)
[15:16:57] we might want to change that wherever we create puppet servers, andrewbogott ?
[15:16:58] I have no idea
[15:17:48] what the heck? Why using so many inodes?
[15:18:02] It should be a vanilla 5Gb cinder volume
[15:18:07] that's a good question too xd
[15:18:24] though it's 500k inodes in total
[15:18:46] taking into account full git repo clones, + probably some puppetmaster shenanigans, it might not be so crazy
[15:18:59] is there a 'du -hs' for inodes instead of space?
[15:19:25] yes!
[15:19:26] du --inodes -xS
[15:20:16] are you running it already?
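A quick way to confirm the out-of-inodes diagnosis from Python, mirroring what `df -i` reports; the /srv/git path is taken from the error above, but any path on the affected filesystem would do:

```python
import os

def inode_usage(path: str = "/srv/git") -> str:
    """Report inode usage for the filesystem containing *path*, similar to `df -i`."""
    st = os.statvfs(path)
    used = st.f_files - st.f_ffree
    pct = 100 * used / st.f_files if st.f_files else 0.0
    return f"{path}: {used}/{st.f_files} inodes used ({pct:.0f}%)"

print(inode_usage())
```

As for changing the allocation "on the fly": on ext4 the inode count is fixed when the filesystem is created (mkfs options like -N or -i), so short of growing or recreating the filesystem it cannot be changed later.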
[15:20:22] yep
[15:20:33] with -s too (for summary, instead of each and every file)
[15:20:47] i will call that a successful nerd snipe
[15:21:03] xd
[15:21:51] https://www.irccloud.com/pastebin/nYRXTEJ5/
[15:22:47] so that's the deploy script doing something broken
[15:23:46] dcaro: I'm going to rm some things so there's room to move
[15:23:48] there's 28 of those oot_branch*/hieradata/hosts
[15:24:01] yeah, and only 4 branches in the actual git repo
[15:24:50] then there's a bunch of oot_branch*vendor_modules too with ~130 files each
[15:25:04] (300)
[15:25:20] hopefully I've made enough space to rebase --abort, trying that now
[15:25:30] ok, done
[15:25:48] now I'll look at the deploy script and see why it's leaking an entire git repo on every run :)
[15:27:00] ack
[15:27:16] https://www.irccloud.com/pastebin/56mAxNer/
[15:27:29] 9% free :)
[15:27:46] Oh great, it's a bash script
[15:28:45] is that good or bad?
[15:29:57] bad
[15:30:25] I mean, it's fine, I'm just surprised that a jbond wrote a utility in 2023 using bash and not python
[15:31:39] if it's a git hook, maybe they wanted to avoid needing a venv/extra dependencies or something
[15:31:56] dcaro: regarding maintain_kubeusers, I have the feeling the script is somehow always using the _previous_ container image
[15:32:52] hmmm
[15:33:00] maybe the diff wrt. my setup is that I built the mk-test container once outside lima-kilo
[15:33:18] the script copies everything from your laptop to inside lima-kilo (the workdir) so maybe that's the diff here
[15:33:57] I deleted the image from lima-kilo, and from the toolforge-control-node (crictl rmi), re-deployed maintain-kubeusers to pick the upstream image (from tools-harbor), and reran with the same issue
[15:34:02] you are running the helper from outside lima-kilo, no?
[15:34:03] (the tests failing)
[15:34:06] yep
[15:34:39] the included python code updates should fix that problem you are seeing, that's why I suspect the wrong container image
[15:35:36] a335ff50d9931e2638db3a0af681cc2685f39c59 <- this commit hash?
[15:36:01] (I only ran from that commit of maintain-kubeusers)
[15:36:45] that seems to be the commit, yes
[15:37:08] I deleted all the recordings though, before running (so it would generate them again)
[15:37:14] do you think that should matter?
[15:37:19] the script does it for you anyway
[15:37:52] inside the docker container?
[15:38:47] the helper script should rm the cassettes both inside the maintain_kubeusers container image and your laptop's git repo
[15:38:52] failed also with the cassettes, now I ran with a clean unmodified a335ff50d9931e2638db3a0af681cc2685f39c59
[15:39:19] can you check the deployment image again?
[15:40:07] and maybe paste here the full output of the helper script?
[15:40:27] https://www.irccloud.com/pastebin/ntxI2YYG/
[15:41:23] https://phabricator.wikimedia.org/P61496 <- full paste
[15:41:45] * dcaro rerunning again, see if an extra pass does anything
[15:42:07] what do you have for imagePullPolicy?
[15:42:19] I got a pdb console
[15:42:37] https://www.irccloud.com/pastebin/iNdCbd94/
[15:42:46] probably coming from the deploy script
[15:43:13] (as it imports it directly)
[15:43:36] same here anyway
[15:44:01] mmm wait
[15:44:16] you have been playing with the fourohfour tool today, no?
[15:44:55] thanks for fixing the puppetmaster issue! do you think that the inode problem also prevented any alerts from firing?
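A rough Python counterpart to the `du --inodes -xS` hunt above, counting directory entries per directory to spot things like the leaked oot_branch* trees; the default path is an assumption for illustration:

```python
import os
from collections import Counter

def inode_hogs(root: str = "/srv/git", top: int = 10):
    """Count entries per directory (roughly one inode each), similar in spirit to `du --inodes -xS`."""
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        counts[dirpath] = len(dirnames) + len(filenames)
    return counts.most_common(top)

for path, n in inode_hogs():
    print(n, path)
```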
[15:45:03] https://www.irccloud.com/pastebin/TrycFCn3/
[15:45:21] arturo: yesterday, but I completely recreated lima-kilo (thrice today)
[15:45:47] dhinus: that's a good point, it should not :/
[15:46:23] https://www.irccloud.com/pastebin/yZyqbmm8/
[15:46:59] arturo: ^the image seems to be the same (though the size is a bit different, the hash is the same :/, probably the manifest? or crictl vs docker counting differently)
[15:47:56] arturo: how did you fix it in the code? (I have an interactive console with the test error, I can dig around a bit to see if the code is the same)
[15:48:45] https://www.irccloud.com/pastebin/YwLohp7a/
[15:48:46] dcaro: this diff https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/19/diffs#diff-content-46ecef6a906e8e77d1ec9964a292526a595ad881
[15:49:51] https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/19/diffs#note_80441
[15:49:56] I think that's missing the same fix maybe?
[15:50:52] I get all the users
[15:50:54] https://www.irccloud.com/pastebin/4kgNXkA1/
[15:51:03] but it seems to expect only the blarp/blurp ones
[15:52:07] yes
[15:52:13] but I don't get why it works on my machine
[15:52:57] you can add a 'pytest.set_trace()' at that point, and run again; that will give you an interactive python console at that point
[15:53:31] then `api_object.get_current_user_data()` will show what you get, might be different
[15:55:23] hmm, I see that what it does is list the namespaces that have the configmap that maintain-kubeusers creates
[15:56:52] https://www.irccloud.com/pastebin/iTcq3KMm/
[15:57:03] so it seems it's getting the correct list
[15:57:16] maybe they are not there for the existing users yet for you?
[15:57:24] * dcaro gtg in a minute
[15:57:35] I'm also running out of time, we can continue next week
[15:57:54] thanks for the help and the debugging session
[15:58:00] np
[15:58:21] cya!
[15:58:26] * dcaro off
[15:58:32] o/
[15:59:55] I uploaded a new patch version anyway
[15:59:58] * arturo offline
[17:01:24] dcare: i'm pretty sure this patch would've fixed the puppetserver inodes issue. That said, I can't find the same problem on any other puppetservers so it may be unneeded. I'm not 100% clear on other side effects of -force, it might make things slower. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1025818
[17:01:29] um... dcaro ^
[23:04:46] * bd808 off and traveling tomorrow
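A minimal illustration of the pytest.set_trace() suggestion from 15:52 above; the test name and fixture are stand-ins for the actual maintain-kubeusers test around tests/test_api.py:442 (blarp/blurp are the test users mentioned in the chat), not its real code:

```python
import pytest

def test_get_current_user_data(api_object):
    # Drop into an interactive debugger right before the assertion,
    # so the returned user data can be inspected by hand.
    pytest.set_trace()
    data = api_object.get_current_user_data()
    # The test only expects the freshly created test users; extra pre-existing
    # users showing up here is what triggered the AssertionError discussed above.
    assert set(data) == {"blarp", "blurp"}
```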