[09:08:52] hmm... apt stopped installing the CI-built packages over the ones already installed in lima-kilo (for jobs-cli at least)
[09:09:02] Note, selecting 'toolforge-jobs-framework-cli' instead of './toolforge-jobs-framework-cli_16.1.13_all.deb'
[09:09:16] this is the command line used: 'sudo env DEBIAN_FRONTEND=noninteractive apt install --yes --reinstall --allow-downgrades ./toolforge-jobs-framework-cli_16.1.13_all.deb'
[09:09:36] is there any way to force it to use that file, like, a for-sure-for-sure, ignore-everything-else way?
[09:10:38] even if I remove the package it still ignores the one I downloaded
[09:26:39] dhinus: for the `zuul-runners` project creation at T396540, I will ask to rename it to `zuul-workers`, which is the naming used by upstream
[09:26:39] T396540: Request creation of zuul-runners VPS project - https://phabricator.wikimedia.org/T396540
[09:30:20] or maybe zuul-nodes :)
[09:37:50] dhinus and taavi: more of an FYI, I'm reworking parts of the replication to the wikireplicas to use the table catalog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1155210
[09:38:21] (I'll probably remove the "fullViews" section of the maintain-views file after this)
[09:38:44] replacing it with the list of tables that are marked as public
[09:54:45] Amir1: awesome! I started some refactoring on maintain-views but I was sidetracked by other things... please go ahead with your changes
[09:55:29] I hope it reduces the toil a bit
[09:56:35] big +1 from me
[10:22:45] quick review for fixing some weirdness in lima-kilo's toolforge_deploy_mr.py when installing packages: https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/247
[10:33:46] kinda small review also, to fix an unexpected non-backwards-compatible change in the jobs-api: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/810 (3 patches there: the tests, the api and the cli, linked in the comments)
[12:19:09] * taavi brb
[12:30:31] taavi: no rush, but when you're back can you review this one? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1155229
[12:35:16] dhinus: you can add the extra stuff that made it fail to the test here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/changes/29/1155229/3/modules/profile/spec/classes/profile_wmcs_services_maintain_dbusers_spec.rb#10
[12:35:33] (lgtm, though I'm not sure how to test)
[12:36:34] hmm, not sure if we can make it use the hiera value from https://gerrit.wikimedia.org/g/operations/puppet/+/ef665072e352d0928e627a8033773a58d5a3ce6c/hieradata/cloud.yaml#690
[12:36:39] (for the tests)
[12:36:58] dcaro: the test is that PCC was failing before the fix, and it's not failing now :)
[12:37:04] the value for the port looks different
[12:37:18] dhinus: then it's already testing the fix :) nice
[12:38:47] maybe try pushing a separate patch removing 'section_ports' from the test, and see if it works
[12:40:14] I vaguely remember there was some weirdness with hiera loading and the tests, but I can try
[12:41:24] dhinus: lgtm
[12:53:48] thanks, merged!
[13:54:51] for my future self... doing an HTTP request with only bash: `exec 5<>/dev/tcp/127.0.0.1/8000; echo -e "GET /metrics HTTP/1.1\r\nConnection: close\r\n\r\n" >&5; cat <&5;`
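[editor's note] An expanded, commented version of the raw-bash HTTP trick above, as a minimal sketch (same host and port as in the log; /dev/tcp is a bash-only feature, and the added Host header is an addition here since HTTP/1.1 strictly requires it even though many servers tolerate its absence):

    # open file descriptor 5 as a read/write TCP connection (bash's /dev/tcp pseudo-device)
    exec 5<>/dev/tcp/127.0.0.1/8000
    # send a minimal HTTP/1.1 request; "Connection: close" tells the server to end the stream when done
    printf 'GET /metrics HTTP/1.1\r\nHost: 127.0.0.1\r\nConnection: close\r\n\r\n' >&5
    # print the whole response (status line, headers and body), then close the descriptor
    cat <&5
    exec 5>&-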
[14:22:45] bd808: heyo, you are still in the "paging hierarchy" for the WMCS VictorOps config, do you want to be removed from it?
(as in, you are very welcome to stay, but I want to make sure you have a chance to opt out if you want xd)
[15:03:09] andrewbogott, chuckonwu: the volume that's stuck, if you are interested, is 12198943-e569-45fb-8557-829513813ede 'toolsbeta-prometheus-1', from the toolsbeta project
[15:03:36] Are the contents of that volume important, or are you trying to delete the volume along with the attachment?
[15:05:55] no, the instance was removed already
[15:06:15] and yes, I'm trying to delete the volume :)
[15:06:47] dhinus, chuckonwu: this is the patch for the toolforge_deploy_mr fix: https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/247 , feel free to review if it works for you too
[15:06:49] great, then I won't worry about breaking things
[15:07:02] {:
[15:08:11] This is the kind of problem that at least looks like rabbitmq split-brain, so I'm going to start by rebuilding the rabbit cluster
[15:19:10] sorry about the alerts, they will clear shortly
[15:21:35] dhinus: added some extra logs to make debugging the deploy filters and such easier: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/812
[15:25:51] dhinus, chuckonwu: and this is the fix for the missing tests for toolforge-weld: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/813
[15:31:08] andrewbogott: thanks for dealing with all that, feel free to ask for help if you need some
[15:31:44] the rabbit rebuild does not seem to have helped at all
[15:31:58] hmpf
[15:32:11] could it be that it's stuck in a bad state? (caused by the rabbit issue)
[15:32:37] yeah, likely. I'll probably dive into the database shortly
[15:37:00] dcaro: I haven't taken a turn being active in the pager rotation for years at this point. I don't really care either way about having the account. If giving up a seat puts money back in the WMF's pocket then it probably should happen.
[15:42:12] bd808: oh, I think you are actually active in the rotation for WMCS, that's why I ask (maybe you have not gotten paged lately? that'd be a data point for us getting better at it xd)
[15:47:24] dcaro: I'm in the WMCS group, but I haven't been in an active rotation since sometime before Nicholas left.
[15:48:12] https://usercontent.irccloud-cdn.com/file/VhmGgc4W/image.png
[15:48:24] you are still there :), though only if nobody replied earlier
[15:49:13] I'll guess then that we always replied before it escalated xd
[15:49:39] being in the "oops nobody answered" group is fine with me. I don't think I have gotten a page in 3+ years though.
[15:50:24] 👍
[15:50:28] These things are apparently not exposed well in the mobile app.
[15:51:10] yep, and it's quite confusing on the web too: those groups collapse without any sign of being collapsed, making them easy to miss if you don't know that clicking them expands them
[15:54:38] quick review too: https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/85 , prometheus fix stuff
[15:57:54] dcaro: Looking at more screens, I think at this point I would only get a page for the "everyone anytime" group, which is only triggered if things go for 30 minutes without being acked. That's probably why I haven't gotten paged in quite a long time.
[15:59:29] "Everyone awake in the zone" only has andrewbogott, dhinus, and dcaro listed. And taavi isn't even in the team at this point.
[16:01:10] yep, it's the fallback group, in case the oncall does not reply and nobody in the zone replies either; taavi having been in-and-out and contract-based made it hard to track I guess (but taavi, I'll happily add you to the rotation if you want xd)
[17:19:48] andrewbogott, dcaro: I took a look at toolsbeta-prometheus-1 and tried manually setting the volume to 'available' status, but a policy rule blocked that. Since the instance OpenStack thinks the volume is attached to doesn't exist, the volume should be reset. https://www.irccloud.com/pastebin/CYic8dRp/
[17:20:49] hm, I haven't tried resetting it yet. The timeout makes me think there's some internal reference to a cloudcontrol that doesn't exist anymore, so I've been chasing that.
[17:20:56] But I will try resetting, maybe that's the easy fix
[17:21:58] hm, nope. Resetting will probably allow us to delete the volume, but I don't want to do that until the attachment is deleted, or it'll just be twice an orphan
[17:33:16] chuckonwu: I'm going to just delete it from the database unless you want to keep tinkering
[17:33:49] It's fine to reset it, andrewbogott
[17:37:22] ok, done, now you should be able to delete the volume
[17:38:27] andrewbogott, chuckonwu: quick review? https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/48
[17:38:43] that's the one I wanted to merge when I found the volume issue
[17:39:18] I've approved it
[17:39:39] gitlab sure does log me out every few hours
[17:39:40] thanks :), let's see if it's able to delete the volume then https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/pipelines/114426
[17:39:53] anyway, lgtm
[17:40:05] andrewbogott: that used to happen to me too, not sure what I changed but lately it's better (alertmanager also)
[17:41:12] `│ Error: Error deleting openstack_blockstorage_volume_v3 12198943-e569-45fb-8557-829513813ede: Expected HTTP response code [202 204] when accessing [DELETE https://openstack.eqiad1.wikimediacloud.org:28776/v3/volumes/12198943-e569-45fb-8557-829513813ede], but got 400 instead: {"badRequest": {"code": 400, "message": "Invalid volume: Volume status must be available or error or error_restoring or error_extending or error_managing and must not be migrating, attached, belong to a group, have snapshots, awaiting a transfer, or be disassociated from snapshots after volume transfer."}}`
[17:41:14] nope
[17:41:33] so many statuses xd
[17:41:42] ok, I'll just delete it here
[17:41:46] thanks :)
[17:41:59] * dcaro off
[17:42:12] cya tomorrow! thanks again for the volume stuff, chuckonwu and andrewbogott!
[17:46:16] wow this volume is all sorts of cursed
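[editor's note] For reference, a rough sketch of the usual admin-side command sequence for a volume stuck in an "in-use"/attached state like this one, using the volume ID from the conversation above (exact flags depend on the python-cinderclient and openstackclient versions deployed, and per the chat the actual fix here ended up being done directly in the database):

    # inspect the volume and its stale attachment record
    openstack volume show 12198943-e569-45fb-8557-829513813ede
    # admin-only: force the state back to "available" and mark the volume as detached
    cinder reset-state --state available --attach-status detached 12198943-e569-45fb-8557-829513813ede
    # once the status and attach status are clean, the delete should go through
    openstack volume delete 12198943-e569-45fb-8557-829513813ede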