[07:45:05] morning, tools-prometheus-8 seems to have been unstable also yesterday :/
[07:48:59] hmm.... similar behavior: `Jun 22 00:00:04 tools-prometheus-8 prometheus-blackbox-exporter[571]: time=2025-06-22T00:00:04.425Z level=ERROR source=tcp.go:131 msg="Error dialing TCP" module=ssh_banner target=tools-k8s-worker-nfs-41:22 err="lookup tools-k8s-worker-nfs-41 on 172.20.255.1:53: dial udp 172.20.255.1:53: connect: network is unreachable"`
[07:49:47] `Jun 21 23:54:54 tools-prometheus-8 systemd-timesyncd[469]: Network configuration changed, trying to establish connection.`
[07:50:06] that's the first log I see that looks suspicious
[07:52:06] https://usercontent.irccloud-cdn.com/file/QJkYJ7jm/image.png
[07:52:14] it was rebooted a little after it seems
[07:53:00] I see, andrewbogot.t rebooted it through a cookbook because it was not responding I guess
[07:55:04] i'm still suspicious that the prometheus issues are due to query load, since they've only been happening with the active host (-8) and not with the other one (-9)
[07:55:32] ideally we'd use thanos or something to spread the queries across those two hosts, but for now just flipping the proxy to the other host could be used to validate that
[07:59:18] there was no peak load this time I think
[08:00:45] nor ram/cpu etc.
[08:02:08] there is a drop in network usage on -9 too
[08:02:11] https://usercontent.irccloud-cdn.com/file/3bhFf9q7/image.png
[08:05:46] I don't see anything in the logs though
[08:07:54] it matches the time of the other prometheus too. It's when logrotate runs (00:00:00) and reloads apache
[08:12:35] from the cloudvirt-72 logs:
[08:12:36] ```
[08:12:39] https://www.irccloud.com/pastebin/gM86RIV9/
[08:12:52] that's the network of that VM
[08:12:54] https://www.irccloud.com/pastebin/kcH9xLB0/
[08:13:06] that's probably the reboot though
[08:57:28] morning! +1 for flipping the proxy to -9 to check if the problem moves there
[08:57:43] does the drop in network usage match the time of the reboot? was the reboot for -9?
[09:20:02] for -8
[09:20:08] * dcaro away to the doctor
[09:38:17] anyone has any idea about the diff https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/51 is showing on tools?
[09:47:08] taavi: hmm no, seems those instances lost their sec groups? but I can see those groups in horizon
[10:58:44] Me neither :/
[13:16:09] after some digging I think that openstack is not letting the tofu user read port details from the api anymore
[13:16:13] andrewbogott: ^ rings any bell?
[13:17:59] taavi: yes, probably https://gerrit.wikimedia.org/r/c/operations/puppet/+/1162178. I'll just revert that and try again later, it didn't accomplish what I wanted anyway.
[13:19:03] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1162901
[13:20:00] tbh I'm still not sure why this is breaking toolsbeta only, but still worth a try I think
[13:37:07] taavi, that revert is deployed. want to retry?
[13:37:14] sure, give me a second
[13:38:19] > No changes. Your infrastructure matches the configuration.
[13:38:23] yeah, that did it
[13:38:53] good sleuthing :)
[13:40:27] this also raises questions, but glad it's fixed for now!
[13:41:37] dhinus: while you're here, can I get a +1 for https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/51?
[13:42:07] sure!
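(An aside on the "tofu user can no longer read port details" issue above: a minimal sketch of how the symptom could be reproduced with openstacksdk. The clouds.yaml entry name "toolsbeta-tofu" is a hypothetical stand-in for wherever the tofu application credential lives; this illustrates the check, it is not the command anyone in the channel actually ran.)

```python
# Sketch only: can this application credential still list Neutron ports?
# "toolsbeta-tofu" is an assumed clouds.yaml entry, not a real deployment name.
import openstack
from openstack import exceptions as os_exc

conn = openstack.connect(cloud="toolsbeta-tofu")

try:
    ports = list(conn.network.ports())
    # Depending on the policy, a credential carrying only 'member' might get an
    # empty/partial list instead of an error, which would make a tofu plan
    # think ports and their security groups had disappeared.
    print(f"ports visible to this credential: {len(ports)}")
except os_exc.HttpException as exc:
    print(f"port listing failed: {exc}")
```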
[13:42:46] +1d, not sure if they have to be empty for tofu to destroy them
[13:44:09] they're empty either way
[13:48:22] in theory https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/50 re-creates buckets in the right project, if I have the multi-project logic working correctly
[13:52:21] looks promising
[13:53:30] in particular I'm not sure whether we will need per-project app credentials
[13:55:57] yeah likely, or can you generate creds that span 2 projects, but not all projects?
[13:57:00] the service account has only access to those projects
[13:58:05] the main thing is that i think that app credentials are supposed to be per-project. i'm not sure if the diff means that i'm wrong or whether creation would still fail (or whether it'd be created in the wrong project?)
[13:58:49] mostly I worry that having to maintain more than 1 app cred per deployment by hand is going to get annoying rather quickly
[13:59:35] if it's two I think it might be manageable
[13:59:58] yeah
[14:00:10] but I think harbor will take that to three at some point
[14:00:11] and so on
[14:00:14] do you want to try and apply that one to see what happens?
[14:00:19] sure
[14:00:28] I +1d it
[14:00:53] merging
[14:03:24] https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/jobs/544223
[14:03:52] well it claims it created them
[14:04:30] aaaand they're in the toolsbeta project
[14:05:04] dhinus: I think we need a separate application credential
[14:05:48] let's add them for now, but I will also do some research on what options we have if we will need more than 2 in the future
[14:07:41] give me a second to set that up
[14:10:15] andrewbogott: I think the issue with the new policy is that it required the 'reader' role to read things, while the old/current policy also allowed 'member'
[14:11:17] Looking! which rule are you looking at specifically?
[14:12:25] I don't know exactly, I just noticed that difference between the tools and toolsbeta app creds
[14:12:46] ok, I see what you mean...
[14:13:18] * andrewbogott annoyed that https://review.opendev.org/c/openstack/neutron/+/886214 died without a merge
[14:19:51] dhinus: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/52 might work, and we likely need to manually delete the buckets from the wrong project
[14:25:41] I wonder if we still need to provide the tenant_id if the creds are mapped to a specific tenant
[14:25:58] dhinus: I have gone in the end for adding a long_status for the build instead, added it to the clis too (and found a bug)
[14:26:15] dcaro: ack
[14:28:00] taavi: or maybe use tenant_name to avoid hardcoding the alphanumeric tenant_id?
[14:28:13] let's try just removing it
[14:28:42] they both look optional
[14:29:02] but I'm not sure if keystone will like it
[14:30:03] actually, let's merge this as is and see if we can remove them separately? since the main provider already has it, and I don't want to do that unrelated change here, and I also don't want to make them inconsistent
[14:31:55] taavi: sgtm
[15:07:02] re tools-static: https://phabricator.wikimedia.org/T397634#10938688
[15:44:57] taavi: thanks!
[15:52:59] the prometheus vm flap https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/53
[15:53:33] ci is struggling though
[15:54:33] > no space left on device
[15:54:37] sounds like a general CI issue?
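(A note on the buckets ending up in the toolsbeta project above: Keystone application credentials are scoped to a single project, so containers created through a credential issued in toolsbeta will land in toolsbeta regardless of what the tofu config intends. A minimal sketch for checking where a deployment's containers actually live, assuming hypothetical clouds.yaml entries "tools-tofu" and "toolsbeta-tofu", one per project-scoped credential.)

```python
# Sketch only: list object-store containers as seen through each
# project-scoped application credential, to confirm which project the
# buckets were really created in. Cloud names are assumed, not real.
import openstack

for cloud in ("tools-tofu", "toolsbeta-tofu"):
    conn = openstack.connect(cloud=cloud)
    names = sorted(container.name for container in conn.object_store.containers())
    print(f"{cloud}: {names}")
```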
[15:56:10] yep, they are looking into it (asked in #wikimedia-gitlab)
[16:07:49] ci is passing now, https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/53 ready for review
[16:08:44] ship it
[16:18:52] 🚢
[16:57:29] dang it, apparently I didn't manage to save the correct password for the toolsbeta-logging service account I created. (in case you wonder about that password reset mail)
[17:04:32] I was!
[17:19:33] result of the above: T397651
[17:19:34] T397651: CAS not letting new Toolsbeta-logging developer account log in - https://phabricator.wikimedia.org/T397651