[07:38:08] morning
[08:06:09] morning
[08:09:20] Raymond_Ndibe: dhinus the issue with lima-kilo multi-control node comes from the kubeconfig
[08:09:24] https://www.irccloud.com/pastebin/nFeqoOEE/
[08:13:10] and that comes from the kubeconfig inside the maintain-kubeusers pod
[08:15:30] and that comes from a configmap `cluster-info` in the `kube-public` namespace xd
[08:16:42] https://www.irccloud.com/pastebin/4X5nqSG4/
[09:29:49] arturo: deployment-prep had a DNS failure at 11:22:20 for commons.wikimedia.org :D
[09:30:09] :-(
[09:30:15] but at least the CI instance hasn't had any DNS failures since Monday morning
[09:30:42] my guess is deployment-prep makes more DNS requests overall and is thus more likely to surface the issue
[09:37:25] I have no idea what is going on
[09:37:55] dcaro: FYI cloudservices1005 also has the smartd offline uncorrectable sectors error, counter not increasing
[09:37:57] https://www.irccloud.com/pastebin/dFpyBbeb/
[09:38:39] interesting
[09:39:54] arturo: I don't know either :-/ At least for CI it hasn't happened since Monday morning. I am sure the root cause will be found eventually.
[10:06:06] * dcaro lunch
[10:07:56] cloudservices1005 has the same model drives as ceph
[10:07:59] https://www.irccloud.com/pastebin/kJtm1Ogz/
[10:08:11] (the ones with the issue on the Dell side), interesting
[10:09:02] did we ever see any issues on that host?
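[Editor's note: the smartd counter discussed above can be checked by hand with smartctl. A minimal sketch; the hostname and `/dev/sda` device path are placeholders, not taken from the log.]

```shell
# Check the offline uncorrectable sector counter that smartd alerted on.
# /dev/sda is a placeholder; use the drive smartd actually complained about.
sudo smartctl -A /dev/sda | grep -i 'Offline_Uncorrectable'

# Overall SMART health self-assessment for the same drive:
sudo smartctl -H /dev/sda
```

Comparing the raw counter value over time is what confirms whether it is increasing or stable, as discussed above.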
[10:10:24] not that I know of
[10:15:06] my understanding is that the issue is only triggered by Ceph doing very intensive reads/writes to the drive
[10:16:33] unrelated: Horizon is timing out when listing Trove instances, but only for project tofuinfratest
[10:16:43] I can list Trove instances on that project with the CLI
[10:17:43] it's not even that slow with the CLI
[10:17:48] so I'm not sure why Horizon is timing out
[10:18:08] I'll try deleting some Trove instances, as they are all failed test instances
[10:18:13] you have different credentials when using Horizon compared to when using the CLI
[10:19:16] you may need some additional role in that project to interact via Horizon, which may or may not be a desired thing. The same happens with security groups
[10:20:22] hmm, how can I check that? I guess I could try using the same creds that Horizon uses
[10:22:36] I deleted all Trove instances in that project from the CLI, and now Horizon is working (showing no instances)
[10:24:44] I'll check again after the daily run of "tf-infra-test" (scheduled 2 hours from now)
[10:26:56] the hosts that have those drives are (besides the ceph ones):
[10:27:01] https://www.irccloud.com/pastebin/ame9Xmta/
[10:31:00] arturo: quick review? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/95
[10:31:08] 👀
[10:32:11] dhinus: LGTM. Please run plan for the MR to make sure it is a noop
[10:32:18] sure
[10:34:56] done
[11:40:53] reviews welcome https://gerrit.wikimedia.org/r/c/operations/alerts/+/1078922
[14:10:20] tf-infra-test is still failing even after cleaning up all the failed trove instances, I created T376802
[14:10:20] T376802: tf-infra-test fails creating dbs and k8s cluster - https://phabricator.wikimedia.org/T376802
[15:17:07] dcaro: I did reset the trove quotas, but it's still not working
[15:17:19] I can see the instances are stuck in "Building" in horizon
[15:17:20] :/ damn
[15:17:39] I'll try manually creating a trove instance in a different project
[15:17:48] might be worth also checking logstash
[15:17:58] (if you have an instance id or similar to filter with)
[15:18:11] true
[15:20:28] logstash is giving me a 500 :D
[15:23:38] I can replicate the issue with a manually created instance
[15:24:22] so I think trove is actually broken, at least creating new instances is not working
[15:25:26] logstash is now working, let's see if I can find something there
[15:28:09] maybe first steps could be: 1) restart rabbitmq, 2) restart the trove services via the cookbook
[15:31:35] this was an issue with the IDPs, it's resolved now
[15:48:23] arturo: thanks. I'm not finding much in logstash, I will try restarting rabbit+trove, but I'll do it tomorrow morning to avoid breaking more things :)
[15:48:38] fair
[15:57:19] * arturo offline
[17:15:01] * dcaro off
[23:05:28] T376847 will probably look like a huge quota bump, but I think if you compare the end state to Toolforge's quota and then think about how many folks benefit from Jenkins CI, it is not too wild.
[23:05:29] T376847: Quota increase for Integration project (Jenkins CI runners) - https://phabricator.wikimedia.org/T376847
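[Editor's note: the Trove CLI work described in the log (listing instances, deleting the failed ones, and watching a manually created instance stuck in "Building") maps onto the standard OpenStack database commands. A sketch, assuming credentials for the right project are already set up; `<instance-id>` is a placeholder, not an identifier from the log.]

```shell
# List Trove instances in the currently scoped project
# (e.g. tofuinfratest, once OS_* env vars / clouds.yaml point at it).
openstack database instance list

# Delete a failed test instance by name or ID:
openstack database instance delete <instance-id>

# Watch a freshly created instance; a healthy create moves
# BUILD -> ACTIVE, while a stuck one stays in BUILD indefinitely.
openstack database instance show <instance-id> -c status
```

The instance ID shown by `list` is also a useful logstash filter, as suggested in the log.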