[07:38:08] morning
[08:06:09] morning
[08:09:20] Raymond_Ndibe: dhinus the issue with lima-kilo multi-control node comes from the kubeconfig
[08:09:24] https://www.irccloud.com/pastebin/nFeqoOEE/
[08:13:10] and that comes from the kubeconfig inside the maintain-kubeusers pod
[08:15:30] and that comes from a configmap `cluster-info` in the `kube-public` namespace xd
[08:16:42] https://www.irccloud.com/pastebin/4X5nqSG4/
[09:29:49] arturo: deployment-prep had a DNS failure at 11:22:20 for commons.wikimedia.org :D
[09:30:09] :-(
[09:30:15] but at least the CI instance hasn't had any DNS failures since Monday morning
[09:30:42] my guess is deployment-prep makes more DNS requests overall and is thus more likely to surface the issue
[09:37:25] I have no idea what is going on
[09:37:55] dcaro: FYI cloudservices1005 also has the smartd offline uncorrectable sectors error, counter not increasing
[09:37:57] https://www.irccloud.com/pastebin/dFpyBbeb/
[09:38:39] interesting
[09:39:54] arturo: I don't know either :-/ At least for CI it hasn't happened since Monday morning. I am sure the root cause will be found eventually.
[10:06:06] * dcaro lunch
[10:07:56] cloudservices1005 has the same model drives as ceph
[10:07:59] https://www.irccloud.com/pastebin/kJtm1Ogz/
[10:08:11] (the ones with the issue on the Dell side), interesting
[10:09:02] did we ever see any issues on that host?
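[Editor's note: the smartd counter discussed above can be checked by hand with smartctl. A minimal sketch; the hostname and `/dev/sda` device path are placeholders, not taken from the log.]

```shell
# Check the offline uncorrectable sector counter that smartd alerted on.
# /dev/sda is a placeholder; use the drive smartd actually complained about.
sudo smartctl -A /dev/sda | grep -i 'Offline_Uncorrectable'

# Overall SMART health self-assessment for the same drive:
sudo smartctl -H /dev/sda
```

Comparing the raw counter value over time is what confirms whether it is increasing or stable, as discussed above.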
[10:10:24] not that I know of
[10:15:06] my understanding is that the issue is only triggered by Ceph doing very intensive reads/writes to the drive
[10:16:33] unrelated: Horizon is timing out when listing Trove instances, but only for project tofuinfratest
[10:16:43] I can list Trove instances on that project with the CLI
[10:17:43] it's not even that slow with the CLI
[10:17:48] so I'm not sure why Horizon is timing out
[10:18:08] I'll try deleting some Trove instances, as they are all failed test instances
[10:18:13] you have different credentials when using Horizon compared to when using the CLI
[10:19:16] you may need some additional role in that project to interact via Horizon, which may or may not be a desired thing. The same happens with security groups
[10:20:22] hmm, how can I check that? I guess I could try using the same creds that Horizon uses
[10:22:36] I deleted all Trove instances in that project from the CLI, and now Horizon is working (showing no instances)
[10:24:44] I'll check again after the daily run of "tf-infra-test" (scheduled 2 hours from now)
[10:26:56] the hosts that have those drives are (besides the ceph ones):
[10:27:01] https://www.irccloud.com/pastebin/ame9Xmta/
[10:31:00] arturo: quick review? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/95
[10:31:08] 👀
[10:32:11] dhinus: LGTM. Please run plan for the MR to make sure it is a noop
[10:32:18] sure
[10:34:56] done
[11:40:53] reviews welcome https://gerrit.wikimedia.org/r/c/operations/alerts/+/1078922
[14:10:20] tf-infra-test is still failing even after cleaning up all the failed trove instances, I created T376802
[14:10:20] T376802: tf-infra-test fails creating dbs and k8s cluster - https://phabricator.wikimedia.org/T376802
[15:17:07] dcaro: I did reset the trove quotas, but it's still not working
[15:17:19] I can see the instances are stuck in "Building" in horizon
[15:17:20] :/ damn
[15:17:39] I'll try manually creating a trove instance in a different project
[15:17:48] might be worth also checking logstash
[15:17:58] (if you have an instance id or similar to filter with)
[15:18:11] true
[15:20:28] logstash is giving me a 500 :D
[15:23:38] I can replicate the issue with a manually created instance
[15:24:22] so I think trove is actually broken, at least creating new instances is not working
[15:25:26] logstash is now working, let's see if I can find something there
[15:28:09] maybe first steps could be: 1) restart rabbitmq, 2) restart the trove services via the cookbook
[15:31:35] this was an issue with the IDPs, it's resolved now
[15:48:23] arturo: thanks. I'm not finding much in logstash, I will try restarting rabbit+trove, but I'll do it tomorrow morning to avoid breaking more things :)
[15:48:38] fair
[15:57:19] * arturo offline
[17:15:01] * dcaro off
[23:05:28] T376847 will probably look like a huge quota bump, but I think if you compare the end state to Toolforge's quota and then think about how many folks benefit from Jenkins CI, it is not too wild.
[23:05:29] T376847: Quota increase for Integration project (Jenkins CI runners) - https://phabricator.wikimedia.org/T376847
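[Editor's note: the Trove CLI work described in the log (listing instances, deleting the failed ones, and watching a manually created instance stuck in "Building") maps onto the standard OpenStack database commands. A sketch, assuming credentials for the right project are already set up; `<instance-id>` is a placeholder, not an identifier from the log.]

```shell
# List Trove instances in the currently scoped project
# (e.g. tofuinfratest, once OS_* env vars / clouds.yaml point at it).
openstack database instance list

# Delete a failed test instance by name or ID:
openstack database instance delete <instance-id>

# Watch a freshly created instance; a healthy create moves
# BUILD -> ACTIVE, while a stuck one stays in BUILD indefinitely.
openstack database instance show <instance-id> -c status
```

The instance ID shown by `list` is also a useful logstash filter, as suggested in the log.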