[09:02:18] morning
[09:06:36] o/
[09:42:20] morning
[10:48:20] dhinus: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/118 is ready to be tested and reviewed
[10:53:34] arturo: I'll look at it in a moment
[11:15:28] blancadesal: good news! it turns out that fastapi does "magic™" and non-asynchronous code does not block the api itself (like it would using flask), I'm doing more tests but we might not need any async stuff
[11:15:33] (for toolforge cli/k8s cli)
[11:29:51] yep, we should be good without using async libs :), that simplifies a lot of stuff
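A minimal sketch of the fastapi behaviour described above: plain `def` endpoints are run in a worker threadpool by Starlette/AnyIO, so a blocking call in one handler does not stall the event loop for other requests. The endpoint names and the sleep are illustrative only, not taken from the actual toolforge code.

```python
import time

from fastapi import FastAPI

app = FastAPI()


@app.get("/slow")
def slow_endpoint():
    # Plain (non-async) handler: FastAPI offloads it to a threadpool,
    # so this blocking sleep does not block the event loop.
    time.sleep(2)
    return {"done": True}


@app.get("/fast")
async def fast_endpoint():
    # Runs on the event loop; stays responsive while /slow is sleeping.
    return {"ok": True}
```

Hitting /fast while /slow is in flight returns immediately; the same blocking sleep inside an `async def` handler would stall every request.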
[11:34:20] dcaro: \o/
[11:48:39] * dcaro lunch
[13:06:52] can I get a +1 here https://phabricator.wikimedia.org/T378975 or here https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/119 ?
[13:11:45] done
[13:36:30] I've updated the project description at https://phabricator.wikimedia.org/project/profile/2875/ removing the "no dash" rule, but keeping the "no underscore" one
[13:37:31] dcaro: can I get a quick re-review of https://gerrit.wikimedia.org/r/c/operations/alerts/+/1084782 ?
[13:40:50] +1d
[13:40:53] thanks!
[15:21:44] dcaro: can we merge this https://gerrit.wikimedia.org/r/c/operations/puppet/+/1087968 ?
[15:22:09] Raymond_Ndibe: sure :), on it
[15:22:59] Raymond_Ndibe: that was a really nice finding! well done
[15:24:00] Raymond_Ndibe: merged, you can now run puppet on the nodes to make sure it works 👍
[15:25:12] 🎉🎉🎉🎉
[15:30:11] verified that it is working as expected. closing the issue as resolved
[15:30:38] 🎉!
[15:38:19] dhinus: alert `Server cloudvirt1063 may have kernel errors`. This is "somewhat" expected if the server booted recently
[15:41:19] yes I've just forced a reboot to test the alert
[15:41:34] ok
[15:41:38] the kernel error is
[15:41:41] https://www.irccloud.com/pastebin/vMaXRvo7/
[15:41:46] there were other errors before the reboot, I'll wait to see if it settles, we might need to reimage
[15:41:47] so totally something we can ignore
[15:42:24] dcops just replaced the CPU and mainboard on that one
[15:42:28] we can ignore it, but should we? https://www.kernel.org/doc/html/next/x86/sgx.html
[15:43:50] I think I should probably reimage, and hope that resets a few things. wdyt?
[15:52:18] I'm gonna kick off the reimage
[15:58:21] hmph, the reimage is failing with "Error: Unable to establish IPMI v2 / RMCP+ session"
[15:59:02] oh
[15:59:07] I have never seen that error before
[15:59:26] you may want to let DCops review it
[15:59:52] if the motherboard was replaced, I wonder if the IPMI ethernet connector is well seated, or something
[16:01:02] quota increase request for the tools project: https://phabricator.wikimedia.org/T379271
[16:01:02] quota increase request for the toolsbeta project: https://phabricator.wikimedia.org/T379270
[16:01:02] are the requested increases too much or too little? it feels like they're too little. need comments before we can proceed
[16:01:58] Raymond_Ndibe: both +1'd
[16:04:24] arturo: Thanksss
[16:32:15] * arturo offline
[16:45:48] found the issue with reimaging cloudvirt1063: the ipmi password was not in sync, fixed with https://wikitech.wikimedia.org/wiki/Management_Interfaces#Is_the_management_password_wrong?
[17:00:18] dcaro: are you still around? Arturo approved the quota increases but I think I missed something: the file count. image registries usually have an exceptionally high file count, and the default right now for buckets is `4096`. I have a feeling this will cause an issue. If you have the time, can you review the quota increase phabricator tasks linked above and look at the suggested values?
[17:00:33] Raymond_Ndibe: I'm here yep
[17:02:50] Raymond_Ndibe: LGTM, I think the cluster should be able to handle that number of files. it's a service that's not used very much, so we might be the first to do something like that
[17:03:44] we can do some load tests in toolsbeta first to see how it goes
[17:05:30] ok. yes, we'll start with toolsbeta. that'd help get a sense of what it feels like and tell us whether any issues will arise
[17:05:57] 👍
[17:57:35] * dcaro paged
[17:58:13] sorry, my fault :(
[17:58:22] I'm looking into it
[17:58:34] cloudvirt1063 is back in service but I think I forgot something
[17:59:06] Ack, np
[17:59:07] it's probably missing the canary. I was expecting the unset_maintenance cookbook to start it
[17:59:18] but I probably need the "ensure_canary" one
[18:00:06] Yep, the unset_maintenance one was meant for machines already set up that had been put in maintenance
[18:00:44] Maybe there should be a 'bootstrap' cookbook or similar? (That does both)
[18:00:52] bad memory from the last time I did this, tomorrow I will check that the docs are up to date
[18:01:29] canary is now running
[18:04:17] 👍
[18:04:22] thanks
[18:05:01] the alert is no longer firing
[18:05:04] sorry for the page!
[18:27:06] * dcaro off
[18:27:09] cya!
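For reference on the "Unable to establish IPMI v2 / RMCP+ session" failure above: a hypothetical helper (not an existing WMF tool or cookbook) sketching how the out-of-sync management password could be probed with ipmitool before kicking off a reimage. The hostname and credentials are placeholders.

```python
import subprocess


def ipmi_password_ok(host: str, user: str, password: str) -> bool:
    """Return True if an IPMI v2 / RMCP+ (lanplus) session can be opened."""
    result = subprocess.run(
        # "lanplus" is the IPMI v2 / RMCP+ interface named in the error;
        # "chassis power status" is a cheap read-only command to probe with.
        ["ipmitool", "-I", "lanplus", "-H", host,
         "-U", user, "-P", password, "chassis", "power", "status"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Placeholder management hostname and credentials, for illustration only.
    print(ipmi_password_ok("cloudvirt1063.mgmt.example.org", "root", "changeme"))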