[05:07:46] taav.i: I rebuilt paws and quarry magnum clusters with g4 flavors and they're working fine. I used the migration script to move superset and one of the clusters (test-superset.wmcloud.org) moved over fine but the main cluster died. I've been trying to revive it for a while but tofu seems unable to build a working cluster... probably I'm missing a step. (I'm actually not clear on whether superset has users or not.)
[05:08:03] I need sleep and am off tomorrow but I'll check in in the morning regardless.
[07:43:58] FYI, I'll be creating a decision request with the proposal to change course in the PSP migration project. I think we should try next with a custom admission controller, rather than with a policy agent
[07:54:53] damn kyverno :/
[08:02:26] this is the latest https://phabricator.wikimedia.org/T367386#9904103
[08:40:44] do you know why it crashes?
[08:47:37] I think there are a number of factors at play
[08:49:07] once all the policies are installed, the api-server forwards requests to kyverno so it can validate them. There is a timeout for this, and kyverno can't respond in time, causing the api-server to deny the request
[08:49:59] kyverno also scans all resource objects of the cluster in the background to make sure they comply with the policies. This also puts a huge load on the k8s control plane
[08:50:36] I tried with resource limits for kyverno, but it only results in the kyverno pods crashing and getting OOMkilled, which introduces yet more instability to the whole process
[08:50:49] ugh
[08:51:23] in my opinion, this is strong evidence that this is not the right architecture for what we want to do
[08:51:29] I created T367950
[08:51:30] T367950: Decision Request - Toolforge pod security via custom admission webhook - https://phabricator.wikimedia.org/T367950
[09:16:51] arturo: this sounds like we should cancel the operation window for tomorrow?
[09:20:41] taavi: yes
[09:42:29] taavi: would you like to approve https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047113 ? I think we can deploy now
[09:43:40] looking
[09:44:10] did you test in toolsbeta already?
[09:44:15] yes
[09:44:30] and I plan to roll out to toolsbeta first anyway
[09:45:23] ok, +1'd
[09:45:59] thanks
[09:54:20] I don't remember how to rebase the git repo on the project-local puppetmasters
[09:54:25] git-sync-upstream doesn't work
[09:56:02] there is a systemd timer that you can also trigger manually
[09:57:03] I guess puppet-git-sync-upstream.timer
[10:04:26] cloudlb1001/1002 haproxy alert
[10:07:23] ok, went away, it was something about a recent restart
[10:07:32] maybe it was an automated package update or something
[10:09:04] yeah, it was a reload
[10:09:06] https://www.irccloud.com/pastebin/GJhndySR/
[10:37:01] Raymond_Ndibe: please have a look at https://phabricator.wikimedia.org/T367961 when you have a moment
[11:33:19] I'm migrating some more hypervisors to OVS
[11:36:56] ack
[11:49:33] andrew.bogott: superset does have users. Looks like they are opening tickets
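A minimal sketch of the manual sync mentioned at 09:54-09:57, assuming the unit really is named puppet-git-sync-upstream as guessed in that exchange; the exact unit name may differ on a given project-local puppetmaster.

```bash
# On the project-local puppetmaster: confirm the sync timer exists (name assumed),
# run the underlying service once by hand, then check what it did.
sudo systemctl list-timers 'puppet-git-sync-upstream*'
sudo systemctl start puppet-git-sync-upstream.service
sudo journalctl -u puppet-git-sync-upstream.service --since '15 min ago'
```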
[12:35:03] review requested: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/45
[12:55:46] arturo: FYI T367971
[12:55:46] T367971: hw troubleshooting: cloudvirt1042, cloudvirt1043 fails to boot after a reimage - https://phabricator.wikimedia.org/T367971
[12:55:57] :-(
[13:25:33] hmm "openstack coe cluster config quarry-124" fails with a 404 error, even though I can see the cluster with "coe cluster list"
[13:25:53] you probably need to specify the project or use the cluster id instead of the name
[13:26:05] I tried both ID and name, but not the project
[13:26:50] setting OS_PROJECT_ID fixed it
[14:02:01] good news: the nic issue seems to be limited to 1042/3; for example 1044, which is in the same batch, reimaged just fine
[14:02:24] also I managed to install the firmware upgrade by myself, so now waiting for the new reimage to complete to see if that fixed the issue
[14:27:34] jobo, slyngs, there's still quite a bit of technical discussion on https://docs.google.com/document/d/1kIQ9W2gyYnUEqHhOkP51HQ2pTdwFWLkCHN4E9GkyvkQ/edit that needs a response from either slyngs or a yet-to-be-selected mediawiki expert. Let's try to keep that work going rather than pausing everything waiting for a meeting :)
[15:39:20] dhinus: I just saw in the kyverno slack channel a meeting invite for
[15:39:37] a new kyverno feature to address the performance problems
[15:39:41] in particular:
[15:39:42] Managing policy and governance in busy Kubernetes clusters was difficult due to the high volume of policy reports, cluster policy reports, and ephemeral reports generated by Kyverno. This caused overloading of the API server and etcd, leading to poor cluster performance. Kyverno's new Reports Server addresses this issue by offloading these reports to a separate database, resulting in a 70% reduction in etcd consumption. Attend this
[15:39:42] session to discover how the Kyverno team tackled this complex problem using API Aggregation and the advantages of storing reports in a dedicated database.
[15:39:50] heh
[15:40:08] too bad that we are limited in the kyverno version we can use for our migration -_-
[15:45:22] ha! that's exactly our problem, isn't it?
[15:45:43] yeah
[15:46:18] it only mentions reports, though, and I think you said reports are just one part of the issue?
[15:48:04] I'm not familiar with the kyverno internals, but hammering etcd with all kinds of ephemeral objects seems like the right recipe to cause an outage
[15:51:08] "70% reduction" looks like a good start, at least :)
[16:05:40] * arturo offline
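A minimal sketch of the Magnum workaround from the 13:25-13:26 exchange, assuming the 404 came from the client being scoped to a project other than the one owning the cluster; the project ID below is a placeholder.

```bash
# The cluster shows up in the list, but "coe cluster config" 404s until the
# client is scoped to the project that owns it.
openstack coe cluster list

# Scope to the owning project (placeholder ID), then fetch the kubeconfig.
export OS_PROJECT_ID=<id-of-project-owning-quarry-124>
openstack coe cluster config quarry-124
```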