[07:14:26] morning
[07:16:10] morning!
[08:17:16] good morning, and thank you for assisting yesterday in the outage
[08:17:30] it made me feel really good to have you all around
[08:44:07] on a quick search, it seems kyverno may be indeed prone to overloading the k8s api server in several ways
[08:44:16] https://github.com/kyverno/kyverno/issues/8668
[08:44:16] https://github.com/kyverno/kyverno/issues/10049
[08:44:16] https://github.com/kyverno/kyverno/issues/10308
[08:44:16] https://github.com/kyverno/kyverno/issues/9633
[08:48:33] mmm the memory for the control plane VMs was increased yesterday
[08:49:03] but I'm seeing the apiserver alone using 7GB RAM on control-7, 2.6 on control-8 and 4.5 on control-9
[08:49:05] https://usercontent.irccloud-cdn.com/file/qXTCrGLO/image.png
[08:50:22] we may need to scale the control plane both horizontally and vertically, regardless of kyverno
[09:35:38] taavi: I created T367389 and assigned to you, please let me know if you would rather have me working on it instead. Feel free to also update the ticket as you see fit
[09:35:39] T367389: toolforge: improve HAproxy and k8s apiserver interaction - https://phabricator.wikimedia.org/T367389
[09:39:55] arturo: I fixed the description and unassigned me since I doubt I'll have the time for it
[09:48:10] thanks!
[09:49:06] merged the old task into it
[11:08:48] i just got paged
[11:09:03] if you mean cloudvirt1032, that was me, sorry, already fixed
[11:09:17] ok, yeah, that was it
[11:09:47] I shut down the canary VM to prepare it for a reimage and then got confused since apparently that's still in the old double-NIC setup(?)
[11:28:14] ??
[11:43:47] dcaro: we can do the code review in the coworking space if you want!
[11:50:31] Sure, I'll be there a few min late
[11:50:48] no prob, I'll focus on something else, ping me when you are there
[12:06:06] andrewbogott: i wrote a seemingly working script to update the request_specs rows, https://phabricator.wikimedia.org/P64833
[12:06:14] planning to run that in codfw1dev in a moment
[12:08:32] arturo: I'm there now
[12:08:39] ok
[12:08:41] finishing something
[12:47:14] dcaro: minor q re https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/23: why are we having the python app do the proxying instead of using `auth_request` to make the python app to do the authentication but leave the proxying to nginx
[12:49:25] taavi: that'd be an option yes
[12:50:13] can you set headers from the auth endpoint when proxying like that?
[12:51:43] should be yes, but I don't have an example of that at the moment
[12:52:02] looking
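(Editor's note: since no example was at hand in the discussion, here is a minimal sketch of the `auth_request` pattern taavi suggests: nginx keeps doing the proxying and only sends an internal subrequest to the python app for authentication. The locations, port, upstream name and header names below are invented for illustration and are not the actual api-gateway config.)

    # hypothetical internal auth subrequest, answered by the python app
    location = /_auth {
        internal;
        proxy_pass http://127.0.0.1:8000/authenticate;  # placeholder auth endpoint
        proxy_pass_request_body off;
        proxy_set_header Content-Length "";
        proxy_set_header X-Original-URI $request_uri;
    }

    location /api/ {
        auth_request /_auth;
        # headers returned by the auth subrequest are exposed as $upstream_http_*
        # and can be copied onto the request that nginx proxies to the backend
        auth_request_set $auth_user $upstream_http_x_authenticated_user;
        proxy_set_header X-Authenticated-User $auth_user;
        proxy_pass http://backend-api;  # placeholder upstream
    }

(So the answer to the header question should be yes: `auth_request_set` can capture response headers from the auth endpoint, and `proxy_set_header` forwards them to the proxied backend.)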
[13:21:07] just added a copy of the builds-api dashboard for the jobs-api (in case anyone is curious) https://grafana-rw.wmcloud.org/d/kcAb-KUSz/jobs-api
[13:22:28] the response_time graph being exactly 1s or NaN is rather suspicious to me
[13:22:36] (also does it really need to be split per pod?)
[13:28:27] I guess not, just copy-pasted the builds-api one, feel free to change :), (or well, just say, and I'll try to change at some point, too many things at the same time)
[13:29:16] that one might be completely off though
[13:37:26] I'm playing with histograms on those stats, as we export flask stuff from jobs-api the numbers mean different things than builds-api xd, so it needs extra work yep
[13:41:28] draining cloudvirt1033, this is with 1031/1032 in the ceph aggregate but I ran the script on the to-be-drained VMs so it should be fine
[14:22:20] taavi: Cool, I've never thought of using 'from nova import objects' before! I guess that uses nova's mysqlalchemy backend to edit the db rather than accessing directly?
[14:23:32] andrewbogott: yeah. I found some threads on the upstream mailing list saying that it's some weird versioned object so modifying it manually without the nova apis might not work.
[14:24:48] versioned, like, could be formatted differently for different VMs depending on when they're created?
[14:25:03] Or versioned like, nova-manage db-sync updates it frequently?
[14:26:50] versioned like there's some version number embedded in the field that needs to be bumped
[14:30:31] yikes
[14:33:02] https://lists.openstack.org/pipermail/openstack-discuss/2021-June/thread.html#22801
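(Editor's note: as a rough illustration of the objects-API approach discussed above, not the actual P64833 script: updating a request_spec through nova's versioned-object layer rather than raw SQL means nova itself handles the version embedded in the serialized field. This is an untested sketch; the instance UUID and flavor name are placeholders, and it assumes it runs on a host with a valid /etc/nova/nova.conf.)

    # minimal sketch, assuming nova.conf and DB credentials are available locally
    import sys

    from nova import config, context, objects

    config.parse_args(sys.argv)   # load nova.conf so the DB connection is known
    objects.register_all()        # register the versioned object classes

    ctxt = context.get_admin_context()

    # placeholder values for illustration only
    instance_uuid = "00000000-0000-0000-0000-000000000000"
    spec = objects.RequestSpec.get_by_instance_uuid(ctxt, instance_uuid)
    spec.flavor = objects.Flavor.get_by_name(ctxt, "g4.cores1.ram2.disk20")
    spec.save()                   # persists via the objects layer, not hand-written SQL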
[14:41:31] Can I get a +1 for T367007 and T367266
[14:41:32] T367007: mediawiki2latex disk space - https://phabricator.wikimedia.org/T367007
[14:41:32] T367266: Add one floating ip to webperformancetest - https://phabricator.wikimedia.org/T367266
[14:41:33] ?
[14:47:11] dcaro: done
[14:47:20] thanks!
[14:50:58] topranks: I see that there's an alert for interface errors already (https://logstash.wikimedia.org/goto/2b7a484eb03e56ed14f3666d55db0d0d) but I don't find it on operations-puppet nor alerts repos (it comes from librenms, so not sure how to deal with that xd), is there a way for me to include the 'team=wmcs' tag to it?
[15:01:30] dcaro: myself and arturo were just discussing that
[15:01:47] I don't know right now, but I am going to look into it
[15:02:25] I found an interface on librenms, maybe there https://librenms.wikimedia.org/alert-rules
[15:11:07] Hello, FYI we've released spicerack v8.6.0, there are no changes that should affect cloud-related cookbooks, so feel free to skip this one if you want. Changelog at https://doc.wikimedia.org/spicerack/master/release.html#v8-6-0-2024-06-12
[15:11:23] that's ofc for the cloudcumin* hosts.
[15:12:56] volans: thanks :)
[15:37:36] * arturo offline
[15:38:45] volans: thanks for the heads up
[16:12:40] andrewbogott: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1043142/
[16:26:36] andrewbogott: also, when will buster VMs (specifically those still using g2 flavors) will be all gone from Cloud VPS?
[16:27:35] taavi: also https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1043149
[16:27:58] taavi: I don't know. As always, first buster has to be eliminated from production, then from deployment-prep, then from a few miscellaenous other places.
[16:28:15] Of course that's only 30% of the VMs, the others will involve extensive user hounding
[16:29:41] taavi: for both buster deprecation and annual project purge, I'm deep into the "wishing users would read their email" stage of grief
[16:32:20] hmm, what was the announced deadline for migrating?
[16:34:39] LTS support ends on June 30, 2024. But as always, we wind up having to wait for production upgrades since lots of the laggards are mirroring prod things
[16:34:52] I'm going to do some ticket scrubbing this afternoon.
[16:35:00] How many of those g2 VMs are left?
[16:37:44] I can probably count the number of deployment-prep instances that are actually blocked on migrating in wikiprod too with my fingers..
[16:37:48] 85 instances it seems
[16:39:57] here's what I'm planning to send to cloud-announce: https://etherpad.wikimedia.org/p/g4-flavors
[16:40:50] taavi: lgtm
[16:41:39] also now I'm migrating the VMs that accidentally got moved to 1031 to g4 flavors just to make them consistent with the system
[16:42:09] taavi: I think you have two sentences spliced together at the beginning, "except that instances - the only difference is that instances"
[16:42:12] looks good to me otherwise
[16:42:29] oops, fixed
[16:42:32] Should we just disable instance resizing by policy during the transition?
[16:42:55] It's hard to do it gracefully but easy to do it rudely by just making it fail with an rbac error
[16:47:33] hmmm
[16:48:30] I don't love the idea, but it might be a better option than having people brick their VMs
[16:49:29] taavi: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1043161
[16:49:51] If Horizon is smart that will also remove the option from the menu.
[16:50:11] yeah, let's do that
[16:51:19] want me to merge it right now, or leave for you to do later?
[16:51:34] let's do it now, since the g4 flavors are already live
[16:52:32] ok!
[16:52:56] dhinus, remember that apt warning about openstack bpos? Do you remember if that's fixed and/or if there's a ticket for it?
[16:53:01] (I mean, it sure looks fixed)
[16:53:34] * bd808 should look up what instances he needs to rebuild or kill for the buster EOL
[16:54:12] andrewbogott: I created a ticket, let me find it
[16:54:32] andrewbogott: T366028
[16:54:37] thanks!
[16:55:08] * taavi yells a few buzzwords to bd808 to make it sound easy
[16:57:18] taavi: looks to me like that removed the option from horizon
[16:59:33] andrewbogott: updated the proposed announcement in the etherpad
[16:59:36] taavi: new email text looks good
[17:00:04] ok, I'll send it out
[17:05:06] * dhinus off
[17:27:55] * dcaro off
[23:22:39] andrewbogott: y'all should probably add and remove folks from https://gitlab.wikimedia.org/groups/toolforge-repos/-/group_members as an onboarding/offboarding task for the team. Anyone in that group will automatically have the same role in all of the repos that Striker makes for tools.
[23:23:33] yep
[23:23:52] * bd808 adds dhinus
[23:33:19] I was really hoping to see a bigger drop in buster VMs after that refresh :(
[23:36:15] I think June has snuck up on many of us :/
[23:48:31] bd808: how do I issue myself a phab api token?
[23:53:55] hm, I guess I can use strikerbot's