[07:47:23] blancadesal: I see now your ping from yesterday. What's up?
[08:00:47] arturo: no longer relevant, we had some issues with the k8s upgrade on toolsbeta but that's fixed now
[08:06:01] ok!
[08:45:19] increasing the RAM on the control nodes is a simple 'resize' operation, there is usually no need to rebuild them, no?
[09:44:43] blancadesal: I think this is still relevant. I just came here to notify you that
[09:44:56] the problem is happening again, so heads up
[09:45:30] Raymond_Ndibe: did you and d.caro identify the root cause yesterday?
[09:46:22] David and I recreated a single 1.27 node and all seemed well. I just noticed all is not well actually, I'm seeing the same issues we saw yesterday https://etherpad.wikimedia.org/p/k8s-1.26-to-1.27-upgrade-issues
[09:47:29] We had some hypotheses, but given the current recurrence those are probably incorrect
[09:50:22] We deleted control-7, 8 and 9, and created control-12. 10 and 11 are running 1.26 but control-12 is on 1.27. Currently looking at the issue
[09:51:51] potentially related to controller-manager or kyverno then?
[09:56:32] I don't think it's kyverno. But it's too early to say
[09:57:11] at least the kyverno error we saw initially was solved by deleting the stale pods
[10:12:16] both controller-manager and kube-scheduler have error logs on one or more nodes
[10:16:04] blancadesal are you running the tests in a loop like yesterday? that makes things a bit confusing. Maybe we should just do things manually for now until we figure out what the problem is?
[10:16:57] Raymond_Ndibe: nope, not running anything. I thought you were xd
[10:17:41] nope. not me. At least not in a loop. I'm currently just performing `toolforge jobs list and delete` and looking at the logs
[10:26:07] what errors are you observing?
[10:32:06] what's interesting for me is that I'm not seeing any logs. Although I'm looking for something specific
[10:35:03] I'm following the logs of the three kube-api-servers and attempting to delete and list the problematic jobs, but so far the requests don't show up in the api-server logs, even with -v set to 30 (and --all-containers). Maybe I'm doing something wrong
[10:39:28] why do you think something is wrong at all?
[10:41:34] If you want to make sure the upgrade is not the cause, you can revert the hiera value saying to use the k8s 1.27 repo, create a new control node, and delete the 1.27 one (so you get 1.26 nodes all around)
[10:58:04] Something else to try is to shut off two of the controllers and see if the problem is still there with just one (if it is, controller leadership issues can be discarded)
[13:18:33] there are 2 alerts saying "puppet ca certificate is about to expire in 1d 11h", the runbook says to run the wmcs.vps.refresh_puppet_certs cookbook, can somebody confirm if that's what we need to do?
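A non-invasive way to check the controller-leadership angle raised at 10:58 above is to look at which control node currently holds the leader-election leases. This is only a sketch, assuming the default lock names used by kube-controller-manager and kube-scheduler and a kubeconfig pointed at the affected cluster:

    # which control node is the active controller-manager / scheduler right now?
    kubectl -n kube-system get lease kube-controller-manager -o jsonpath='{.spec.holderIdentity}'
    kubectl -n kube-system get lease kube-scheduler -o jsonpath='{.spec.holderIdentity}'

If the holder keeps flipping between the 1.26 and 1.27 control nodes while the jobs misbehave, leadership churn is worth digging into; if it stays put, that supports ruling it out without having to shut controllers down.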
[13:19:54] the certs are for 2 hosts in the linkwatcher project
[13:20:51] ah they are old buster instances T367536
[13:20:53] T367536: Cloud VPS "linkwatcher" project Buster deprecation - https://phabricator.wikimedia.org/T367536
[13:30:01] I pinged a project maintainer in the task
[13:30:53] I would still like to renew the certs to avoid further errors/alerts
[13:47:24] dhinus: should work yes (have not used it in a bit, but the process looks ok)
[13:48:41] essentially, remove the client cert on the client, remove the client cert on the puppetmaster, run puppet on the client (to generate a new one; it should fail saying that the cert is not accepted or similar), accept it on the puppetmaster, rerun puppet on the client again (now it should finish the run)
[13:48:56] thanks, I will have a go
[13:48:58] * dcaro off (was draining another ceph node)
[13:54:08] the cookbook failed with "The certificate retrieved from the master does not match the agent's private key" :/
[13:54:37] :/
[13:56:00] oh, I think it might be because they might be using old domains? coibot.linkwatcher.eqiad.wmflabs vs .eqiad1.wikimedia.cloud
[13:56:58] the cookbook might expect the new domain only
[14:00:40] * dcaro cleaning up cert again on puppetserver, let's see
[14:01:18] okok, it worked, it's buster so it complains, but runs ok
[14:02:20] I did: client# puppet agent --test (fails); puppetserver# puppet ca clean --certname coibot.linkwatcher.eqiad.wmflabs; client# find /var/lib/puppet/ssl -name coibot.linkwatcher.eqiad.wmflabs.pem -delete (as the first failed run says);
[14:02:38] then puppet agent --test again and it works
[14:08:48] sorry I didn't read your comments and I was trying to re-run the cookbook. I saw it did generate the certs but I thought it was not enough...
[14:09:49] it seems to me that the cookbook fails to read the new cert fingerprint from the puppet output
[14:09:52] but it's there
[14:10:05] so the cookbook bails out instead of continuing
[14:10:48] "puppet agent --test" was working fine for me, but maybe we were both trying at the same time
[14:12:33] I'm re-running your steps
[14:18:45] it worked :)
[14:19:05] I'll do the same for the other host
[14:26:46] and both alerts are gone!
[14:42:12] andrewbogott: do you know if some/any of what we do via the keystone hook on project creation could be done via some kind of templating instead, maybe a template controlled via tofu?
[14:43:19] Most of what the keystone hook does is set up ldap things that keystone would not otherwise know about.
[14:43:31] I'm not sure I know what you mean by templating in this context
[14:44:10] Certainly the management of e.g. initial quotas could be done by tofu, but that would mean accepting that things created by keystone directly are automatically wrong, which I don't love
[14:44:41] I'm reading this file modules/openstack/files/caracal/keystone/wmfkeystonehooks/wmfkeystonehooks.py
[14:44:50] and I see neutron security groups being created
[14:45:28] also, domains
[14:45:29] yeah, I think what that does is: security groups, quotas, ldap (which is used by PAM)
[14:49:05] could we replace most of this custom code with a heat template?
[14:50:52] and maybe leave the code just for the LDAP stuff
[14:52:43] We probably could! I haven't experimented with heat outside of magnum but it could surely do all that.
[14:53:11] I'm trying to decide if that's an improvement. It adds a new component, without the promise of ever removing the old one...
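Pulling dcaro's manual cert re-issue steps (13:48 and 14:02 above) together into one sequence, as a sketch using the certname from this case; note the exact CA subcommand varies between Puppet versions (newer puppetservers use `puppetserver ca clean --certname ...`):

    # on the affected VM: the run fails and points at the stale cert
    puppet agent --test
    # on the puppetserver: drop the old signed cert for that certname
    puppet ca clean --certname coibot.linkwatcher.eqiad.wmflabs
    # back on the VM: remove the stale local copy the failed run mentioned
    find /var/lib/puppet/ssl -name coibot.linkwatcher.eqiad.wmflabs.pem -delete
    # rerun the agent; in this case the second run completed cleanly
    puppet agent --test

The wmcs.vps.refresh_puppet_certs cookbook automates roughly the same dance, but here it bailed out, apparently because it could not read the new cert fingerprint from the puppet output (possibly related to the host still using the old .eqiad.wmflabs domain).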
[14:53:22] but a heat template would likely be easier to understand than the secret python stuff
[14:54:07] I'm in need of modifying the logic now, to create a new resource, and I would prefer modifying a config file over modifying python code
[14:55:15] could tofu replace the heat template?
[14:56:04] maybe
[14:56:11] in my understanding, heat was born as an openstack version of aws cloudformation, and terraform was born as a better alternative to cloudformation
[14:56:37] but there might be use cases where heat makes sense
[14:56:49] but I would like to retain the semantics of: there is some stuff created on project creation, then if the user deletes it, that's fine. Tofu may overwrite user decisions?
[14:57:32] I see your point
[14:57:56] I'm not sure heat wouldn't have the same problem
[14:58:46] I want to believe we are not solving a new problem, and I want to believe there is something upstream to solve this without custom python code :-P
[14:58:55] I like your optimism :P
[15:01:23] :-P
[15:04:49] heat does seem like the right tool for this, in theory :)
[15:06:55] I'll open a phab ticket then forget about this :-P
[15:10:07] T374253
[15:10:07] T374253: openstack: consider moving resource creation at project creation time to a templating system - https://phabricator.wikimedia.org/T374253
[15:10:26] this will get more relevant when we introduce tenant networks
[15:23:57] this is the patch for the keystone hooks that I'm developing, in case you are curious
[15:23:58] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071230
[16:39:40] * dhinus offline
[16:54:46] arturo: have anything underway that would prevent new VMs from scheduling?
[16:55:09] oh nm I see what it is
[16:55:21] oh no I don't
[16:58:00] oh yes I do. User was trying to create multiple VMs differing only in case
[18:33:10] I need to change the MX records for wmcloud.org, they still point to our deprecated exim servers
[18:33:26] should I file a ticket, or is this something I can do?
[20:03:43] andrewbogott: ^ do you happen to know the answer to the above question, most of the emails seem to be addressed to you :)
[20:07:53] jhathaway: is that https://phabricator.wikimedia.org/T352555 ?
[20:11:11] no, though spf records are related, this is for the mx records of wmcloud.org
[20:11:26] $ host wmcloud.org
[20:11:28] wmcloud.org has address 185.15.56.49
[20:11:30] wmcloud.org mail is handled by 10 mx1001.wikimedia.org.
[20:11:32] wmcloud.org mail is handled by 50 mx2001.wikimedia.org.
[20:12:03] those hosts are set to be decommed
[20:12:25] they have been replaced by mx-in{1001,2001}.wikimedia.org
[20:12:28] ah, I see.
[20:12:48] It can be changed via Horizon but probably not by you. Best to make a ticket and assign it to... um...
[20:12:53] maybe arturo or david?
[20:13:06] I'm going on sabbatical in an hour so assigning a ticket to me is not a great idea :)
[20:13:44] :)
[20:26:15] https://phabricator.wikimedia.org/T374278
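For the wmcloud.org MX change tracked in T374278, the end state can be checked from anywhere once the recordset is updated via Horizon/designate. A sketch of the before/after, assuming the priorities stay at 10/50 as today:

    # current answer (as pasted above): the soon-to-be-decommed exim hosts
    host -t MX wmcloud.org
    #   wmcloud.org mail is handled by 10 mx1001.wikimedia.org.
    #   wmcloud.org mail is handled by 50 mx2001.wikimedia.org.
    # expected answer after the change: the replacement hosts
    #   wmcloud.org mail is handled by 10 mx-in1001.wikimedia.org.
    #   wmcloud.org mail is handled by 50 mx-in2001.wikimedia.org.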