[07:23:40] greetings
[07:47:25] mmhh paws is busted? https://phabricator.wikimedia.org/T410712 I'll take a look
[07:49:02] can't currently reproduce though, i.e. spawning a notebook works for me
[07:51:19] ok to go ahead and merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204369 with a reboot? no one is currently logged in on cumin2001
[07:51:52] moritzm: SGTM
[07:54:40] ok, going ahead
[08:13:05] cloudcumin2001 is back with nftables and keyholder rearmed, I'll do the other node later in the week
[08:18:59] \o/ thank you moritzm
[09:00:35] morning
[09:12:56] 2-liner MR to fix one bit for loki-tracing if anyone has time to look: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1087
[09:16:25] LGTM
[09:16:27] * godog errand
[09:16:31] thanks
[09:41:37] unrelated, there's something wrong in codfw and tofu-infra: "tofu plan" from /srv/tofu-infra fails with "S3: GetObject, https response error StatusCode: 403"
[09:41:53] is object storage broken in codfw by any chance?
[12:14:06] taavi: as "promised" in the meeting, here I am to ask about the setting for the svc endpoint for the infra-tracing loki. Did you mean to just do something in the horizon UI, or is there a specific repo for this?
[12:15:37] volans: .svc. dns records are mostly managed via tofu. I assume this will go through haproxy similar to api-gateway and such? if so it can just be a CNAME like https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/blob/main/toolsbeta/dns.tf?ref_type=heads#L152
[12:17:24] dunno, it should go through the loki nginx gateway, not the k8s api AFAIK
[12:18:28] how is that service exposed outside of k8s? it's a NodePort I assume?
[12:18:49] (note that api-gateway != k8s-api)
[12:18:58] correct
[12:19:03] nodePort: 30004
[12:19:53] does api-gateway also manage a valid certificate, or a self-signed one?
[12:20:09] self-signed
[12:20:34] ok
[12:20:54] but yeah, if it's a NodePort, then I think it should be behind haproxy as we don't want to point the DNS name to individual k8s workers
[12:21:26] sure
[12:22:19] we do tls termination twice but it's not a big deal
[12:23:10] hm? you can do haproxy in tcp mode in this case
[12:24:17] if both certs are self-signed then yes, that seems a possible optimization: self-signed for self-signed, we can do tls termination directly in nginx if that's easily achievable
[12:24:57] but otherwise we'd just have 2 tls connections (grafana - haproxy - nginx)
[12:28:55] either way works for me I guess
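To make the tcp-mode option concrete, here is a minimal sketch of the haproxy frontend/backend pair being discussed. Only the 30004 NodePort comes from the conversation above; the section names, listening port, and worker addresses are made-up placeholders, and the real template lives in puppet:

    # hypothetical section names; in tcp mode haproxy passes the TLS
    # connection through untouched, so nginx keeps doing the (self-signed)
    # TLS termination and there is only one TLS hop
    frontend loki_gateway
        # the listening port here is a guess
        bind *:443
        mode tcp
        option tcplog
        default_backend loki_gateway

    backend loki_gateway
        mode tcp
        balance roundrobin
        # placeholder addresses for the k8s workers exposing the NodePort
        server k8s-worker-1 192.0.2.10:30004 check
        server k8s-worker-2 192.0.2.11:30004 check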
[13:41:04] moritzm: when you have a few, can you take a glance at the tomcat logs on cloudidp2001-dev? Originally they complained about a key-length mismatch... after much floundering I extended oauth_session_signing_key, which resolved the key-length error, but now there are several DecryptionExceptions.
[13:41:05] Maybe when I changed oauth_session_signing_key I needed to update something that it signs?
[13:43:36] FYI I'll move forward with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1207739 and potentially others today
[13:45:29] sgtm
[14:09:05] taavi: for those impossible-to-migrate VMs, are we back to the plan of just warning users and then start/stop? Or do you have other rocks you want to look under first?
[14:10:07] andrewbogott: I don't think I even have theories on alternative ways to fix them, so I think that's our only option
[14:10:38] ok. That's where I'm at too. Although I'm worried that even after start/stop they'll still be cursed somehow :)
[14:10:57] Do you think I should only do that for the VMs that I can't migrate, or just bite the bullet and restart everything with a 1450 MTU?
[14:11:21] hm... obviously I mean stop/start; start/stop would cause other issues :D
[14:18:47] taavi: any preference between the two? does it require adding a frontend/backend section in modules/profile/templates/toolforge/k8s/haproxy/k8s-ingress-api-gateway.cfg.erb? (thanks franc.esco for giving me hints on where to look)
[14:21:45] volans: sorry, I'm not sure I understand what the other option you're talking about is. and yes, it will need a new pair of those, but in a new file instead please
[14:22:14] the two options being double tls or having haproxy just in tcp mode
[14:22:31] let's just do tcp?
[14:22:52] ack thx
[14:32:16] andrewbogott: There's a setting to get CAS to accept your key length, but that should be handled in the template
[14:33:21] Maybe we're just missing a key
[14:34:09] In the modules/apereo_cas/templates/cas.properties.erb file, there is a <%- if cas_version >= Gem::Version.new('7.1') -%> and after that some setting of key lengths
[14:35:40] Is there a reason why lengthening that key is likely the cause of the encryption exception?
[14:36:18] If you already have something that's encrypted with the "wrong" key
[14:39:29] right, but in this case it's a session key, so that encryption would only be in my browser cookies, wouldn't it?
[14:40:33] btw the key length message was:
[14:40:35] "2025-11-22 20:33:35,721 WARN [org.apereo.cas.util.function.FunctionUtils] -
And in memory on CAS, but those would clear when restarting
[14:41:00] 344 is a funny length
[14:41:21] it's 43 chars, which is how long the old key was :)
[14:42:18] But also that warning led me to believe that I can't just configure it to accept a shorter key, since it's comparing it to the length of something else
[14:42:45] to be clear, I'm not 100% sure that the current issue I'm seeing is related to the key length at all; I'm just calling it out because it's one thing I messed with
[14:43:17] Well it could be; if Puppet sees CAS 7.1+ it sets the key lengths to 256
[14:50:47] hmmmm dropping that key length to 64 (to match the actual key) does not seem to make a difference
[14:51:00] but I will try lengthening the key instead...
[14:52:52] I'm still looking. You can also comment out the keys, then CAS will autogenerate new ones when starting up, and we can compare the lengths
[14:53:21] I'm trying one more test (restarting the service) and then I'll get out of the way
[14:54:45] That's okay, I have a meeting in a few minutes
[14:54:49] me too
[14:54:58] Otherwise I'll take a look at it tomorrow morning
[14:55:17] ok, lengthening the keys to 256 got me the same errors as before + bonus issues
[14:55:26] so I'm re-enabling puppet and I'll leave this to you. Thank you for looking!
[14:56:43] Hopefully I'll have it ready for you when you log on tomorrow :-)
[14:56:51] It's probably something silly.
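A note on the arithmetic above: the 344 presumably comes from treating the 43-character string as raw bytes (43 × 8 = 344), and 43 characters is exactly a 256-bit value Base64-encoded with its trailing '=' padding stripped. Assuming the property expects a Base64-encoded 256-bit key (an assumption worth checking against whatever CAS autogenerates), a fresh one of the right shape could be produced with openssl:

    # 32 random bytes, Base64-encoded: 44 characters including one '=' pad
    openssl rand -base64 32
    # the same without the padding character, i.e. 43 characters,
    # matching the shape of the old key mentioned above
    openssl rand -base64 32 | tr -d '='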
[14:58:38] andrewbogott: do you know what could be causing object-storage to randomly fail in codfw only? https://phabricator.wikimedia.org/T410265#11400598
[14:59:29] I don't, although that sounds like https://phabricator.wikimedia.org/T394061
[14:59:56] interesting
[15:00:07] it's been quite a while since I looked at it
[15:00:11] (starting a meeting now though)
[16:51:44] what is the "maps-proxy-6" alert, and why is it tagged with team=wmcs?
[16:52:11] maps-proxy-6 is a fun special thing in project-proxy
[17:01:41] ? Does this still exist?
[17:01:55] I thought it was left without anyone to care for it and got deleted
[17:05:25] i suspect you're confusing that (a weird pair of the cloud vps inbound web proxies) with something else
[17:14:52] andrewbogott: seems like your repeated attempts to migrate project-proxy-acme-chief-03 off of cloudvirt1042 broke its networking somehow, where maps-proxy-6 could no longer properly talk to it
[17:15:10] (the issue fixed itself with a stop+start and a migration away)
[17:16:20] taavi: thanks for fixing; I'm planning to schedule reboots later today and then give up on migration attempts in the meantime.
[17:30:35] * dhinus off
[19:44:09] andrewbogott: did you verify that a hard reboot is enough to fix the MTU in normal cases?
[22:11:22] taavi: I verified that after a hard reboot they no longer match ip addr | grep "mtu 1450"
[22:12:13] the fact that only a subset of those cannot be migrated remains a frustrating mystery
[22:17:52] also it has unblocked migration for the 3-4 hosts I've tried it on
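For anyone repeating that verification, a rough spot-check in the same spirit as the grep above; the affected-vms.txt host list is hypothetical, and interface enumeration is left to ip itself:

    # hedged sketch: affected-vms.txt (hypothetical) lists one VM per line
    while read -r host; do
        # -n stops ssh from eating the loop's stdin
        if ssh -n "$host" 'ip -o link | grep -q "mtu 1450"'; then
            echo "$host: still has an interface at mtu 1450"
        fi
    done < affected-vms.txt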