[00:01:59] andrewbogott: hmm... maybe just one? I see the POST to /auth/login/ redirect to https://openstack.codfw1dev.wikimediacloud.org:25000/v3/auth/OS-FEDERATION/websso/openid and that redirect has "?origin=http://.../auth/websso/". Things never seem to get back to labtesthorizon for me. I get stopped when I get back to `https://openstack.codfw1dev.wikimediacloud.org:25000/v3/auth/OS-FEDERATION/websso/openid?origin=http://labtesthorizon.wikimedia.org/auth/websso/` with a 401 after visiting cloudidp-dev.wikimedia.org
[00:02:59] Do you get 'The information you have entered on this page will be sent over an insecure connection and could be read by a third party'?
[00:03:09] If I click 'yes' to that I see horizon
[00:04:59] no, I don't get that far. The place that would redirect me back to the http URL is instead telling me that I didn't get authorized by the CAS server I guess.
[00:05:44] huh
[00:06:07] * andrewbogott checks the logs
[00:06:30] labtesthorizon -> openstack.codfw1dev -> cloud-idp-dev -> openstack.codfw1dev -> 401
[00:07:01] that 401 may just be that my Developer account makes CAS mad in some places
[00:07:07] are you using bd808-labtest or BryanDavis-labtest?
[00:07:15] (whichever, can you try the other one?)
[00:08:29] I've been using BryanDavis-labtest (the cn). You think things want the shell name (uid) instead?
[00:08:41] not really, just trying everything :)
[00:08:57] I can see CAS working and pulling up your shell name in the logs
[00:09:04] which it should then be passing on to keystone
[00:09:33] I am giving keystone the 'OIDC-preferred_username' which is a weird name for it
[00:10:50] huh. mysteries
[00:10:54] ok, I bet I'm mapping the wrong value, invisible because they're the same for me.
[00:11:02] Let me map the other name, stay tuned...
[00:12:48] bd808: try now?
[00:13:37] yup. that fixed Keystone's view of me
[00:13:46] I get the insecure warning now
[00:14:01] oh good
[00:14:02] and if I click past that I'm authed to Horizon
[00:14:30] the "http" problem seems to come from the Horizon side at the start. Horizon sees the POST to /auth/login/ and issues a redirect that has the "http://..." return path in it. That bad "http" is where the warnings come from later.
[00:14:30] So that's an important bug fixed, but the behavior now is extremely ugly
[00:14:49] oh, that's useful! I'll look at horizon config...
[00:14:56] so, what code is handling /auth/login/ is my next question
[00:16:50] from horizon local_settings:
[00:16:55] https://www.irccloud.com/pastebin/F8q6FNLo/
[00:17:06] that suggests that it's already http by the time it gets there...
[00:19:20] This makes me think that Horizon doesn't know that there is external TLS termination. It may not normally make a difference, as Horizon may normally never generate a fully qualified URL. It may normally only make full-path URLs.
[00:20:06] OK, so I need to set X-Forwarded-Proto in Horizon's apache config?
[00:21:52] yeah, that would be a solid start. Apache should tell Horizon that there is TLS upstream with a "X-Forwarded-Proto: https" header injected into the request
[00:22:32] will that break local test/dev? Do I need to make it conditional?
[00:23:37] Apache should not send that header unless it is true (I think that's what you are asking?)
[00:24:48] The django config for Horizon may also need something like `SECURE_PROXY_SSL_HEADER = ("HTTP_X_FORWARDED_PROTO", "https")` if it is not already there -- https://docs.djangoproject.com/en/5.1/ref/settings/#secure-proxy-ssl-header
[00:25:07] right now it has
[00:25:12] Header always merge Vary X-Forwarded-Proto env=ProtoRedirect
[00:26:42] andrewbogott: ah. is there a rewriterule just before that looks something like `RewriteRule ^/(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,E=ProtoRedirect]`?
[00:27:03] yes
[00:27:05] https://www.irccloud.com/pastebin/fNpV4QBu/
[00:28:28] :nod: that's the magic "redirect to https and tell the proxied host we did that" config that Ori came up with like 10 years ago :)
[00:29:11] So now the thing you likely need is the Django config to trust that header when it is set. That's the `SECURE_PROXY_SSL_HEADER = ("HTTP_X_FORWARDED_PROTO", "https")` setting
[00:30:04] that should be safe to add in config that is used both with and without the header. it should just tell Django to make https full URLs when it makes full URLs
[00:30:28] but will only make them https when the header is present telling it to
[00:31:45] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/commit/9d4611d917fc10946b6596100c109e9019f3f70e
[00:32:09] So I should ditch the first line of that patch?
[00:32:16] no, that goes in the Django config for Horizon, not Apache
[00:32:23] ah! Well that's much easier
[00:33:22] (it was there already but commented out)
[00:37:47] actually... the commented-out bit was 'PROTOCOL' and your suggestion was 'PROTO'
[00:37:48] suspicious
[00:38:22] I'm not sure I'm clearing all the right caches but I think it's working right now!
[00:39:01] oh, I know how to check for sure
[00:39:18] yes indeed, it's fixed. Thank you!
[00:39:27] Let me tidy up a bit so you can get the full experience :)
[00:45:58] I did a logout/login cycle and I ended up authed again without the insecure warning. :)
[00:46:37] nice
[00:47:06] I need it to skip that horizon panel and go straight to sso... but that may have to wait until tomorrow
[00:47:28] Thank you for the fix! I'm not sure I ever would've found that django setting
[00:50:20] hm, would you expect the horizon 'log out' menu to log out of CAS? Otherwise you can just pop back in using the cached SSO
[00:51:10] I would be very surprised by a CAS client's logout actually logging me out at the CAS server
[00:51:19] like mad as hell surprised ;)
[00:51:42] great, then I won't implement that
[00:53:40] I suspected the Django config issue as soon as I saw that return url in the redirect. If I hadn't been distracted by a video call with RobLa I probably would have gotten you working faster. Glad it's working now though. :)
[00:56:34] Time for watermelon!
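As an aside, here is a minimal sketch of the Django side of the fix discussed above. It assumes the layout described in the pastebins (TLS terminated at Apache, which injects the forwarded-proto header); it is illustrative only, not the exact contents of the deployed local_settings.py.

```python
# local_settings.py (Horizon / Django) -- illustrative sketch only.
# Assumes Apache sits in front, terminates TLS, and injects an
# "X-Forwarded-Proto: https" request header (e.g. via the RewriteRule with
# E=ProtoRedirect quoted above plus a matching Header/RequestHeader directive).
#
# With this set, Django treats such requests as secure and emits https://
# whenever it builds a fully qualified URL (like the websso return path).
# It is harmless in setups where the header is never sent.
SECURE_PROXY_SSL_HEADER = ("HTTP_X_FORWARDED_PROTO", "https")
```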
[00:58:36] * bd808 wanders off towards the smell of food too
[07:36:14] good morning
[07:36:18] the alert
[07:36:39] Toolforge Kyverno low policy resources
[07:36:57] has been flapping overnight, and that's a bit concerning
[07:37:06] will do a bit of research later
[07:38:00] I'm looking into it
[07:38:08] might be related to the failed helm upgrade
[07:39:54] T373972
[07:39:55] T373972: [infra,k8s,kyverno] Toolforge Kyverno low policy resources tools - https://phabricator.wikimedia.org/T373972
[08:01:24] got a meeting now
[08:02:02] arturo: feel free to continue investigating, I've added what I've found so far in the task, it seems like leadership is lost due to timeout getting the lock of the secret (and seems to affect kube-controller-manager too)
[08:02:25] might be a result of the 1.26 k8s upgrade
[08:10:26] ok, thanks
[08:13:05] yes, it might be a problem with the api-server itself
[08:16:04] ceph also flapped, and cloudlb haproxy
[08:16:15] I wonder if there was some kind of network blip somehow
[08:16:37] apiserver complains about etcd (that does not use ceph), network makes sense
[08:17:29] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast
[08:18:40] ^^^ this happened overnight too
[08:19:07] resolved itself a few moments later
[08:19:20] I think it supports the idea of a network problem
[08:19:44] is that the switch that crashed the last time?
[08:20:09] etcd does not use ceph, but still can be affected by a switch going down
[08:20:35] did a crash happen while I was on PTO?
[08:23:12] the ceph one, was caused by a switch misbehaving and flapping interfaces
[08:23:26] that included bgp sessions flapping
[08:23:27] ok
[08:24:18] it was d5, c8 is the next one that needs reboot
[08:24:18] https://phabricator.wikimedia.org/T371878
[08:26:34] I see
[08:32:43] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad
[08:32:48] ^^^ this might be related too
[08:33:58] topranks: are you available to spot check the health of cloudsw1-c8-eqiad?
[08:51:56] dhinus: good catch, ssh might have been affected too
[08:51:58] (and anything really)
[08:54:52] is cathal on PTO maybe?
[08:55:46] arturo: not according to gcal
[08:57:32] arturo: hey no I am here
[08:57:49] just catching up
[08:58:06] hey o/
[08:58:59] looks like something may have happened on the switch ~3hrs ago
[08:59:03] https://www.irccloud.com/pastebin/zHRLCXPj/
[09:02:51] and indeed it seems to be not unlike the issues we had with d5
[09:03:03] related to bfd timing out to servers
[09:05:31] however the key difference is it happened several times over the period from 05:58 utc through to 06:08
[09:05:52] not constantly flapping like when we had issues with d5, nor is it ongoing now
[09:06:47] what's the current status in terms of ceph or other apps?
[09:17:14] I'm in a meeting, back in 40 minutes
[09:18:42] ok
[09:21:05] the TL;DR is right now things "look" stable
[09:22:28] I still see the occasional log coming in about a timeout but bfd/bgp/lacp sessions are all stable since 06:10 UTC
[09:22:37] traffic levels on the switch are normal for this time
[09:22:57] but the previous instability is a big concern, and I can't guarantee there is no problem right now
[09:28:30] topranks: there were a few more blips, do they have a match in the switch? https://phabricator.wikimedia.org/F57453859
[09:29:24] dcaro: I get "access denied" with that link?
[09:29:32] hmm
[09:33:27] https://www.irccloud.com/pastebin/2OVqt4Wa/
[09:34:32] dcaro: either way I'd say it's more likely they were caused by the switch, or just knock-on effects from the instability during the window I mention
[09:34:53] in other words I'm not confident about it - even if it's stable right now
[09:36:29] try now
[09:38:46] * dcaro in a meeting sorry
[09:41:00] topranks: How do you think we can resolve the instability in the switch? is this something like a part we can replace? or the whole unit, or what?
[09:41:08] (I'm out of the meeting now)
[09:41:27] I suspect probably a reboot like with d5
[09:41:37] powering it down fully as we did then
[09:42:44] ok
[09:43:16] we should prefer a scheduled outage over an unscheduled one
[09:43:38] indeed yeah, so I guess much of the decision around that is going to be in terms of what is the current impact
[09:44:37] do you know if we have spare switches that we could set up in a rather short term to enable another rack for our servers?
[09:46:45] we have no "other rack"
[09:47:13] there are spare qfx5100's we could replace this one with, yep
[09:47:45] that would take a similar amount of time (maybe more) than rebooting, right?
[09:48:01] in terms of draining ceph, no?
[09:48:42] oh, you mean replacing would take the same time as rebooting
[09:48:46] (or more)
[09:48:48] yes, I agree
[09:49:38] but also, maybe we can have another push at getting another rack, or half rack, or whatever. Another switch unit, basically
[09:49:43] if we can have both switches at the same time during the migration, we can move the ceph hosts one-by-one and not need draining
[09:49:45] yeah moving all the servers and uplinks to the new switch would definitely take longer than the reboot
[09:51:01] not sure, but currently draining the whole rack might not be doable (we are still provisioning the new nodes, but have to check the current distribution and data usage)
[09:51:23] but have to check
[09:52:23] the rack: https://netbox.wikimedia.org/dcim/racks/24/
[09:53:14] https://www.irccloud.com/pastebin/dsYYUQ8I/
[09:54:04] c8 currently has 153 TB, being the second biggest; when draining we will be limited to the smallest one left, that's 111 TB * 3 in total (with 3 replicas that leaves ~111 TB usable)
[09:55:16] we are ~50% now though, so probably ok
[09:57:30] I just created this T373986
[09:57:31] T373986: cloudsw1-c8-eqiad is unstable - https://phabricator.wikimedia.org/T373986
[10:24:17] blancadesal: google said that the recording will be saved in your drive
[10:30:45] dcaro: ok, do you know if I need to do something, or is it automatically available to all meeting attendees?
[10:31:13] no idea xd
[10:31:31] np, I'll check it out after lunch
[10:31:52] searching I don't find it though
[10:32:10] (I don't know what the name is either xd, searched for the meeting name)
[10:35:17] arturo: I added some extra detail to the task description thanks for opening
[10:35:46] thanks topranks. Will you be able to make it to the network sync meeting later?
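Referring back to the 09:54 capacity estimate: a rough restatement of that arithmetic, using only the figures quoted in the chat (the per-rack breakdown from the pastebin is not reproduced here, and the one-replica-per-rack assumption is mine).

```python
# Back-of-the-envelope check of the C8 drain headroom, as sketched above.
# Assumption (not stated explicitly in the chat): 3x replication with
# rack-level failure domains, so with C8 drained the pool is effectively
# capped by the smallest remaining rack.
smallest_remaining_rack_tb = 111                   # figure quoted in the chat
replicas = 3
raw_tb = smallest_remaining_rack_tb * replicas     # "111 TB * 3 in total" -> ~333 TB raw
usable_tb = raw_tb / replicas                      # "~111 TB usable"
print(f"~{usable_tb:.0f} TB usable while C8 is drained")
# The chat notes the cluster is at roughly 50% utilisation, hence "probably ok".
```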
[10:37:24] yep planning to
[10:37:29] great
[10:42:09] dcaro: iirc, once the video is ready, I'll get an email about it, and I _think_ it gets added to the meeting in the calendar as well
[10:45:07] 👍
[12:10:13] this is awesome https://relabeler.promlabs.com/
[12:10:18] prometheus relabeler testing tool
[12:13:21] neat
[12:17:16] rabbitmq @ codfw1dev seems down, I'm restarting it
[12:34:20] I'm deleting all nova-fullstack VMs, I suspect they failed because of the network problems
[12:36:47] dcaro: network meeting?
[12:37:01] oh yes, sorry
[13:16:11] Raymond_Ndibe, blancadesal: I'm in another meeting, I'll be in the upgrade meet in a bit
[13:16:54] Yeaa I figured. the network sync meeting
[13:17:55] next step is to merge and deploy maintain_kubeusers (oh, I see it's merged xd https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/61)
[13:18:08] and then plan when to upgrade toolsbeta k8s
[13:18:18] (once the lima-kilo setup is clear it's working)
[13:19:06] currently running functional test for that
[13:21:48] thanks :)
[14:30:41] andrewbogott: could you please take a look at this when you have a moment? T374002
[14:30:41] T374002: codfw1dev: rabbitmq is not working because some auth failures - https://phabricator.wikimedia.org/T374002
[15:04:05] dcaro: I'm cooking a tofu patch
[15:04:13] there was a patch you wanted to test?
[15:04:30] https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1070020
[15:04:46] ok
[15:05:26] thanks
[15:06:10] andrewbogott: slyngs have you added something to the idp manifest? I think there might be a hiera value missing in cloud.yaml (cloudinfra-idp-1 puppet is failing, have not checked yet)
[15:06:28] Sorry, YES
[15:07:03] profile::idp::memcached_install: true
[15:07:26] I'll add it in a bit, working on another patch right now xd
[15:07:49] Like this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070203/20/hieradata/common/profile/idp.yaml#49
[15:08:02] We need both, sorry
[15:08:23] profile::idp::oidc_issuers_pattern: 'a^' right?
[15:08:50] (as in that one also needs adding to clouds.yaml)
[15:09:47] dhinus: andrewbogott arturo quick review? (adding cloudcephmon1005 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070621)
[15:10:46] what happened to 4?
[15:11:21] it's in C8, and that's the rack we want to drain
[15:11:39] (that currently has 2 mons in it, so we need to add a mon outside of it, so when the drain happens, we still have 2 mons around)
[15:18:16] dcaro: with the patch, the tofu plan no longer shows up in stdout in the terminal, only in the gitlab MR?
[15:18:59] maybe? that might be the spicerack config settings, looking
[15:19:21] before the patch, I would see both
[15:19:22] was it showing intentionally (print...) or just because it showed the spicerack output progress and such?
[15:19:39] I think it was part of the cumin params settings
[15:20:47] ack, easy fix, give me a minute
[15:22:24] there is a gitlab bug that prevents proper markdown rendering inside entries -_-
[15:22:33] (in MR notes)
[15:23:09] yep, I noticed :(
[15:23:22] fix sent :)
[15:29:39] thanks, better now
[15:30:05] wait, no colors? the output used to have colors
[15:33:12] dcaro: I think we need to swap them, left a note in the patch
[15:42:06] hmm... I'm getting a weird error when running puppet on cloudcephmon1005
[15:42:06] Error: The CRL issued by 'CN=Wikimedia_Internal_Root_CA,OU=Cloud Services,O=Wikimedia Foundation\, Inc,L=San Francisco,ST=California,C=US' has expired, verify time is synchronized
[15:42:17] has anyone seen that before?
[15:42:39] the date on the host looks correct
[15:43:09] dcaro: that's a puppet5 vs puppet7 disagreement
[15:43:18] ooooohhhh
[15:43:19] probably the host was imaged with 7 but the puppet role expects 5
[15:43:27] you can just reset all the certs and it'll behave
[15:44:06] Could I get someone other than me to log in to https://labtesthorizon.wikimedia.org, confirm that it works for n>1 user?
[15:44:38] ×
[15:44:38] Your account is not recognized and cannot login at this time.
[15:44:58] what user should I use?
[15:45:00] using codfw1dev creds and not eqiad1 creds?
[15:45:03] I'm inside, but I had the cookie
[15:45:19] Should be the same creds as always for labtesthorizon
[15:45:32] arturo: interesting, I'm surprised that worked :) But not displeased
[15:46:01] want me to log off/on?¿
[15:46:21] arturo: you can try although logging out sort of doesn't work now because CAS caches things
[15:46:46] (as it should, but I probably need to redirect the logout behavior somehow)
[15:47:17] dcaro: oh, also, use your full name rather than your shell name (I think)
[15:47:40] andrewbogott: signed out, got redirected to an IDP page, put my creds back in, clicked login, then got into horizon again!
[15:48:06] that sounds pretty good :)
[15:48:08] andrewbogott: I did not :/
[15:48:13] what credentials should I use?
[15:48:24] dcaro: your codfw1dev creds, but the full name rather than the shell name
[15:48:33] ooohhh, okok, let me try those
[15:48:49] (it's vaguely possible that both names work but I'm not sure because mine are the same)
[15:49:11] got in :)
[15:49:56] andrewbogott: how do you clean the certs? I'm getting now 'Error: Could not download CRLs: Bad Request'
[15:50:33] I think just rm -rf /var/lib/puppet/ssl/
[15:50:42] but you could also just totally wipe out and reinstall puppet :)
[15:51:13] oh, I downgraded puppet (was 7, to 5), now the error is different
[15:51:18] Exiting; no certificate found and waitforcert is disabled
[15:51:35] you also need to manually sign on the puppetmaster
[15:51:35] andrewbogott: I'm about to go offline for today. I sent a couple of tickets your way: T374023 and T374002
[15:51:36] T374023: openstack: eqiad1: designate is maybe not working as expected (2024-09-04) - https://phabricator.wikimedia.org/T374023
[15:51:36] T374002: codfw1dev: rabbitmq is not working because some auth failures - https://phabricator.wikimedia.org/T374002
[15:51:44] arturo: ok, will look!
[15:51:49] thanks!
[15:51:57] I'm seeing the rabbit/codfw1dev thing too, will probably do a total reset there
[15:52:58] ack, the signing on the master did it
[15:53:01] * arturo offline
[15:53:03] cool
[15:53:15] It's kind of a mess, we should just finish off puppet5
[15:53:23] But jbond left in the middle, I'm not sure if anyone else is tasked with that
[15:53:53] agree
[15:59:32] we got 4 mons! \o/
[15:59:33] mon: 4 daemons, quorum cloudcephmon1003,cloudcephmon1002,cloudcephmon1001,cloudcephmon1005 (age 2s)
[16:13:31] andrewbogott: login at https://labtesthorizon.wikimedia.org worked for me too, with my existing labtesthorizon creds
[16:13:52] great, thank you for testing
[16:17:53] I would like to add some obscure openstack resources to tofu, can someone remind me where the repo is for that?
[16:18:13] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra
[16:18:52] thank you!
[16:19:09] Will this probably work for arbitrary/new resource types? Or is there overhead?
[16:19:51] so far we've split different resource types in diff folders, with some small overhead per-folder
[16:20:26] ok
[16:20:40] the overhead is creating a new folder, then including the folder as a module in "main.tf"
[16:20:59] That's not a lot! I will see what I can do
[16:25:28] andrewbogott: I'm going to start draining C8 (so we can reboot the misbehaving switch there), what's left of the new osds to do?
[16:25:48] dcaro: these three: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1063892
[16:28:37] hmm, all on E4?
[16:29:10] they are replacements, that's ok
[16:29:34] dcaro: "'Error: Could not download CRLs: Bad Request'" - this error is, I believe, because you need to add `certificate_revocation = leaf` to puppet.conf (check a working system for the exact config)
[16:30:01] dhinus: actually it looks to me like there's no provider for the resources I want to create (mapping, identity provider, federation protocol), do you agree?
[16:30:05] jbond: thanks! I resolved it downgrading to puppet5 though xd
[16:30:15] ack
[16:30:39] * andrewbogott looking at https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/identity_service_v3
[16:31:33] andrewbogott: do they have APIs? I would find the API first then search for the endpoint name in https://github.com/terraform-provider-openstack/terraform-provider-openstack
[16:31:59] registry.terraform.io also works, but github search is usually better
[16:35:39] federation seems to be part of the "Identity API Extensions" https://docs.openstack.org/api-ref/identity/v3-ext/
[16:35:46] and that is indeed missing from the terraform provider
[16:36:15] I'm surprised they don't even have an issue for it, you could try opening one
[16:39:18] For now I guess I'll write a wikitech page saying how I did it :/
[16:39:27] Or, better yet, a code comment
[16:42:00] ok at least it's implemented in the underlying go sdk https://github.com/gophercloud/gophercloud/issues/2461
[16:42:47] so it should not be toooo hard to add it to the tf provider, if you want a go project for your sabbatical :D
[16:43:24] :nerdsnipe:
[16:43:41] wikitech/comments are fine in the meantime, did you configure things in Horizon or through CLI/API?
[16:43:51] CLI
[16:44:34] oh... just realized that andrewbogott is going on sabbatical, so he can't help with the drain of the rack xd, that means it's going to take even a bit longer than expected
[16:45:36] having the list of CLI commands you used and some small comments would be great
[16:45:45] the cookbook takes /almost/ all night anyway
[16:52:11] dhinus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070643
[16:52:57] I think I found the issue with prometheus, I should be using `metric_relabel_configs` not `relabel_configs`; though both exist, the latter only applies "before" scraping, so it does not have the `__name__` of the metrics yet
[16:54:48] * dcaro 🤦
[16:55:19] that might seem obvious to you but it is not
[16:57:12] andrewbogott: +1d
[16:57:13] it is not, yep - the metric_* one applies to metrics gathered, the other one applies to the target (the scraping endpoint) beforehand, and it's bad naming, I had not noticed that they were different xd (if they were named `target_relabel_config` and `metrics_relabel_config` it would be clearer)
[16:57:32] thanks
[16:59:39] so I have spent a few days trying different weird configurations to try to make it work (regexes, label combinations, ...)
[17:02:47] I'll call that a win, and clock off for today :)
[17:07:08] andrewbogott: any idea why cloudceph1017 is marked as 'failed' in netbox? (just noticed) https://netbox.wikimedia.org/dcim/devices/3138/
[17:07:23] dcaro: no idea, sorry
[17:07:38] it's pooled isn't it?
[17:10:29] yep, online and going
[17:10:45] I'll change it to ok
[17:11:18] I changed it to failed at some point xd
[17:11:59] https://phabricator.wikimedia.org/T358945
[17:16:59] dhinus: if still here would you proofread https://etherpad.wikimedia.org/p/horizonsso for me? thx
[17:21:13] now I'm really going, cya!
[17:21:50] * andrewbogott waves
[17:23:33] andrewbogott: I did a few small tweaks to the etherpad, feel free to adjust
[17:23:46] thanks. Makes sense overall?
[17:23:53] yup, looks good!
[17:24:08] I wonder if it's worth mentioning that wikitech login is NOT changing for now, but will change later
[17:24:13] but maybe it's too much info
[17:24:29] but some people might remember the previous announcements about wikitech SUL
[17:24:57] the mention you made in the first paragraph is probably enough
[17:25:16] I'll add something at the end too.
[17:25:18] thank you!
[17:26:27] yw
[17:31:21] * dhinus offline
[23:40:22] I just rolled out an implementation for T342848. The count of toolforge-python311-sssd-base usage went from 81 running pods to 692 running pods + cronjobs. I guess we have some folks using python 3.11 ;)
[23:40:23] T342848: k8s-status: List inactive cron jobs as image users - https://phabricator.wikimedia.org/T342848
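Going back to the relabeling exchange from 16:52-16:59: a toy illustration (plain Python, not Prometheus code, with made-up label values) of why a rule keyed on `__name__` only has an effect in `metric_relabel_configs` - at target-relabel time no samples exist yet, so `__name__` simply is not there.

```python
import re

def rule_matches(source_label: str, regex: str, labels: dict) -> bool:
    """Toy model of a single relabel rule: regex-match one source label."""
    return re.fullmatch(regex, labels.get(source_label, "")) is not None

# relabel_configs runs against the *target* labels, before anything is scraped:
target_labels = {"job": "ceph", "instance": "cloudcephmon1005:9283"}  # made-up values

# metric_relabel_configs runs against each scraped sample, which carries __name__:
sample_labels = {"__name__": "ceph_osd_up", "instance": "cloudcephmon1005:9283"}

# A hypothetical keep/drop rule keyed on the metric name:
print(rule_matches("__name__", "ceph_.*", target_labels))  # False: no __name__ yet
print(rule_matches("__name__", "ceph_.*", sample_labels))  # True: visible here
```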