[00:34:43] was that harbor page anything?
[05:59:45] I got paged too a couple times just now
[05:59:49] looking
[05:59:52] (they went away)
[06:01:03] they seem to all be about not having data (prometheus down)
[06:08:38] for prometheus, I suspect that it's pulling too many stats from k8s
[07:02:26] dcaro: you online?
[07:02:59] yep, kinda
[07:03:08] (got paged early morning)
[07:04:03] 🌄
[07:04:24] found the issue with calico, somehow in the 3.26 upgrade I missed changing a value from 'calico-node' to 'calico-cni-plugin' in the rbac stuff
[07:04:52] what would be the best way to fix, reverting the upgrade commit and fixing, or creating a new commit on top?
[07:05:20] (poor you, hope you had enough ☕ )
[07:05:29] still brewing :,(
[07:05:51] I'd say commit on top if the other is merged
[07:08:29] https://gitlab.wikimedia.org/repos/cloud/toolforge/calico/-/merge_requests/9/diffs
[07:20:57] +1d
[07:24:34] thanks! testing now
[08:17:21] will try to deploy the fixed calico to toolsbeta shortly, just running the tests beforehand to make sure any failures are not from the patch (direct-api seems a bit flaky still)
[08:27:27] dcaro: heads up, deploying secret agent calico 0.0.9 to toolsbeta, possible chaos ahead
[08:31:09] nothing crashing, running tests now
[08:33:21] \o/
[08:35:36] dcaro: while on vacation I think I saw something like you having a cookbook for running the func tests now?
[08:35:45] yep
[08:35:46] (all tests passed btw!)
[08:35:48] and reporting back to the MR
[08:36:05] you can run from cloudcumin1001 like
[08:36:07] dcaro@cloudcumin1001:~$ test-cookbook -c 1061957 wmcs.toolforge.component.deploy --cluster-name toolsbeta --component api-gateway --run-tests
[08:36:43] it works for toolforge-deploy MRs, and for client mrs (ex. the bump version MR of builds-cli)
[08:37:53] https://www.irccloud.com/pastebin/TEHSi7ir/
[08:38:13] sudo?
[08:38:36] nope
[08:39:38] that's silly :/
[08:40:39] you can sudo -i -u dcaro
[08:40:42] (as a workaround)
[08:41:15] or run from your machine
[08:42:04] I've never been able to make cookbooks run from my machine :/
[08:42:34] that's a pity! we can do a quick session to set it up if you feel with energy for it
[08:43:45] not right now, want to move the k8s upgrade forward asap :/
[08:43:55] but 'one day' xd
[08:43:58] (current main issue is that spicerack does not work on python >=3.12)
[08:44:08] at least without hacks xd
[08:44:26] did you try the sudo trick?
[08:44:55] I did, it's running now
[08:51:17] alright. making a :coffee
[08:51:35] deploying to tools next
[09:07:27] this is awesome, thanks for the cookbook dcaro
[09:14:54] yw!
[09:20:12] dcaro: nothing is blocking the k8s upgrade now. toolsbeta tomorrow? :))
[09:21:52] 👍
[12:18:59] We have the toolforge check-in today, and the monthly, I think we can do a quick check-in and move the monthly to next week when there's more people around (I think it's just blancadesal, andrewbogott, and me)
[12:19:27] dcaro: sounds good to me
[12:20:09] is raymond ooo?
[12:21:59] yep, this week
[12:22:05] I think that the calendar did not update though
[12:23:25] 👍
[13:48:26] cteam, I'm trying to get my continuity page in order. I doubt that it will contain anything that y'all don't already know, but I'd appreciate a quick look and suggestions about things I should include. thx! https://office.wikimedia.org/wiki/User:Andrew_Bogott/continuity
[14:33:17] andrewbogott: nice, I'll give it a look, maybe we want to do a review during the team meeting also? (or a separated meeting, like we did with taavi)
[14:38:52] yep, I added a note to the etherpad
[14:38:59] awesome :)
[14:44:28] blancadesal: have you worked on striker at all? We have a small but important change coming up for the wikitech migration and I'm not sure anyone (other than the shade of bd.808) has local test/dev working
[14:48:16] andrewbogott: I have poked around a bit, but don't have any local dev env set up currently. do you want me to look into it/elaborate on what needs doing?
[14:48:54] * dcaro clocking off a bit early today, page me if anything blows up \o
[14:49:26] blancadesal: Do you have the bandwidth to work on it a bit? The thing we're doing is switching the 2fa backend validation from wikitech to bitu; it's a pretty small chance but non-trivial to test.
[14:49:39] The specifics are mostly in my head and in email so I need to get them written up in phabricator regardless :)
[14:49:46] s/chance/change/
[14:57:28] andrewbogott: if you create the Phab ticket, I can go from there and ask you questions before you ride off into the sunset
[14:57:51] great, I'll find and/or write a ticket
[14:58:16] I'll go find some bandwidth in the meantime xd
[15:12:33] blancadesal: this has at least part of the story T359554
[15:12:34] T359554: Use IDP for authentication in Striker - https://phabricator.wikimedia.org/T359554
[15:12:58] FYI I just added legoktm and lucaswerkmeister to the monthly Toolforge admin meeting invite.
[15:14:32] blancadesal, andrewbogott: the "fun" part of getting a Striker dev environment working is the Keystone service. Mine broke locally and I never managed to fix it. T359972
[15:14:33] T359972: Dev environment Keystone crashes - https://phabricator.wikimedia.org/T359972
[15:15:33] bd808: I didn't even get that far yet.
[15:16:03] An alternative is to test in production (-ish) by getting striker to work properly in codfw1dev and then using that for test/dev. That would save us the trouble of creating a local keystone.
[15:16:29] btw, bd808, while we're on the topic, can you clarify the value of striker-on-cloudvps vs striker-on-laptop vs striker-in-codfw1dev?
[15:16:39] I'm sort of hoping that you'll say that striker-on-cloudvps is no longer useful :)
[15:17:48] hmmm would striker on codfw1dev require a working toolforge in codfw1dev, or just a working ldap?
[15:18:00] * andrewbogott is full of questions, that's not even all of them
[15:20:13] striker-on-cloudvps was always for proving things worked outside of my laptop before breaking prod. codfw1dev or toolsbeta deployments might solve that problem for a new generation of maintainers.
[15:21:06] ok, makes sense
[15:21:37] striker-on-cloudvps was using local keystone/ldap, or real keystone/ldap?
[15:22:16] local ldap in a docker container
[15:22:30] ok, so same deployment as laptop
[15:22:32] it used to be local ldap in mediawiki-vagrant
[15:22:41] I think the codfw1dev and/or toolsbeta deployments could be pointed at the production gitlab. I don't think I put in a feature flag that lets the git project management be turned off, but that could be a solution too.
[15:23:12] yeah, the cloud vps project was always a variation on the local dev setup
[15:33:19] bd808: another question I have about striker (https://phabricator.wikimedia.org/T359554#10096539) is whether to swap in a different 2fa backend or just make striker a CAS consumer.
[15:35:14] The latter seems better but also well outside of my comfort zone. (Better because totally decoupled from whatever rando things wikitech or idp/idm are doing in the future)
[15:38:03] andrewbogott: I replied on task for that one. TL;DR is yes to CAS, but not until CAS has 2FA
[15:38:25] and by that I mean real 2FA that everyone can use, not the hardware token thing
[15:38:37] * andrewbogott nods
[16:43:59] ok blancadesal, the striker task in question is now T373461 (I misunderstood the meaning of the previous one which I think is broader and will be done later)
[16:43:59] T373461: Striker: use idm for 2fa validation instead of wikitech - https://phabricator.wikimedia.org/T373461
[16:44:07] and now I will stop pinging you
[16:58:23] I don't understand the point of T373461
[16:58:23] T373461: Striker: use idm for 2fa validation instead of wikitech - https://phabricator.wikimedia.org/T373461
[17:00:19] The comments on the task talk about a test deploy of idm that can do some 2FA check, but that would be some brand new 2FA system with no tokens for anyone wouldn't it?
[17:03:21] It's possible that i'm still confused!
[17:03:44] My expectation is that for now we'll just swap in a different endpoint for totp validation in striker/horizon
[17:03:56] and replacing the auth with SAS will be left for a later day
[17:04:07] (because I'm not clear that we're ready for the full SAS package yet)
[17:04:21] ok, but that validation has to actually validate something right?
[17:04:32] but the test deploy is for test/dev, when we're ready to throw the switch that will get turned on in prod idm
[17:04:57] Pointing the TOTP check somewhere that nobody has TOTP is the same as turning the check off
[17:05:58] I believe that before we throw the switch we will copy over the keys from wikitech to idm.
[17:06:21] And then there will be an awkward period when there are two backends that support the same totp keys but we'll try to keep that as short as possible.
[17:06:48] ...does that seem possible so far?
[17:06:51] IDP/IDM *must* support 2FA before we can take wikitech SUL. Until IDP/IDM are ready to handle the full 2FA lifecycle I do not see any benefit of changing client code. It doesn't make things happen faster.
[17:07:26] Coding up the change to make Striker use CAS makes much more sense to me.
[17:07:57] IDP will be enforcing 2FA for 2FA enabled accounts and Striker will never need to think about it again
[17:08:53] Striker can change to CAS auth without IDP having 2FA. That's not perfect world ideal, but it is already how gitlab, gerrit, and a lot of other Developer account usage works
[17:09:10] Horizon is probably a different mess...
[17:10:18] Ok -- so first change striker to use CAS /and/ the wikitech 2fa? And then rip out the 'and the wikitech 2fa' bit when 2fa is turned on in IDP?
[17:10:31] no, just CAS
[17:10:49] Oh, I see, and then it just won't have 2fa for a while?
[17:10:56] yeah
[17:11:03] I guess I can live with that.
[17:11:50] I think I'm confused because I feel like I keep getting a vibe from the SSO people that they're supporting totp as a temporary transitional thing and planning to 'really' implement 2fa in CAS later on. I don't really understand what that would mean or what the distinction would be (and also it seems like that would break everything)
[17:12:08] But I guess if striker just uses CAS then I don't have to care what the future of CAS is.
[17:12:34] to be very honest, we made a mess by making Horizon and Striker support 2FA when nothing else using Developer accounts does
[17:12:48] * andrewbogott nods
[17:14:09] It was fun to fix that mess a bit when I made Striker, but it was all a huge loss when SREs decided that 2FA was somehow not needed in their SSO implementation.
[17:14:58] now that SRE is taking ownership of pretty much everything about Developer accounts they need to decide on the 2FA solution and make it all work.
[17:15:38] Horizon should be figuring out how to use CAS SSO too so the pain remains in one place
[17:16:23] I'm not sure how feasible that is (horizon/CAS) but I also haven't tried yet.
[17:17:39] um... now this means we need idp/idm/CAS in the striker test/dev environment doesn't it?
[17:17:45] yup
[17:18:13] andrewbogott: noted, I'll take a look at it tomorrow and start asking Qs
[17:18:41] Probably I'll change direction another half-dozen times before then :(
[17:19:27] I web searched up https://docs.openstack.org/keystone/latest/admin/external-authentication.html but have not read it yet
[17:20:32] It would be slick to have keystone /only/ talk to CAS but that raises some questions about how the cli works. It might just work
[17:21:06] heh. that page seems to boil down to "'external' means trust REMOTE_USER"
[17:23:30] yeah, and REMOTE_DOMAIN
[17:26:49] Actually, I don't think that page says much. REMOTE_USER is how keystone tells the external auth plugin what user to auth
[17:26:55] but it doesn't really say what happens after that
[17:28:00] no, it says both things. I'm baffled.
[17:28:07] * andrewbogott will read the source after lunch
[17:53:56] REMOTE_USER is the common envvar for Apache2 or another webserver to set after validating the response to a 401 Unauthorized command. REMOTE_DOMAIN would be a custom thing allowed by the keystone plugin. I do also wonder how this would work for cli workflows, but that could be handled with application credentials I guess that one got via a web flow.
[20:10:05] I see internal docs for both cas and oidc, is there any reason for me to prefer one or the other? (oidc is the one I can find keystone docs for)
[22:37:50] I don't know of any particular reason to pick the CAS protocol over OIDC other than client support
[22:39:02] It looks like the analytics superset deploy is using OIDC per https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Superset/Administration#OIDC_login
[22:39:49] and bitu too? https://wikitech.wikimedia.org/wiki/IDM/Design#Authentication
[22:48:37] OIDC is functionally OAuth 2.0++ with the ++ part being codifying how to get profile info about the token bearer. Getting profile info is added on to many OAuth 1.0a and OAuth 2.0 providers as an ad hoc service. MediaWiki's OAuth 1.0a added this with Special:OAuth/identify and for OAuth 2.0 we added an `oauth2/resource/profile` endpoint
[22:50:51] The MediaWiki OAuth 2.0 implementation is almost OIDC, but there was something about it that doesn't quite meet the spec.
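
(Editor's sketch, not part of the log.) To illustrate the "OAuth 2.0++" remark: OIDC standardizes profile info as claims inside an ID token, which is a JWT, so a client can read identity data without calling a provider-specific endpoint. The rough Python sketch below only decodes the payload of a locally built, unsigned demo token. The issuer URL, audience, and user values are hypothetical, the helper names are invented for illustration, and a real client must verify the token's signature, issuer, and audience before trusting any claim.

```python
import base64
import json


def b64url_encode(data: bytes) -> str:
    """Base64url-encode without padding, as JWT segments are encoded."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding JWTs strip."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def make_demo_token(claims: dict) -> str:
    """Build an UNSIGNED demo token (alg "none"); for illustration only."""
    header = {"alg": "none", "typ": "JWT"}
    return ".".join([
        b64url_encode(json.dumps(header).encode("utf-8")),
        b64url_encode(json.dumps(claims).encode("utf-8")),
        "",  # empty signature segment
    ])


def read_id_token_claims(id_token: str) -> dict:
    """Return the claims of a JWT-formatted ID token WITHOUT verifying it."""
    parts = id_token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    return json.loads(b64url_decode(parts[1]))


# Hypothetical values; "iss", "sub", "aud" and "preferred_username"
# are claim names defined by OIDC.
demo_claims = {
    "iss": "https://idp.example.org",
    "sub": "demo-user",
    "aud": "striker-demo",
    "preferred_username": "Demo User",
}

if __name__ == "__main__":
    token = make_demo_token(demo_claims)
    print(read_id_token_claims(token)["preferred_username"])
```

With plain OAuth 2.0 the same information would come from an ad hoc endpoint such as the `oauth2/resource/profile` one mentioned above, and its response shape is up to each provider.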