[07:35:55] paws seems to be down, looking
[07:57:56] morning. can I help?
[07:58:36] maybe :), I'm struggling a bit trying to find the openstack auth it's using, and validating it
[08:01:45] my best guess is that it's an appcred for someone's personal account (instead of a dedicated service account, which it should use)
[08:03:56] there's a hint that it's using rook for something (user rook not found), but I think that one has been failing for a while (I saw those error logs in the openstack logstash before, when paws was still working)
[08:04:33] that's from the controller manager pods; for hub itself I think it might be using something different, but maybe it just expired or something
[08:08:04] I don't know much about the PAWS setup, what is it using openstack auth for? magnum?
[08:08:47] cinder volumes as PVCs
[08:08:56] (I think, it's failing on the mount)
[08:09:12] yeah, that's all managed by magnum iirc
[08:10:02] tbh, if it's all authenticated via r.ook's now-disabled account, I suspect we have no option other than to migrate the credentials to some other account (preferably a new dedicated service account) and re-deploy the cluster
[08:11:29] The creds are used by tofu to deploy the cluster, then by cinder to deploy PVCs. They'll need to be updated in both places in the config
[08:15:38] btw. started T398912
[08:15:38] T398912: [2025-07-08] PAWS down - https://phabricator.wikimedia.org/T398912
[08:28:41] hmpf... it stopped complaining about auth now (I did not touch anything)
[08:29:24] now it's complaining about multiattach not being allowed
[08:30:13] hmmm.... there's no csi-controller-plugin pod anymore though
[08:48:55] okok... finally figured out how to test the credentials
[08:49:21] anyhow... taavi how do I create a service account?
[08:49:52] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Service_accounts
[08:50:26] xd, it's two pages down in the search results
[08:53:13] do we have a paws@... email?
[08:55:15] I'll use root+paws@wmcloud.org
[08:56:04] does that mail server support plus aliases?
[08:56:36] fwiw you'll want to name that account something other than just 'paws' due to T397651, so something like 'paws-deploy' or so
[08:56:37] T397651: CAS not letting new Toolsbeta-logging developer account log in - https://phabricator.wikimedia.org/T397651
[08:56:38] I was expecting it to :/
[08:56:52] I set it to `paws-infra`
[08:57:05] yeah that works
[08:57:56] got the email, so the + works :)
[09:27:23] I'm getting a bit confused... I created the paws-infra account in bitu, logged out from idp.wikimedia.org, and logged in with that account, and it shows ok, but then horizon logs me in as me
[09:29:15] a private browser window is your friend here
[09:29:20] hmm... I think maybe keystone does not really log out
[09:29:21] yep
[09:33:50] okok, nice, got the right access, let's create an application credential for it
[09:34:33] there's that one checkbox on the appcred creation page (that I can't remember the name of) that you need to enable for magnum to work
[09:35:37] oh, like the "give all access" one?
[09:35:59] "Unrestricted (dangerous)"
[09:36:20] yep
[09:36:21] okok
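(For reference, a hedged CLI equivalent of the Horizon steps above: creating the appcred with the "Unrestricted (dangerous)" box ticked, with the roles it turns out to need per the later discussion, then verifying it can authenticate. The credential name and the keystone auth URL are assumptions, not values from the log.)

```bash
# run while authenticated as the paws-infra account
openstack application credential create paws-infra-deploy \
    --role member --role reader --unrestricted

# then validate the new credential independently of any config files
export OS_AUTH_TYPE=v3applicationcredential
export OS_AUTH_URL=https://openstack.eqiad1.wikimediacloud.org:25000/v3  # assumed endpoint
export OS_APPLICATION_CREDENTIAL_ID=<id from the create output>
export OS_APPLICATION_CREDENTIAL_SECRET=<secret from the create output>
openstack token issue   # only succeeds if the credential is valid
```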
[09:35:59] "Unrestricted (dangerous)" [09:36:20] yep [09:36:21] okok [09:41:43] now to try to fish where those keys are used xd [09:47:35] I think there's an encrypted file in the paws repo or something if you're re-deploying everything, no idea if those can be installed in the existing cluster somehow [09:49:26] there's at least a clouds.conf base64 encoded file with the credentials in it, but there's also the tofu one, that I think might be using something different [09:51:31] there's ec2 creds also somewhere that I guess I'll have to create too [09:53:22] yes. two options for doing that, either set up the `openstack` cli authenticated as the service account you just created, or do it as novaadmin from a cloudcontrol where you can manually set the user to own them when creating [10:08:46] I think this might be it https://github.com/toolforge/paws/pull/490 [10:10:31] what is the development flow there? should I just pull that PR into the bastion and deploy it? Is there any ci that has to run on it? should I merge it before deploying? [10:12:29] i think traditionally it has been merge-then-deploy, but I wouldn't be opposed to doing it the other way this time to keep a clean git history after it's been tested [10:12:47] you probably need to pull in at least https://github.com/toolforge/paws/pull/485, or change it to the VXLAN/dualstack network [10:15:00] let's do one change at a time xd [10:15:09] why is that not yet merged? [10:16:00] (as in, is there any specific reason?) [10:16:19] because we didn't know how changing that would affect the existing cluster, see the comments [10:16:50] tbh I'd much prefer going straight to a VXLAN-enabled network, even if to the ipv4-only [10:16:54] okok, good timing then [10:17:02] how would I do that? [10:17:19] (has VXLAN been tested with magnum?) [10:18:36] just using `VXLAN/IPv6-dualstack`? [10:18:53] or `VXLAN/IPv4-only`? [10:19:02] dualstack you said before [10:19:09] okok, I'll send patch [10:20:56] I think Bryan tested that on the zuul project. vxlan itself doesn't cause issues, but AIUI magnum only allocates and uses ipv4 addresses. though there's no harm in plopping those to the dualstack network and the v4-only is more or less unused at this point, so let's go to the dualstack one [10:21:48] https://github.com/toolforge/paws/pull/491 [10:21:59] I'll deploy that stack then, crossing fingers [10:22:30] approved [10:23:36] thanks :) [10:25:31] tofu seemed to be able to pull the state, that's a good signal xd [10:25:41] oops [10:25:53] │ Error: Error updating openstack_containerinfra_clustertemplate_v1 b2ba7998-a4e8-4f52-a983-d27328f0f9d7: Bad request with: [PATCH https://openstack.eqiad1.wikimediacloud.org:29511/v1/clustertemplates/b2ba7998-a4e8-4f52-a983-d27328f0f9d7], error message: {"errors": [{"request_id": "", "code": "client", "status": 400, "title": "ClusterTemplate b2ba7998-a4e8-4f52-a983-d27328f0f9d7 is referenced by one or multiple clusters", [10:25:53] "detail": "ClusterTemplate b2ba7998-a4e8-4f52-a983-d27328f0f9d7 is referenced by one or multiple clusters.", "links": []}]} [10:30:27] hmm.... I guess I have to do this instead of just running deploy? https://wikitech.wikimedia.org/wiki/PAWS/Admin#Blue_Green_Deployment [10:33:00] https://github.com/toolforge/paws/pull/492 <- new deployment [10:33:53] creating the new deployment... [10:34:10] side question probably for later, why is that using google dns instead of our own recursor? [10:35:04] I noticed yep, 8.8.8.8 right? probably was the easiest to remember xd [10:35:42] hmpf... 
[10:36:32] hold on, I think https://github.com/toolforge/paws/blob/9c42a38a368b3ad460fd122e60851c67335c3ed2/tofu/vars.tf#L73 also needs changing if we're changing the network
[10:38:16] the new subnet name being 'vxlan-dualstack-ipv4'
[10:41:04] okok, that does not seem related to the error though, no?
[10:41:51] yeah, probably not
[10:42:17] that error hints that something's not right with the appcred. did you create it with the unrestricted option enabled?
[10:42:38] yep
[10:42:39] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Magnum#Provisioning_with_OpenTofu
[10:42:46] let me see if I can double-check
[10:43:01] and which roles did you add to it?
[10:43:19] member and reader
[10:44:19] that seems right
[10:44:29] https://usercontent.irccloud-cdn.com/file/q66YNvhE/image.png
[10:44:49] maybe I missed some other place in the code
[10:45:09] can you manually authenticate to the openstack api with those credentials?
[10:46:48] yep
[10:47:00] https://www.irccloud.com/pastebin/xb9K4VVK/
[10:47:22] it did create the cluster, so some credentials work
[10:47:22] xd
[10:47:41] (as in, it triggered the creation, though it failed eventually)
[10:48:42] very odd
[10:48:48] try deleting the cluster and trying again?
[10:48:54] ack
[10:50:41] same `Failed to create trustee or trust for Cluster: 7270043f-d265-4808-a793-ad74c158a6b2`
[10:52:14] in logstash there are some logs with just `exception during message handling` from magnum :/
[10:53:18] digging in logstash and the magnum logs on cloudcontrol1006 I found this:
[10:53:22] > "keystoneauth1.exceptions.http.BadRequest: Invalid input for field/attribute project_id. Value: paws. 'paws' is not a 'uuid' (HTTP 400) (Request-ID: req-fc263746-a5a0-4da8-82ca-6028459cf07a)",
[10:53:27] I suspect you just discovered a keystone bug
[10:54:00] full stack trace: https://phabricator.wikimedia.org/P78804
[10:55:06] xd, I was looking at that too, \o/
[10:55:49] https://review.opendev.org/c/openstack/keystone/+/952641 looks very familiar
[10:55:52] the stack trace only shows up on the cloudcontrol
[10:56:03] I wonder if that hasn't been deployed yet, or if there's a second case of it
[10:57:47] yep, that is not applied on our keystone
[10:57:59] https://www.irccloud.com/pastebin/QZBBAEnf/
[10:58:27] maybe it's missing the puppet patch?
[10:59:22] hmm... it should be there
[10:59:27] https://www.irccloud.com/pastebin/v2GEMXdu/
[11:00:01] that's epoxy
[11:00:18] and we have epoxy installed, if I'm not mistaken
[11:01:43] gtg. I'll be back in ~30 min.... feel free to continue with this
[11:03:04] I suspect puppet might have failed to apply the patch
[11:04:06] yeah, I think I found why and am making a patch
[11:08:47] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167189
[11:14:10] dcaro: when you're back, try re-creating the cluster again?
[11:37:19] thanks!
[11:39:45] currently in progress (promising)
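(The retry here is plain tofu from the paws repo checkout; a minimal sketch, assuming the tofu/ directory layout referenced earlier and standard magnum tooling for watching progress. The directory path is an assumption.)

```bash
cd paws/tofu
tofu plan    # confirm only the cluster resource will be (re)created
tofu apply   # re-creates the openstack_containerinfra_cluster_v1 resource

# watch magnum's view of the build from any authenticated client;
# CREATE_IN_PROGRESS -> CREATE_COMPLETE (the run below took ~22 min)
openstack coe cluster list
```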
[11:46:22] dhinus: do you know anything about the wiki replica query sampler? I found https://gerrit.wikimedia.org/r/c/operations/puppet/+/989542 while cleaning up old patches and am wondering whether we're still interested in it or whether it should just be removed
[11:47:29] taavi: I noticed it a while ago, I think it can be removed, but let me have another look
[11:52:54] it seems to be stuck now, the logs show `ValueError: Field 'node_addresses[0]' cannot be None`; that might be related to it failing to get an ip or something
[11:53:43] I think I might be deploying the wrong patch xd
[11:53:54] did you change the subnet name variable I mentioned earlier?
[11:54:30] yep I did, I had just reset to the wrong hash; now making sure
[12:06:21] taavi: thank you for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167189 -- it is surprising!
[12:10:51] it's taking a bit though... how long does it usually take?
[12:22:07] Are you talking about the paws deploy? Is that paws-127b?
[12:22:22] yep, it finished :)
[12:22:23] openstack_containerinfra_cluster_v1.k8s_127b: Creation complete after 22m4s [id=f79954b8-709a-487d-beda-1e406014d49a]
[12:22:28] it failed after that though
[12:22:38] │ Error: Error updating openstack_containerinfra_clustertemplate_v1 b2ba7998-a4e8-4f52-a983-d27328f0f9d7: Bad request with: [PATCH https://openstack.eqiad1.wikimediacloud.org:29511/v1/clustertemplates/b2ba7998-a4e8-4f52-a983-d27328f0f9d7], error message: {"errors": [{"request_id": "", "code": "client", "status": 400, "title": "ClusterTemplate b2ba7998-a4e8-4f52-a983-d27328f0f9d7 is referenced by one or multiple clusters",
[12:22:38] "detail": "ClusterTemplate b2ba7998-a4e8-4f52-a983-d27328f0f9d7 is referenced by one or multiple clusters.", "links": []}]}
[12:22:52] it seems it can't update the template in-place while there is a cluster using it
[12:23:15] so the network stuff for cluster 127a might not work; we might have to remove 127a for tofu to run without problems
[12:24:28] you're following https://wikitech.wikimedia.org/wiki/PAWS/Admin#Blue_Green_Deployment right?
[12:24:43] yep
[12:25:10] but there are also some changes to the network stuff besides just creating a new cluster (and openstack credentials and such)
[12:25:14] sure
[12:25:24] could you make a new template rather than re-using one?
[12:25:28] the new cluster is up and running, let's see if it works; if it does I can remove the old one
[12:25:34] fair :)
[12:29:08] hmmm... I think it might not have deployed some stuff
[12:29:56] should I just remove the other cluster? it's not working anyhow
[12:32:25] I think I'll do that, yep; the new cluster is connecting to openstack correctly, so that's good
[12:38:23] hmpf... magnum is failing to delete the cluster
[12:38:25] magnum.common.exception.AuthorizationFailure: unexpected keystone client error occurred: Could not find user: rook. (HTTP 404) (Request-ID: req-9fe5ce1c-f790-46a0-80d0-b2145d81dd06)
[12:39:24] I think it's trying to delete the trusts
[12:41:13] hm
[12:41:51] sometimes you can delete things piecemeal starting with 'openstack stack resource list'
[12:43:39] or we could temporarily re-enable that account in ldap to make openstack see it again
[12:44:01] that might be easiest
[12:45:22] how do I do that? the trust id is "4a3c16e880174617804058d8f7aa1ef0"
[12:45:53] it needs to be done via idm.wikimedia.org, done
[12:46:43] dcaro: can you retry now somehow? openstack sees the account again
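(For the record, the piecemeal route andrewbogott suggested at 12:41 would look roughly like this; it wasn't needed in the end, since re-enabling the account worked. The stack identifier is a placeholder.)

```bash
# magnum clusters are backed by heat stacks; find the stack for the old cluster
openstack stack list
# list its remaining resources, and delete resources individually if the
# whole-stack delete is stuck
openstack stack resource list <stack-id>
openstack stack delete <stack-id>
```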
[12:46:55] ack, thanks, looking
[12:46:58] hm, we might also need https://review.opendev.org/c/openstack/keystone/+/953723, which is merged upstream but which I didn't think was urgent in our deployment...
[12:47:22] the error changed
[12:47:23] magnum.common.exception.AuthorizationFailure: unexpected keystone client error occurred: You are not authorized to perform the requested action.
[12:47:59] don't tell me the deletion fails because it's not done with r.ook's account
[12:48:23] yep
[12:48:31] wait, not sure
[12:49:06] not sure why it fails though, but it might be that
[12:49:36] I suspect that it might be trying to use the trust that was created with roo.k's account, and that might not have permissions
[12:50:44] I think it's using the trust, yep
[12:51:04] `"context": {"user_name": "ff6079a2-a672-480e-aebe-06c0534c24a3_paws", "project_name": null, "domain_name": null, "user_domain_name": "magnum", "project_domain_name": null, "user": null, "project_id": null, "system_scope": null, "project": null, "domain": null, "user_domain": null, "project_domain": null, "is_admin": false, "read_only": false, "show_deleted": false, "auth_token": null, "request_id":
[12:51:04] "req-7697cf0c-3c20-46dd-87c8-83c38bb5b54d", "global_request_id": null, "resource_uuid": null, "roles": [], "user_identity": "- - - - -", "is_admin_project": true, "auth_url": null, "user_domain_id": null, "user_id": null, "trust_id": "4a3c16e880174617804058d8f7aa1ef0", "password": "*********", "all_tenants": false}`
[12:52:27] so is the issue that ro.ok's account no longer has permissions in the paws project?
[12:52:34] huh, heat seems to think that 127a is already gone...
[12:52:35] maybe, let me give it some
[12:52:36] xd
[12:52:48] `openstack stack list` does not show it anymore
[12:52:58] * andrewbogott runs 'openstack role add --project paws --user rook member'
[12:53:00] try again?
[12:53:26] passed!
[12:53:40] we can disable the account again now xd
[12:53:46] ok!
[12:53:48] * andrewbogott removes the roles
[12:54:15] thanks both! :)
[12:54:16] done. Going to leave the ldap magic to whoever still has that window open
[12:54:20] * taavi does
[12:54:29] does the new cluster actually work?
[12:54:58] done
[12:55:20] not yet
[12:55:26] it's deploying the ingress and such
[12:55:32] it's failing to pull an image though
[12:55:37] https://www.irccloud.com/pastebin/VtcTZQun/
[12:57:53] bah, quay seems to have changed their auth system since my last visit
[12:58:01] (although that shouldn't matter for the paws image)
[12:58:19] seems huge
[12:58:23] (pulling on my laptop)
[12:58:28] that downloads for me locally, but it is an absolutely massive image
[12:58:45] 7.29GB
[12:59:25] yep
[12:59:27] :/
[12:59:42] oh, it finished pulling
[13:00:18] and paws seems to be up \o/
[13:00:42] nice
[13:00:56] awesome, thank you dcaro!
[13:02:14] is https://github.com/toolforge/paws/pull/492 now up-to-date, or do you have additional fixes that need to be committed?
[13:03:49] that's all, just merged
[13:05:09] oh, I think I messed something up?
[13:05:16] hmm yeah, I think https://github.com/toolforge/paws/pull/491 went to some wrong branch
[13:05:54] oh, github does not change the target branch once the target branch is merged
[13:05:59] 🤦‍♂️
[13:06:14] so it did merge it, but to the old branch
[13:07:22] oh my, and then it creates a merge commit when you 'update branch', instead of rebasing... I'll fix manually... sorry
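(Recapping the unblock at 12:52–12:53 above: a temporary role grant let magnum's trust cleanup find the account, and it was revoked right after. The first command is quoted verbatim in the log; the second is the matching revocation.)

```bash
# grant the re-enabled account membership in the paws project so magnum
# can clean up the trusts it owns
openstack role add --project paws --user rook member
# ... re-run the cluster deletion, then revoke the grant
openstack role remove --project paws --user rook member
```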
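(And a hedged sketch of what watching the new cluster come up likely looked like, using standard magnum and kubectl tooling; the cluster name is from this deploy, the kubeconfig path and pod names are placeholders.)

```bash
# fetch a kubeconfig for the new cluster
openstack coe cluster config paws-127b --dir ~/.kube/paws-127b
export KUBECONFIG=~/.kube/paws-127b/config

# spot pods stuck on the 7.29GB image pull
kubectl get pods --all-namespaces | grep -v Running
kubectl describe pod <pod> -n <namespace>   # shows the ErrImagePull events
```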
[13:12:38] okok, merged nicely
[13:14:08] andrewbogott: do you know anything about the wikitech-static alert at https://alerts.wikimedia.org/?q=team%3Dwmcs ?
[13:15:02] the one about version mismatch you can ignore, the one about content I haven't investigated in depth. I thought it was firing because of a DOS but that's fixed and it's still firing
[13:16:01] hmmm
[13:16:08] yeah, it looks like it hasn't updated in a week or so
[13:16:16] I can take a look
[13:18:46] it's because https://dumps.wikimedia.org/other/wikitech/ stopped updating at the end of June
[13:20:19] hmm, maybe this is from when dumps generation was modified by data platform? cc btullis
[13:23:25] worth filing a task for them, I guess?
[13:24:09] Oh, right. Yes, sorry, that was probably my fault.
[13:24:32] do you want me to file a task in phab?
[13:24:38] I had assumed that it was no longer necessary, now that wikitech is just like any other wiki.
[13:25:02] https://dumps.wikimedia.org/labswiki/
[13:25:34] so it's there but under a different path?
[13:25:55] the "regular" dump is twice-monthly, compared to the special dump, which was daily
[13:26:37] I am trying to redesign all that but I'm stalled, so we will need those daily dumps for a while yet
[13:27:03] Ah, OK. Yes that makes sense, especially if that dump is used to reconstitute wikitech-static. It was added to the normal dumps after this: https://phabricator.wikimedia.org/T374952
[13:28:18] You should probably keep the twice-monthly dump as well, since that will be what we want if/when we stop using the dailies
[13:29:16] We have been talking about doing our own end-to-end integration test of dumping a wiki daily, to make sure that dumps-on-airflow works, so maybe this would be a good candidate. Hold on a sec before creating that ticket...
[13:29:38] ok
[13:40:31] I've created T398968 anyway, feel free to edit/merge with other tasks
[13:40:32] T398968: wikitech-static: resume daily dumps - https://phabricator.wikimedia.org/T398968
[13:40:44] On second thoughts, it's not a great candidate for our integration test, because the format is different: https://github.com/wikimedia/operations-puppet/blob/production/modules/snapshot/files/systemdjobs/wikitechdumps.sh
[13:41:21] We can backfill for now, to get your alerts cleared, then we will add this specific wikitech/labswiki dump to Airflow as well.
[13:43:52] thanks btullis
[13:45:28] thank you!
[13:58:11] The dumps for the last 8 days are now published. Will wikitech-static update itself, or does something else need to be kicked to make it update?
[13:58:46] thanks btullis, I think it will update automatically but I don't remember the details...
[13:58:54] I can double-check later if the alert is still there
[13:59:07] Cool. I'll get to work on a DAG for it.
[13:59:16] I think there's a nightly timer that picks those up
[14:04:02] yeah, it will update itself after a few hours
[15:02:05] taavi: quick review, is this what you requested in the task for the toolconfig schema? https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/98#632faceae95d5aad1c3681b5524c2aa401f0f7d3
[15:04:59] dcaro: on a very quick look that does indeed seem to be it; for a more thorough review you'll need to wait until tomorrow
[15:05:35] it's not urgent, I can wait 👍
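(A hedged way to verify the two loose ends from the dump thread above: that the backfilled files are visible at the published URL, and that the nightly import timer exists on the wikitech-static host. The URL is from the discussion; the timer name is an assumption.)

```bash
# the feed wikitech-static imports from
curl -s https://dumps.wikimedia.org/other/wikitech/ | tail

# on the wikitech-static host: look for the nightly import timer
systemctl list-timers | grep -i wikitech
```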
[16:48:43] andrewbogott: yesterday I discovered some neat things that Magnum sets up, and I'm looking into using one of them. There is a webhook driver that lets the k8s cluster use Keystone to authenticate. It also lets you map OpenStack roles to k8s RBAC roles.
[16:49:56] This leads me to ask whether it might be reasonable to create a few OpenStack roles for this RBAC mapping function, if/when I discover that's needed.
[16:50:20] Sure, creating roles is easy if you need different ones from what we've got.
[16:50:44] cool. :)
[16:50:54] https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/keystone-auth/using-keystone-webhook-authenticator-and-authorizer.md is the thing
[16:52:06] When you set `cloud_provider_enabled = true` in the Magnum config, it installs the things from https://github.com/kubernetes/cloud-provider-openstack
[16:57:28] andrewbogott: tangentially related question -- I see a profile for Barbican in Puppet, but `openstack secret list` tells me "public endpoint for key-manager service in eqiad1-r region not found". Does that mean we don't have Barbican at all, or just that it is not exposed to that user account?
[16:58:14] it's been tested in codfw1dev but not implemented in eqiad1 (and not very close to it either, since it's just the API and we don't have a backend for storage)
[16:59:13] ok. is that part of the replacement for heat in Magnum, or a different thing entirely?
[17:05:03] * dcaro off
[17:05:06] cya tomorrow!
[17:06:49] bd808: I'm pretty sure we don't need it for the magnum update. Ideally we'll be able to re-use whatever the prod SREs come up with for secrets.
[17:09:38] as long as they pick openbao I guess ;)
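(Both features in this last thread are exposed as magnum cluster-template labels, per the upstream magnum documentation; a hedged sketch of where to look, re-using the template UUID from earlier in the day.)

```bash
# inspect the labels on the existing cluster template
openstack coe cluster template show b2ba7998-a4e8-4f52-a983-d27328f0f9d7 -c labels

# a template built for this would carry something like:
#   --labels cloud_provider_enabled=true,keystone_auth_enabled=true
```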