[00:18:37] taavi: the rabbit hole you helped me find (putting an IPv6 proxy in front of the Magnum k8s cluster for zuul) keeps getting deeper. I need TLS certs now. Do I have to build an Acme Chief instance to provision certs or is there a lighter way with Puppet CA or something?
[00:19:13] Or I could back up one step and stick with the Cloud VPS shared proxy as my ingress
[01:28:27] bd808: you need to roll your own, not use a ready-made networks->proxy proxy?
[01:28:36] oh, I guess that's what you just said :)
[01:31:59] Yeah. I have the shared proxy working. Taavi seemed to think that was at least a little bit wrong somehow. That may have been more that he thought I could just put IPv6 on the cluster API nodes. That turned out not to work because Magnum doesn't support dual-stack configuration.
[08:29:43] morning. toolsdb is STILL lagging, but this time for a different query on a different db (: I will add details in T398170
[08:29:44] T398170: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-30 - https://phabricator.wikimedia.org/T398170
[08:57:57] bd808: hmm. why does that proxy need to terminate TLS? isn't running it in TCP mode and relying on the kubernetes pki setup enough?
[09:19:58] this enables our ci deb building pipeline to build multiarch (specifically, arm64 too) https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/64 reviews welcome
[09:24:21] nice!
[09:46:58] the misctools side of it https://gitlab.wikimedia.org/repos/cloud/toolforge/misctools-cli/-/merge_requests/1
[09:47:11] (not working yet, as it needs the cicd change merged to use the pipeline changes)
[09:47:42] re tools-static, what do you all think about splitting the nfs requests to a separate VM, to isolate those issues from cdnjs/fontcdn requests?
[09:48:21] can it be two different nginx processes instead?
[09:48:33] good question
[09:50:42] the restarting of nginx seems to be enough to get it unstuck
[09:50:51] (the script did a bunch so far)
[09:53:42] actually we could do that, or put, say, haproxy in front of nginx to handle cdnjs/fontcdn, which might be simpler
[09:57:04] sounds good to me
[09:57:32] yeah. not sure why I did not think of that. thanks!
[09:57:35] not a fan of haproxy logs/instrumentation, but well, good enough
[10:23:16] * dcaro lunch
[12:25:10] tools-db replication is back on track. I added more details to https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication#Transactions_way_slower_on_the_secondary_than_the_primary
[12:33:55] yep, one of the drawbacks of row-based replication
[13:03:50] dcaro, is this anything you've seen/heard of? T398389
[13:03:51] T398389: ceph-mon 16.2.15+ds-0+deb12u1 uses all the RAM - https://phabricator.wikimedia.org/T398389
[13:04:28] I looked for a moment into it, but did not find anything, nope; there are no logs from the time of the issue either
[13:05:08] I know that mons sometimes increase memory usage to do things like rebalancing and such, as they need to know the whole state of the cluster to shuffle things around, but it should not need to rebalance anything there
[13:05:51] andrewbogott: I see the node is up and running now, though on v15, was it put in v16 from scratch before? (instead of upgrading os -> put mon in cluster -> upgrade ceph to v16)
[13:06:31] I reimaged it again on Bullseye because the Bookworm build was such a disaster.
[13:06:33] (not that there should be anything wrong with that, but probably introducing a new mon on a newer version is not a paved path)
[13:06:42] Was thinking next I'd try v16 with bullseye rather than bookworm
[13:07:05] Isn't upgrading the mons one by one before osds the correct upgrade path?
[13:08:05] ...my notes say to do v16 and v17 on bullseye but since v16 is packaged with bookworm I was hoping we could just do the OS upgrades next.
[13:08:07] yep, though upgrading means having it running on the previous version, and only upgrading ceph, as opposed to taking a mon out, upgrading os+ceph, and bringing a new mon on the newer version right away (not sure if that's what you did)
[13:08:52] yeah, what I did was definitely abrupt (starting with a totally fresh osd on a new version)
[13:09:23] we can do whatever is easier, upgrading ceph first, or os first keeping the old ceph version
[13:09:42] but yep, I'd avoid both at the same time
[13:10:00] (again, not that it should not have worked imo)
[13:10:05] So, next question is: how do I get the v16 packages on Bullseye? Is there already puppet code to add a new apt repo?
[13:11:12] not sure if this is what you ask xd, but you can configure the extra repos for the nodes (if the packages are already pulled in by reprepro)
[13:12:14] I think that's what I asked, mostly wondering if I have to write new puppet code to manage the extra repo
[13:12:15] the third-party ones I mean
[13:12:36] I'd say no, I think it's a hiera value somewhere, let me check
[13:13:08] this might be it https://gerrit.wikimedia.org/g/operations/puppet/+/bde7653715a08f271de3b18e18f80f0c3ad7b8fa/hieradata/codfw/profile/cloudceph.yaml#6
[13:13:23] ok! I'll try that
[13:13:33] there might be a non-ceph specific one though xd, but that one sounds ok to me
[13:14:38] I don't see a generic one :/, so yep, that'd be it I think
[13:16:06] thanks!
[13:16:15] let me know if you want any help :)
[13:16:30] I will, but only after things go horribly wrong
[13:22:06] hahahaah
[13:22:51] sorry, had to: https://bash.toolforge.org/quip/DfJNy5cBvg159pQr2m5_
[13:23:19] :D
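[editor's note: for a mon that is eating RAM (T398389), standard Ceph tooling can show where the memory is going and cap the cache target. A minimal sketch, assuming a v15/v16-era cluster; the mon name is a placeholder and the first two commands run on the mon host itself:]

    # dump per-pool memory accounting and tcmalloc heap stats via the admin socket
    ceph daemon mon.cloudcephmon1001 dump_mempools
    ceph tell mon.cloudcephmon1001 heap stats

    # cap the mon's memory target (option exists since Nautilus); 2 GiB is an example value
    ceph config set mon mon_memory_target 2147483648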
[13:27:01] shall I restart the NFS workers with D states?
[13:28:12] andrewbogott: feel free to, yep, they seem to be coming in-and-out though, or did you restart any yesterday night?
[13:28:30] I restarted a couple yesterday
[13:28:43] next time I have to do it I might implement a "pull the list from prometheus" kind of option
[13:29:15] we could have an agent on the workers that notices and self-reboots
[13:29:43] yep, that would need some safeguards though (not rebooting too often, making sure there's enough other workers, etc.)
[13:30:04] Yeah, it could misfire in surprising and bad ways
[13:30:35] done that before xd
[13:30:54] * dcaro replacing smartness with scars
[13:37:36] initial data on toolforge log volume: https://phabricator.wikimedia.org/P78738
[13:39:15] andrewbogott: is https://wikitech.wikimedia.org/wiki/Help:Object_storage_user_guide#Quotas_and_other_limitations up to date?
[13:40:26] it is, as far as I know
[13:41:43] ok, I will need a lot more objects then
[13:42:18] yeah, the default quotas are really just 'try this out,' I was expecting to have to adjust quotas for basically any real use case.
[13:42:27] Now that we're convinced the service works we could revisit that
[13:44:44] taavi: awesome!
[13:45:43] yep, I suspect that logs currently sent to NFS will be way more voluminous, and probably some tools will be several orders of magnitude more voluminous than others
[13:45:57] does loki allow per-tool throttling/retention?
[13:46:27] btw. self-removal of objects in rados does not yet work (expecting it to be fixed with the upgrade, but currently it does not free space when objects are deleted)
[13:46:54] hmm..... andrewbogott ^ that might be one thing making the mon use too much memory, maybe running a proper cleanup of the objects?
[13:47:07] huh, that's good to know, that might be a large problem for this case
[13:47:36] dcaro: we currently do rate limiting during ingestion (so basically per tool+worker), https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/main/components/logging/values/alloy/common.yaml.gotmpl?ref_type=heads#L111
[13:47:59] I think Loki can also do something similar (which would make it global per tool, and not per-worker), but I haven't looked that deep into it
[13:48:24] how does the client behave when it reaches the limits? does it drop logs? buffer them up? freeze?
[13:48:32] dcaro: ah, so you mean the first time v16 sees our setup it thinks "omg what a mess" and starts frantically deleting things
[13:48:34] * dcaro curious
[13:48:47] andrewbogott: yep, not sure, that would be the mgr daemon I think, though
[13:49:04] that suggests that /maybe/ even if it ooms repeatedly it might eventually get itself out of the mess
[13:49:10] there is buffering for a certain amount (see the `burst` parameter there), after which they will be dropped
[13:49:21] anyway... soon we will see if the same thing happens with an in-place update
[13:50:47] 🤞
[13:50:53] T398447
[13:51:10] taavi: thanks, sounds good
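[editor's note: on the per-tool throttling/retention question above: Loki does support per-tenant limits through its runtime overrides file, which would map to per-tool limits if each tool is its own tenant (an assumption here). A minimal sketch with illustrative values; the tool name and file paths are hypothetical:]

    # loki.yaml -- global defaults, plus a pointer to the runtime overrides file
    limits_config:
      ingestion_rate_mb: 4
      ingestion_burst_size_mb: 6
      retention_period: 744h   # per-tenant retention needs the compactor with retention enabled

    runtime_config:
      file: /etc/loki/overrides.yaml

    # /etc/loki/overrides.yaml -- per-tenant (i.e. per-tool) overrides,
    # reloaded periodically without restarting Loki
    overrides:
      some-noisy-tool:
        ingestion_rate_mb: 1
        retention_period: 168h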
[13:54:38] I just now tried to create an openstack project 'mobileapps' and the cookbook noticed that there's a proxy with that name in deployment-prep and aborted before creating the project. Excellent work, whoever put that pre-check in that cookbook!
[13:55:12] yw!
[13:55:30] (not sure if I did that one xd, I did some at least)
[15:09:01] apparently stashbot was away when I first pasted this: T398447
[15:09:02] T398447: Increase tools-logging object storage quotas - https://phabricator.wikimedia.org/T398447
[15:28:02] taavi: It had not crossed my mind to try using it as a Layer 4 proxy rather than a Layer 7 one. I will give that a shot and see if I can make it work.
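[editor's note: a minimal sketch of that Layer 4 approach: haproxy in TCP mode passes TLS through untouched to the Kubernetes API servers, which keep using the cluster's own PKI, so the proxy needs no certificates at all. Names, ports, and addresses here are placeholders:]

    # /etc/haproxy/haproxy.cfg (fragment)
    frontend k8s_api
        bind [::]:6443 v6only        # the IPv6-facing side of the proxy
        mode tcp
        option tcplog
        default_backend k8s_api_servers

    backend k8s_api_servers
        mode tcp                     # no TLS termination; bytes pass straight through
        balance roundrobin
        option tcp-check
        server control-0 172.16.0.10:6443 check
        server control-1 172.16.0.11:6443 check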
[16:03:06] btw taavi this should fix the schedule issue you were seeing https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/178
[16:07:50] taavi: can you give the latest https://gerrit.wikimedia.org/r/c/operations/puppet/+/1165155 a quick look?
[16:08:22] +1
[17:02:50] If anyone is still around... what would make opentofu try to create a project that already exists?
[17:03:18] This is me re-running it after it timed out on a previous run, but now the project is there so I don't understand why it tries and then gets an error from keystone about the name conflict
[17:03:45] it might not have updated the state
[17:03:51] I think that happened to me a couple times
[17:04:30] there was a command to run to import it directly, you'll need the project id in openstack and the full tofu path (shown in the error iirc)
[17:04:40] check the history, you might find my command xd
[17:04:41] oh right, and wmcs.openstack.tofu doesn't do that either?
[17:04:52] should put it in the wiki
[17:05:17] command history on cloudcumin1001?
[17:05:35] I think I used cloudcontrol, not sure now, but yep, command history
[17:05:42] should be in irc too
[17:06:12] Doesn't tofu look at the actual openstack state, though? Can't it see that the project is there?
[17:07:14] but it's not in the terraform/tofu state file
[17:07:19] so it has no id to match it
[17:07:32] it just knows that there's a project with that name already
[17:07:48] the command is to "import" that project with the id you give
[17:07:51] so it can match it
[17:08:10] But the tofu state file derives from reality doesn't it? So can't I just tell it to update itself?
[17:08:22] * andrewbogott worried that he doesn't understand tofu at all
[17:09:12] https://www.irccloud.com/pastebin/qZ0JmKSu/
[17:09:21] hmpf.. butchered that message xd
[17:09:31] something like `tofu import 'module.project["tools-logging"].openstack_identity_project_v3.project[0]' 8c0be231d33e4d5d8edee6386e58e698`
[17:11:53] Running that on what host and in what dir? (I've tried many so far without success)
[17:14:19] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/tofu-infra#Timeout_when_applying_creates_object_in_openstack_but_not_in_tofu-state,_failing_on_the_next_run
[17:14:41] I used cloudcontrol1006, that has the tofu setup (I think the cookbook tells you which one it's running on)
[17:14:53] feel free to reword/improve the title xd
[17:14:54] of course, the one place I didn't try...
[17:14:57] thank you for documenting!
[17:15:48] after that you'll have to run tofu again, and then the cookbook with the --skip-mr thingie so it adds the users/quotas/etc if any
[17:16:31] I think, not 100% sure now, it might complain because the project already exists
[17:17:31] if you have the exact error message you found, can you paste it somewhere in those docs too? for easy grepping
[17:19:22] that worked!
[17:20:07] yep, I'll look for the error. It's way up there
[17:27:22] bah, seems to not be in the log, at least not as it presented on the cli
[17:33:05] it's ok, next time :)
[17:36:24] * dcaro off
[17:36:29] cya tomorrow! \o
[17:46:00] * andrewbogott waves
[19:54:05] I have been wrapping my head around the tofu state stuff recently too andrewbogott. The tricky bit, as far as I have found so far, is figuring out what the id is to "import" a thing (which is basically just updating the state tracking).
[19:54:07] I eventually gave up on the thing I was trying and did a delete + recreate with tofu instead.
[19:54:32] ok -- I was going to ask if adding import_id would solve the issue
[19:55:08] if yes then I probably understand what's happening, but in that case it seems like we need that for basically every resource defined by tofu
[19:57:05] In that command in the runbook that D.avid wrote, the "8c0be231d33e4d5d8edee6386e58e698" is the magic ID. For the resource I was dealing with, that data turned out to be internally generated by tofu and I wasn't smart enough to figure out how it was picked.
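[editor's note: besides the one-off `tofu import` command shown above, OpenTofu also accepts declarative import blocks (inherited from Terraform 1.5), which make the state adoption reviewable and repeatable instead of a shell one-liner. A sketch reusing the address and id from the command in this log; whether tofu-infra would want such blocks checked in is a separate question:]

    # import.tf -- adopt an existing keystone project into the state;
    # running `tofu plan` then `tofu apply` performs the import
    import {
      to = module.project["tools-logging"].openstack_identity_project_v3.project[0]
      id = "8c0be231d33e4d5d8edee6386e58e698"
    }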
[19:58:53] It sounds like for an openstack_identity_project_v3 resource the id is the OpenStack project id, which makes reasonable sense
[19:59:23] most of the time the ids are generated by openstack, and tofu has to track them somewhere (in the tfstate)
[19:59:40] if tofu created the resource, it should track the id in the state automagically and you don't need to worry about it
[20:00:08] if you create the resource outside of tofu (or if the tofu creation fails half-way), you might have to do the "import" thing
[20:02:19] tofu will sync attributes of resources it already knows of, but if it doesn't know about a resource (e.g. a project that was created outside tofu) it will just ignore it, unless/until you import it
[20:02:31] hope this helps :)
[20:04:52] bd808: now I'm curious which tofu resource internally generates an ID
[20:05:24] taavi: likewise, it's theoretically possible but very rare
[20:09:43] taavi: it looked to me like cloudvps_puppet_prefix did, but maybe I just wasn't able to follow the golang code
[20:10:30] that comes from the ENC API, iirc it's the ID of the row in the prefix table. we may not expose that anywhere else, but it's definitely not only internal to the provider.
[20:11:24] :nod: I had tried https://puppet-enc.cloudinfra.wmcloud.org/v1/c26d9d326bdf464fa1025939ded7e5a2/prefix to see if it returned the id for each bucket, but it did not
[20:13:16] for "this would require an extra change in the Horizon panel and I did not have the energy back then" reasons, I think you need to pass a specific parameter to get the IDs from that endpoint
[20:14:14] I got tired of reading source code :)
[20:14:58] for the thing I was doing it was easy enough to delete the unmanaged resource and start over with the automation
[21:08:36] Given a notebook ID on paws-nfs, is there a way to find out what user created the notebook? I'm /so/ curious what this user is doing
[21:10:13] andrewbogott: yes, you can translate the numeric id to the common id. I have a curl that I would use to do that but I would have to look it up tomorrow
[21:10:43] OK! It's far from urgent.
[21:11:29] Alternatively, just list all of the server pod manifests and look for that user id in them. I believe they show up in there for mounting the home directory
[21:11:46] This notebook contains the entire Android source code which means it's probably abuse and the user will ignore me anyway. But there's a 5% chance they're doing something hilarious that I want to know about :)
[21:11:48] oh, that
[21:11:52] that'll work!
[21:23:56] andrewbogott: `sql centralauth 'select gu_name from globaluser where gu_id = 28456187;'`
[21:24:44] where?
[21:25:15] that works on toolforge against the replicas. I think it would also work in prod from the deployment host.
[21:29:12] 0 edits, account created 60 days ago
[21:30:29] "Hey look! Free compute!!!"
[21:50:09] https://www.mediawiki.org/w/api.php?action=query&format=json&meta=globaluserinfo&formatversion=2&guiid=28456187 is the API method for these sorts of lookups, for completeness.
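[editor's note: a sketch of the "grep the server pod manifests" approach mentioned above: serialize every pod to JSON and keep the ones whose manifest mentions the numeric id. The id is the one from this log; searching all namespaces avoids having to guess where PAWS runs its singleuser pods:]

    # print the names of pods whose full manifest contains the id
    kubectl get pods -A -o json \
      | jq -r '.items[] | select(tostring | contains("28456187")) | .metadata.name'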