[00:10:34] So that's just the ending . that it's mad about?
[07:34:18] greetings
[09:08:01] if anyone has an answer for https://phabricator.wikimedia.org/T428867#12008796 it would be great to know ;)
[09:25:18] how did we not get notified about T428867? I would have expected at least the generic systemd failure alert to fire for that
[09:25:18] T428867: Openstack cinder volumes backups are broken - https://phabricator.wikimedia.org/T428867
[09:27:06] the unit is clearly failed, I noticed because I was looking at backups
[09:27:16] I'll look at the monitoring after fixing it
[09:27:18] ● backup_cinder_volumes.service loaded failed failed backup cinder volumes
[09:29:07] re your original question, I want to say that it's fine and there is already a bandwidth throttle, the run will just take a while
[09:29:15] volans: ^
[09:29:17] ack thx
[09:30:24] could it be because team-wmcs/general_systemd_unit_down.yaml is # deploy-site: eqiad ?
[09:30:32] godog might be able to confirm ^^
[09:30:50] I'm not that familiar with the alerts repo syntax
[09:30:58] sounds possible :/
[09:31:34] yeah that's it, the production systemdunitfailed excludes team=wmcs
[09:31:43] ... ok that needs fixing
[09:32:39] I'll open a subtask
[09:32:50] thx
[09:32:55] I've added https://phabricator.wikimedia.org/T428867#12008912
[09:33:00] for backlog
[09:54:35] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1300730 should fix it AFAICT
[09:55:15] yes it will use the proxies also for backup1* hosts, but I think it's the right thing to do as they are currently working by luck
[09:57:05] I was just about to ask about that
[09:58:05] https://phabricator.wikimedia.org/T428867#12008873
[09:58:12] for the why they work :D
[09:58:21] volans: if the instance backups are also done via the proxies, then you need to update the http proxy access control rules for that
[09:58:32] it's in the patch no?
[09:58:35] but I would also say it's safe to rely on the hosts in the same site having cloud-private connectivity
[09:58:50] you have 'wmcs::openstack::$DC::cinder_backups' only there
[09:59:19] right, my bad
[09:59:23] they all have different names
[10:01:40] fixed
[10:14:32] btw we have an "urgent" restore request for some toolforge nfs files: T428830
[10:14:33] T428830: Urgent: Backup restore request (Toolforge) - https://phabricator.wikimedia.org/T428830
[10:15:03] are those backups among the broken ones?
[10:15:08] dhinus: nocando, that's cinder hence codfw
[10:15:14] so yeah we have it from like march
[11:05:08] taavi: weird that cloudbackup100[1-2]-dev were not broken even doing cross-site
[11:05:27] do we have a firewall open in codfw1dev that was not closed when the eqiad1 one was closed?
[11:05:59] volans: excellent question, let me check
[11:08:35] thx
[11:10:28] * volans wondering if I should force the backup now in one host, hoping it will not overlap with itself at the next timer run 19UTC cloudbackup2003, 07UTC cloudbackup2004
[11:19:41] damn, there is a bug in the fix, debugging now
[11:26:05] ok found
[11:29:08] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1300758
[11:59:25] I should have manually fixed all the alerts related to the mwopenstackclients.py bug I created and then fixed
[12:10:12] * taavi afk for half an hour
[12:45:29] checking puppet on tools/toolsbeta, I see a few alerts and looks like agent runs are quite slow
[12:45:36] slower than usual anyways
[12:45:42] I was already
[12:46:17] oh nice, any leads ?
[12:46:25] Waited 120 seconds and a preceding puppet run is still ongoing, aborting
[12:46:28] good catch volans. Once again we learn that if you aren't doing regular restore tests, you don't really have backups :(
[12:46:50] Retrying mwopenstackclients.Clients.keystoneclient in 7.602118072966442 seconds as it raised AttributeError: 'Clients' object has no attribute 'proxy_url'.
[12:47:01] 2026-06-11T12:46:29.601Z WARN [qtp520446076-952121] [c.p.p.ShellUtils] Executed an external process which logged to STDERR: Retrying mwopenstackclients.Clients.keystoneclient in 6.778393346399981 seconds as it raised AttributeError: 'Clients' object has no attribute 'proxy_url'.
[12:47:01] damn, but it is fixed
[12:47:21] don't tell me that puppet breaks itself without it
[12:47:23] this is on tools-puppetserver-01.tools.eqiad1.wikimedia.cloud I guess a puppet run is needed
[12:47:29] yes :(
[12:47:38] try to run puppet there
[12:47:45] if it fails I can do manual fix
[12:47:50] ok doing
[12:47:51] which part of a puppet run needs that? the enc client?
[12:48:25] I think so yeah
[12:48:34] my previous patch might have broken puppet in a weird way, sigh
[12:48:52] volans: would you mind applying the fix on the puppet server? puppet runs are quite slow
[12:48:57] sure
[12:49:13] wondering how many other puppetservers might be affected though
[12:49:22] i guess this is going to be broken in all the project puppetservers, not just tools?
[12:49:36] likely
[12:49:38] regarding that user restore request... those March backups are about to get wiped out by the now-working backup service aren't they? We certainly don't keep 60+ days of backups
[12:49:55] although to be clear I wouldn't feel bad telling the user we don't have backups since that's usually what we tell users
[12:50:02] andrewbogott: I was thinking the same
[12:50:24] so dhinus if you want to do a restore of the old nfs volume, you need to do it /right now/
[12:50:27] is the retention based on age or number of backups?
[12:50:33] age
[12:50:37] godog: /usr/lib/python3/dist-packages/mwopenstackclients.py fixed on tools-puppetserver-01
[12:50:48] backy2 has 8 days
[12:50:50] probably another thing to fix is to ensure at least N backups are stored?
[12:50:52] in the config
[12:50:55] volans: ack thx
[12:51:14] taavi: yeah, if that's possible with backy2 config. Can you note that someplace?
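Editor's sketch: the retention idea raised above (age-based expiry, 8 days in the backy2 config, plus a floor of at least N backups) could look roughly like the following. This is only an illustration of the policy, not how wmcs-backup or backy2 implement expiry; the function name `select_expired` and the parameters `max_age_days` and `min_keep` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def select_expired(backup_times, now, max_age_days=8, min_keep=3):
    """Return the backup timestamps that may be deleted, newest first.

    A backup is expired when it is older than max_age_days, but the
    newest min_keep backups are always kept, whatever their age.
    (Hypothetical helper; names and defaults are assumptions.)
    """
    newest_first = sorted(backup_times, reverse=True)
    # The floor: never delete the newest min_keep backups.
    protected = set(newest_first[:min_keep])
    cutoff = now - timedelta(days=max_age_days)
    return [t for t in newest_first if t < cutoff and t not in protected]


# Example: only months-old backups exist because the service was broken.
now = datetime(2026, 6, 11, tzinfo=timezone.utc)
march = [datetime(2026, 3, day, tzinfo=timezone.utc) for day in (1, 2, 3, 4)]
expired = select_expired(march, now)
```

With purely age-based expiry (`min_keep=0`) all four March backups would be selected for deletion on the first working run; with a floor of three, only the oldest one is.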
[12:51:23] (or I guess volans probably has a running doc to note it on)
[12:51:24] I would keep last N days (small) + one from 2 weeks ago + one from 2 months or something like that
[12:51:25] andrewbogott: I don't remember the exact procedure but yes I think we should try to help that user :)
[12:51:30] (if we can)
[12:51:39] Any reason to think a March backup is of any use for them?
[12:51:42] dhinus: I can extract the files if needed but will be old
[12:51:54] we can ask...
[12:51:57] might as well get a copy of them while we can, better than nothing
[12:52:03] ok let me do that
[12:52:10] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Cinder_backups#Restoring_a_single_file_from_a_backup
[12:52:31] andrewbogott: will file a task
[12:52:31] I'm familiar with that page unfortunately :D
[12:52:38] :/
[12:52:42] hrm, apparently we have both https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Backy2 and https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Cinder_backups
[12:53:08] the former should be renamed to reflect that it's about VM / backups
[12:53:09] iirc
[12:53:36] or just merge them together?
[12:53:43] that would also be fine :)
[12:54:20] volans: ok with the fix in place I killed the hung enc and tools puppetserver is back on its feet
[12:54:20] volans: are you doing the needful for that user restore? (just making sure that the 'we' in backscroll resolves down to an actual pair of hands)
[12:54:33] godog: thx
[12:54:41] andrewbogott: yes
[12:54:45] ty!
[12:55:22] this is the volume right?
https://horizon.wikimedia.org/project/volumes/49bc0222-14cc-4bfb-92e4-66d02792a649/
[12:55:29] tools-nfs
[12:55:46] yes
[12:55:50] T428897
[12:55:59] T428897: Cloud VPS cinder backy config should ensure a specific number of backups is kept - https://phabricator.wikimedia.org/T428897
[12:57:00] * andrewbogott keeps using passive voice because he just woke up and isn't ready to actually touch dangerous things
[12:57:12] dhinus: wait for asking
[12:58:17] volans: wdym?
[12:58:25] if you were asking the user
[12:58:33] ah ok
[12:58:53] backy2 ls doesn't show anything, despite the process still running, how can it do cleanup before backup given that it works with diffs?
[13:00:04] it was showing old stuff earlier, I'm sure of that
[13:00:18] let me try anyway
[13:00:34] I'm not 100% sure that the cleanup process is coupled to the backup process...
[13:00:40] volans: cleanup is done first if i understand https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/wmcs/backup_cinder_volumes.pp#42 correctly :(
[13:01:21] also this reminded me of T358780 :P
[13:01:21] T358780: [wmcs-backup] Race condition between backup and cleanup timers - https://phabricator.wikimedia.org/T358780
[13:01:24] sob :(
[13:02:14] :(
[13:03:01] I have so many questions...
[13:03:35] there are still 30T of used space according to df, backy2 ls is empty, but it has been running for a while, is it possible that it hasn't finished even a single volume yet?
[13:04:55] yeah, possible
[13:05:09] /usr/local/sbin/wmcs-backup volumes delete-expired still running
[13:05:36] volans: regarding your question about incremental backups above... it does full backups every X days so is surely doing that now. Also, the increment isn't based on the previous backup but based on a snapshot that lives in ceph.
[13:05:50] true
[13:09:03] for cloudbackup2003 we could bump the retention to 200 days to be safe and let it run one day or two like that
[13:09:10] before allowing the cleanup to delete all the old backups
[13:13:36] that seems like a good idea
[13:16:01] who's going to break the news about the restore request?
[13:19:44] soo on the side of the broken puppet we might have gotten lucky as it seems that cloudinfra-cloudvps-puppetserver-1.cloudinfra.eqiad1.wikimedia.cloud is somehow stuck at ab60dff5566a7a433e1ab50184ab8f0a0a8ceee1 Date: Wed Jun 10 15:24:30 2026 +0200
[13:19:50] checking the sync process now
[13:38:09] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1300802 for the retention
[13:51:25] I have some concerns, the delete-expired process is running very slowly, but apparently it has already cleared the metadata (backy2 ls being empty) but 27TB of blobs are still there
[13:51:56] I wonder if a) there is a way to be able to recover something or b) it is quicker to interrupt, rm -rf and get it to start fresh
[13:54:48] a) I doubt it unless you have a clever way to undelete the metadata
[13:55:14] b) it doesn't surprise me that the cleanup is taking a while, I doubt restarting would help.
[13:55:22] I find it weird that backy2 doesn't show the backups still available if they are deleted one by one
[13:57:22] agreed
[14:31:55] Raymond_Ndibe: coming to/leading today's meeting?
[15:24:44] do we have any policy about puppet running on the VMs?
[15:25:12] (I've run a last-puppet-run on a bunch of VMs to check the fix for the puppetmasters)
[15:26:29] the unwritten policy is that "yes, instances should be running puppet successfully"
[15:27:03] is "435448 minutes ago" considered successful?
:-P
[15:27:18] eventual consistency
[15:28:38] volans: my math says there should have been 14,500 successful runs since that, so I would say no :P
[15:28:53] :D
[15:49:54] FYI actual backups have started and they are showing up on backy2 ls
[15:50:42] do we have concerns about both cloudbackups running at the same time? I can make cloudbackup2003 skip tonight's run and restart tomorrow... not sure what's best
[16:09:05] andrewbogott: yeah, it didn't like the terminating `.` in the dns name, but after fixing that we're up and running!
[16:09:15] https://www.irccloud.com/pastebin/pXkVuY7f/
[16:09:19] happy day :)
[16:09:50] Awesome
[18:58:50] dduvall: I tried to add our learnings to the bottom of https://wikitech.wikimedia.org/wiki/Help:Managed_Kubernetes -- please expand that section if you have time and/or thoughts. And, for that matter, feel free to totally rework that page if it's a useful distraction.
[18:59:14] oh shit I said 'learnings' instead of 'lessons' what is happening to me
[18:59:17] andrewbogott: nice. i'll have a look after mw train
[18:59:21] haha
[18:59:24] drink!
[19:14:52] andrewbogott: something i've wondered about. would it be possible to have cluster templates that are managed by you all and consumed by clusters in our downstream projects?
[19:15:47] would definitely be possible, if it turns out that there are standard cluster types that multiple projects want to use
[19:16:41] from my experience now with both heat and cluster-api, i think that would take most of the headache out of the process, since it seems mostly about getting the right labels set
[19:17:14] and avoiding edge cases (like with the need for a master lb in capi)
[19:18:45] it's a good idea, I think global templates are already a supported thing.
[19:18:48] although i'm sure it's not all roses you guy all, especially if there end up being a bunch of downstream clusters
[19:18:52] nice
[19:19:01] Would also make this a lot more practical for those who want to do it all from the webui
[19:19:15] s/roses you/roses for/ (weird brain)
[19:19:38] I don't know if this product will ever have more than a count-on-one-hand number of users but a guy can dream
[19:19:48] :D
[19:20:05] it's cool when it works
[19:21:15] i think other than the templates, the upstream terraform/tofu provider really needs work around state and polling. it fails pretty much every time when deleting because of what seems to be a race condition between the first poll and the cluster state
[19:21:54] not a huge deal because deleting an entire cluster isn't something we will do a lot, but still
[19:22:02] huh, I haven't seen that with paws. What does it report when it fails?
[19:23:42] `│ Error: Error waiting for openstack_containerinfra_cluster_v1 6d8b1a0c-abbd-4dbf-be7d-86f5cd483749 to become deleted: unexpected state 'CREATE_IN_PROGRESS', wanted target 'DELETE_COMPLETE'. last error: %!s()`
[19:24:22] is that happening because you're using tofu to delete a hung installation? Or does it happen even if you have an actual healthy cluster?
[19:24:27] my suspicion is that it polls before the delete even starts
[19:24:44] i believe it happens in both cases, but i can't verify that
[19:24:47] atm
[19:25:18] ok. I could understand it checking to make sure nothing goofy is happening before it starts, and the cluster being in a create state might register as goofy
[19:25:27] although it really shouldn't for slow operations like this one
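
Editor's sketch: if the suspicion above is right and the first poll lands before the delete has registered, one way a waiter could avoid the failure is to tolerate a few stale readings before treating an unexpected state as an error. The code below is not the terraform/tofu provider's code. `wait_for_delete`, `get_status` and `grace_polls` are hypothetical names, and treating `DELETE_IN_PROGRESS` as the only in-flight state is an assumption; `CREATE_IN_PROGRESS` and `DELETE_COMPLETE` come from the error message quoted above.

```python
import time


def wait_for_delete(get_status, grace_polls=3, max_polls=60, interval=5.0,
                    sleep=time.sleep):
    """Poll get_status() until it reports DELETE_COMPLETE.

    Up to grace_polls stale readings (any state other than the delete
    states) are tolerated, but only before the delete has been observed
    to start. get_status is a caller-supplied callable (hypothetical).
    """
    delete_started = False
    stale_seen = 0
    for _ in range(max_polls):
        status = get_status()
        if status == "DELETE_COMPLETE":
            return
        if status == "DELETE_IN_PROGRESS":
            delete_started = True
        else:
            # Stale or unexpected state, e.g. CREATE_IN_PROGRESS.
            stale_seen += 1
            if delete_started or stale_seen > grace_polls:
                raise RuntimeError(
                    f"unexpected state {status!r}, wanted 'DELETE_COMPLETE'")
        sleep(interval)
    raise TimeoutError("cluster was not deleted in time")


# Example: the first poll still sees the pre-delete state.
states = iter(["CREATE_IN_PROGRESS", "DELETE_IN_PROGRESS", "DELETE_COMPLETE"])
result = wait_for_delete(lambda: next(states), sleep=lambda _: None)
```

With `grace_polls=0` the same sequence fails on the first poll, which matches the behaviour described in the log. The sketch does not cover a cluster that disappears entirely (for example a not-found response from the API).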