[09:56:13] minor side quest for ipv6-enabled proxies: https://openstack-browser.toolforge.org/proxy/ now displays project names instead of IDs in the project column
[09:58:29] nice
[09:58:51] names are still unique right?
[10:01:29] yes
[10:02:59] 👍 I was not sure how that ended up with all the ids discussion
[10:13:51] i wrote a tiny script to update existing security groups filtering on 172.16.0.0/21 (the old vlan subnet) to filter on the new ip ranges instead: https://phabricator.wikimedia.org/P75871
[10:14:07] it works fine on the few small projects i tested it in, planning to run it globally in a few moments
[10:16:13] ack
[10:26:57] actually, s/in a few moments/a bit later when a.dnrew is around as well/
[10:33:24] 👍 sounds reasonable :)
[10:42:19] taavi dcaro can I get a +1 to https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/228 ?
[10:44:27] dhinus: lgtm, not sure how the deployment must go though (if there's any special flow or something needed)
[10:45:13] I hope that merging and applying will work fine, worst case I think tofu apply could fail if the API returns a conflict, and re-applying a second time should fix it
[10:47:23] okok
[10:48:15] applying now
[10:48:56] worked fine on the first run
[10:49:20] "Apply complete! Resources: 2 added, 0 changed, 2 destroyed."
[10:49:41] dig returns the expected result
[10:54:54] \o/ nice
[10:54:57] * dcaro lunch
[11:11:17] I checked what's causing the alert for tofuinfratest: it's no longer magnum, it's trove failing to create a database. I cleaned up old instances in state ERROR, retrying now
[11:12:02] and now magnum failed again with "Unable to create a new web proxy for tofuinfratest.wmcloud.org: Internal Server Error"
[11:12:47] but trove worked fine this time
[11:14:33] maybe the web proxy is not related to the magnum error actually
[11:18:58] webproxy failure seems to be a designate permission issue
[12:29:18] hmm... it seems that pkginfo on bullseye fails with the latest builds of packages (that use wheel metadata version 2.4), that's why ci was failing
[12:29:36] (the scheduled pipeline that upgrades poetry dependencies)
[12:30:36] https://www.irccloud.com/pastebin/IQNdkK6M/
[12:31:16] how far away are we from not caring about the old buster grid bastion? (and so being able to not care about anything pre-bookworm)?
[12:32:13] this is quite close to being ready https://gitlab.wikimedia.org/dcaro/toolforge-bastion-container
[12:32:32] but will need testing from the user side (probably changing some flows)
[12:33:41] it's been a bit since I tested it though
[12:34:16] it should be picking up the config from the environment, and using the creds from the tool home, it worked out of the box to run toolforge cli last time I checked
[12:46:08] taavi: I'm not all the way up/awake yet but that security group script looks good. And also that weird max/min thing is surprising!
[12:47:55] andrewbogott: cool, I'll run it globally then
[12:48:26] we probably need it in codfw1dev too don't we?
[12:51:55] andrewbogott: umh, something's not working right
[12:52:04] getting a bunch of failures for 'Quota exceeded for resources: ['security_group_rule'].'
[12:52:40] hm, any chance the script is creating all the new rules in the auth project rather than in the target project?
[12:52:50] definitely possible :/
[12:53:10] i'm not 100% sure yet but it seems like the neutron client interpreted that as "already exists" and proceeded to delete the old rule anyway
[12:55:14] https://www.irccloud.com/pastebin/d66siDHX/
[12:55:22] that's a lot of security group rules!
[12:55:31] Basically harmless though
[12:55:41] * andrewbogott looks at script again with benefit of hindsight
[12:57:17] Around line 78 you need something like "project_client = clients.neutronclient(project_id=project)"
[12:57:21] andrewbogott: yeah, I think this has been dropping the old rules without checking whether the creation of a new one succeeded
[12:57:25] and then pass that client to backfill
[12:58:04] ok -- so we also need to do some excavation to recover the lost rules, huh?
[12:58:47] seemingly yes :/
[12:58:50] got an alert for quarry
[12:59:05] that might be from this, sorry
[12:59:12] okko
[12:59:14] dcaro: there are going to be lots of alerts, can you be on silence-alerts duty while we sort this?
[12:59:20] ack
[12:59:29] andrewbogott: i think most of those are the default ssh rule which should be easily backfillable, but the hard part is everything else
[12:59:44] taavi: I propose a db table restore. In /srv/backups there are recent dumps of the neutron db
[13:00:15] andrewbogott: do you know if neutron will automatically pick up changes to those after restoring the table? since they won't be updated via the api
[13:00:21] but in theory that sounds like the best plan
[13:00:29] I don't know but I would restart neutron after anyway
[13:00:36] sgtm
[13:00:43] and let's also dump what we have /now/ manually just to be extra cautious
[13:00:56] andrewbogott: dcaro: i'm also formally declaring that as an incident
[13:01:23] Are you comfortable doing the mysql magic? I don't immediately know how to selectively import just the one table (which seems safer than restoring the whole dump)
[13:02:13] i can have a look
[13:02:18] did you take the current state dump already?
[13:02:26] I'll be incident coordinator, creating doc and such
[13:02:28] nope, haven't done anything yet
[13:02:30] thank you dcaro
[13:02:32] thanks, was just about to ask
[13:03:14] Give me a couple minutes to do part of my morning routine, then I'll be more useful
[13:03:35] andrewbogott: do you know offhand the table name?
[13:03:57] answering myself: securitygrouprules seems like
[13:04:32] the current state is dumped to my home directory on cloudcontrol1011
[13:05:07] (sudo mysqldump -u root neutron securitygrouprules > securitygrouprules.sql)
[13:06:04] seemingly all 3 cloudcontrols take database backups at the exact same time?
[13:06:33] new doc in https://docs.google.com/document/d/1brh1R4wZyLuVr0MksoizqbqZqgPLqTkQrk_CKxdRgmI/edit?tab=t.0 , writing things
[13:06:50] * andrewbogott is back
[13:06:59] * dhinus is also back
[13:07:20] taavi: you only need to restore on one, galera should take care of the rest
[13:07:35] andrewbogott: yeah, I was just checking where the most up-to-date backup was
[13:07:53] ah, yeah. would be better to stagger them
[13:08:13] but it's also unlikely that there have been user adjustments in the meantime, it's a pretty rare activity.
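For reference, a minimal sketch of the corrected flow discussed at 12:57, using openstacksdk rather than the internal `clients.neutronclient` wrapper the actual paste (P75871) uses, and assuming openstacksdk's `connect_as_project` helper is available; the replacement range here is a placeholder, not the real new prefix. The two points that matter are creating replacements through a connection scoped to the owning project, and only dropping the old rule once the new one was actually created:

```python
import openstack
from openstack import exceptions

OLD_PREFIX = "172.16.0.0/21"
NEW_PREFIX = "172.16.128.0/24"   # placeholder, not the real replacement range

admin = openstack.connect(cloud="eqiad1")   # admin-scoped, only used for listing

for rule in list(admin.network.security_group_rules(remote_ip_prefix=OLD_PREFIX)):
    # Re-scope to the project that owns the rule, so the replacement is created
    # in (and counted against the quota of) that project rather than the
    # admin/auth project -- the root cause of the quota errors above.
    project = admin.connect_as_project(rule.project_id)
    try:
        project.network.create_security_group_rule(
            security_group_id=rule.security_group_id,
            direction=rule.direction,
            ether_type=rule.ether_type,
            protocol=rule.protocol,
            port_range_min=rule.port_range_min,
            port_range_max=rule.port_range_max,
            remote_ip_prefix=NEW_PREFIX,
        )
    except exceptions.SDKException as exc:
        # Creation failed (e.g. quota exceeded): keep the old rule in place
        # instead of assuming "already exists" and deleting it anyway.
        print(f"skipping rule {rule.id}: {exc}")
        continue
    project.network.delete_security_group_rule(rule)
```

Re-scoping per project is also why the fixed script is slower, as noted later: each project scope needs its own keystone token.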
[13:08:27] andrewbogott: I used https://stackoverflow.com/questions/1013852/can-i-restore-a-single-table-from-a-full-mysql-mysqldump-file to extract the specific table we want, it's in cloudcontrol1011:~taavi/restore/securitygrouprules.sql
[13:08:39] I think the next step is to `TRUNCATE` the current table, and then load that file
[13:09:14] anything before i do that?
[13:09:28] I'd feel better if we try this same operation in codfw1dev first
[13:09:56] Just to make sure there aren't unforeseen foreign key integrity issues &c.
[13:10:01] that cause neutron to crash
[13:10:17] It won't be a perfect test but it's a lot better than nothing
[13:11:00] ok, did the same prep on cloudcontrol2004-dev
[13:11:05] should i go ahead there then?
[13:11:08] yep!
[13:11:56] ERROR 1231 (42000) at line 25: Variable 'character_set_client' can't be set to the value of 'NULL'
[13:12:30] after patching out that row from sql it seemed to load
[13:12:49] btw did you run the broken script in codfw1dev earlier?
[13:13:00] I ask because I was looking at a group in codfw1dev that was empty and now is populated
[13:13:17] no
[13:13:17] no
[13:13:25] Oh, but you truncated the table
[13:13:28] yes
[13:13:28] so that's what I was seeing surely
[13:13:39] ok, anything else before continuing?
[13:13:42] ok, now I'm running the cookbook to restart neutron services in codfw1dev
[13:13:46] (not the router, just the other bits)
[13:14:09] done
[13:14:26] nothing crashed, let's see if I can still ssh...
[13:14:47] I can!
[13:14:49] * taavi still waiting for an explicit yes or no for continuing
[13:14:55] so go ahead and restore on eqiad1
[13:14:59] cool
[13:15:02] that's the explicit 'yes'
[13:15:30] ERROR 1452 (23000) at line 32: Cannot add or update a child row: a foreign key constraint fails (`neutron`.`securitygrouprules`, CONSTRAINT `securitygrouprules_ibfk_3` FOREIGN KEY (`standard_attr_id`) REFERENCES `standardattributes` (`id`) ON DELETE CASCADE)
[13:15:46] is that with the truncate or with the import?
[13:15:50] import
[13:16:16] hmmm
[13:16:34] seemingly standardattributes has data relating to security group rules as well
[13:16:54] So we can import that one too. I'm just trying to think what order...
[13:17:20] i think standardattributes first, since securitygrouprules references that but not the other way around
[13:18:10] sounds correct to me
[13:18:28] doing
[13:18:34] https://www.irccloud.com/pastebin/CjqmwBsc/
[13:18:53] (deleting standardattributes might delete securitygrouprules though, just backup as much as possible xd)
[13:18:54] ahh, it's getting a bit too complicated now ERROR 1701 (42000): Cannot truncate a table referenced in a foreign key constraint (`neutron`.`address_groups`, CONSTRAINT `address_groups_ibfk_1` FOREIGN KEY (`standard_attr_id`) REFERENCES `neutron`.`standardattributes` (`id`))
[13:19:15] argh
[13:19:17] hmpf...
[13:19:21] address_groups is empty
[13:19:28] well I think it's still better to do selective tables rather than the whole dang db
[13:19:33] but we always have that to fall back on
[13:19:37] i could do a delete from instead of truncate
[13:19:45] we'll still have to fix up auto_increments manually afterwards
[13:20:25] uh we have a minor problem, which is that address_groups is used for non-security-group-things as well
[13:20:25] What do you think, safer to just do the whole db? I don't really know what we lose from that
[13:20:42] * andrewbogott checks the timestamp of the last backup
[13:20:44] we have backups right?
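For reference, the single-table extraction mentioned at 13:08 (the Stack Overflow answer uses sed) can also be done with a small script like the sketch below; filenames are assumptions. Note that, as seen at 13:11, the extracted chunk can still contain session-variable lines (the `character_set_client` one) that need patching before it loads cleanly:

```python
#!/usr/bin/env python3
# Sketch: copy one table's section out of a full mysqldump file.
# Each table's section in a mysqldump starts with a
# "-- Table structure for table `NAME`" comment and runs until the next one.
import re

DUMP = "neutron-full-dump.sql"        # assumed filename of the full-db dump
TABLE = "securitygrouprules"
OUT = f"{TABLE}.sql"

start = re.compile(rf"^-- Table structure for table `{TABLE}`")
stop = re.compile(r"^-- Table structure for table `")

copying = False
with open(DUMP, encoding="utf-8", errors="replace") as src, open(OUT, "w") as dst:
    for line in src:
        if copying and stop.match(line):
            break            # reached the next table's section
        if start.match(line):
            copying = True   # the DROP/CREATE/INSERT statements follow this header
        if copying:
            dst.write(line)
```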
[13:20:52] dropping the whole db would brick recently created instances I think
[13:21:05] but I think that's better than nothing
[13:21:47] Might any of the security group stuff you're working on interfere with traffic to WMCS?
[13:21:50] yes
[13:21:58] Excellent, thank you
[13:22:04] slyngs: https://docs.google.com/document/d/1brh1R4wZyLuVr0MksoizqbqZqgPLqTkQrk_CKxdRgmI/edit?tab=t.0
[13:22:08] we're going back... 8 hours?
[13:22:14] that's not too terrible...
[13:22:23] I think we should just do the full restore.
[13:22:54] i took a full dump of the current state (sans the already truncated securitygrouprules table) to my home dir in 1011
[13:23:07] andrewbogott: should I take that as a "do it now"?
[13:23:09] yeah, good insurance
[13:23:13] taavi: yeah :/
[13:24:00] andrewbogott: ok, all done
[13:24:18] * andrewbogott restarts neutron services including the routers
[13:24:38] `os security group rule list` is taking a while
[13:25:19] just had my ssh connection to proxy-03 die, but it's back to reachable
[13:47:10] the output looks good, and I've verified that the new rules end up in the correct project
[13:47:33] great, thank you for indulging me :)
[13:47:45] andrewbogott: just saw, yes please :)
[13:47:51] (about the email)
[13:48:25] there's a bunch of puppet failures on tools, looking, probably from the incident
[13:48:39] https://etherpad.wikimedia.org/p/networkoutage
[13:49:06] oh I should include the actual outage window...
[13:50:07] andrewbogott: https://phabricator.wikimedia.org/P75875 i suspect none of these were created in the last 8 hours, so maybe we don't need to include that
[13:51:36] thanks for the tz math taavi
[13:52:07] what does sheepish mean?
[13:52:29] dictionary says 'showing embarrassment from shame or a lack of self-confidence.'
[13:52:41] puppet alerts clearing out
[13:52:46] taavi: ok, so strike the bit about recreating VMs?
[13:53:01] dcaro: I don't have to include that, maybe I'm the only sheepish one :)
[13:53:02] yes
[13:53:37] oh, it implied that I wasn't sheepish. fixed :p
[13:53:43] andrewbogott: I'm ok with that :), not sure shame is what I feel though
[13:54:06] i think i am a bit :-P
[13:54:42] also verified the updated script works fine in a few test projects in eqiad1, and creates the rules in the correct project
[13:54:42] hahahaha, those who do things, break things, np
[13:55:02] This is by far the smallest of the outages we've caused on the road to ipv6
[13:55:17] yes, but the only one caused by my mistake :D
[13:55:22] maybe we need a "I broke CloudVPS and I fixed it" tshirt :)
[13:55:24] taavi: ok, go ahead and run the script in eqiad1, that way we avoid having to send a second email :D
[13:55:34] I don't need so many shirts!
[13:55:47] oh wait, no, i did also cause a full outage for a bit when switching cloudnet drivers by forgetting to change one variable in the new config
[13:55:49] anyway, running
[13:55:56] (though I might like having them xd)
[13:58:07] the script is at least a lot slower this time, i think because it needs to talk to keystone to get a token for each project
[13:58:23] yep, that's a familiar behavior
[13:59:17] what do y'all think about publishing https://docs.google.com/document/d/1brh1R4wZyLuVr0MksoizqbqZqgPLqTkQrk_CKxdRgmI/edit?tab=t.0 on wikitech? Minus the chat, I assume
[14:00:15] I'm ok with the chat too
[14:00:37] +1 for publishing (with or without the chat)
[14:00:40] I was thinking I'd remove it for brevity...
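As an aside on the foreign-key errors at 13:15 and 13:18 that pushed the restore from "one table" to "the whole db": the tangle (securitygrouprules → standardattributes ← address_groups and others) can be mapped up front by asking information_schema which tables hold foreign keys pointing at standardattributes. A hedged sketch, not something run during the incident; connection details are placeholders:

```python
import pymysql

# Placeholder credentials; on a cloudcontrol the root credentials file would be used.
conn = pymysql.connect(host="localhost", user="root", password="********",
                       database="information_schema")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT TABLE_NAME, CONSTRAINT_NAME
          FROM KEY_COLUMN_USAGE
         WHERE REFERENCED_TABLE_SCHEMA = 'neutron'
           AND REFERENCED_TABLE_NAME = 'standardattributes'
         ORDER BY TABLE_NAME
        """
    )
    for table, constraint in cur.fetchall():
        # Every table printed here would block a plain TRUNCATE of standardattributes.
        print(f"{table}  ({constraint})")
```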
[14:00:43] * andrewbogott creates a wikitech page
[14:00:46] I would file an Incident Report, per our process
[14:00:51] https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Incident_response_process#Writing_an_incident_report
[14:00:53] the irc channel is already logged at https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud-admin/, no need to re-publish that to wikitech
[14:01:08] +1 to taavi, maybe just the bolded lines are enough
[14:01:30] oh wow I haven't done this for a while, template is much fancier
[14:01:49] you can mark it as a draft and we can refine it later
[14:02:43] andrewbogott: the script was finished
[14:04:41] great
[14:05:05] email sent, incident report is at https://wikitech.wikimedia.org/wiki/Incidents/2025-05-07_cloud-vps_security_groups_deleted
[14:05:16] I'll fill out more, I just published so I'd have a link for the email
[14:05:27] ack, thanks!
[14:05:32] thank you!
[14:05:47] thanks for the quick response/calm repairs everyone!
[14:08:19] oops, I may have clobbered someone's changes
[14:08:38] don't worry, go ahead, the mark was hurting my eyes a bit xd
[14:08:44] oh, that's all I did too
[14:09:22] I'm going to do a test for the quarry alert, I think it should have shown in alertmanager, but maybe as prometheus was affected it did not make it to the prod prometheus
[14:09:55] (essentially I'm going to force it to trigger)
[14:09:59] fyi
[14:11:01] ahh, it's missing the extra labels 💡
[14:11:48] I think team:wmcs should be enough
[14:12:07] https://www.irccloud.com/pastebin/yR1lDiXe/
[14:12:22] ack, okok, I'm triggering the alert, might take 5min to get there :)
[14:12:25] looks good. I wonder why it reached my mailbox, though
[14:12:40] we are project members maybe?
[14:12:52] hmm I think that doesn't affect prometheus alerts, but I might be wrong
[14:13:41] i don't think it should
[14:14:11] ahh, cloud-feed is contact_group 1, and that was set in the default_contact_group_id
[14:14:11] the mail was sent to cloud-admin-feed@lists.wikimedia.org
[14:14:17] ah ok!
[14:14:39] So the issue with the quarry alert is that it appeared on the dashboard but didn't page or email?
[14:14:42] probably we forgot to add the labels whenever we added the contactgroup
[14:14:55] the issue is that it emailed, but did not appear on the dashboard
[14:15:01] *in the dashboard?
[14:15:27] oh, huh
[14:15:37] in and on both sound right to me
[14:19:23] where did you add the extra labels?
[14:19:32] ah I see it sorry
[14:19:40] in the "projects" table
[14:19:43] there it is
[14:19:45] https://usercontent.irccloud-cdn.com/file/juG3dFro/image.png
[14:20:03] I'll return it to the right expression :)
[14:20:28] note that it's not set to page (like paws I think)
[14:22:07] I think it's fine NOT to page
[14:24:45] I'm done fiddling with the incident report, anyone else who wants to edit it should feel free.
[14:25:09] and you can add a little 'x' in the action items section after the alert is fixed
[14:27:59] added a couple notes
[14:28:24] dhinus: yep, agree
[15:00:17] dhinus: I cleaned up tofuinfratest heat clusters that were in progress, now I'm running the very-safe-and-normal 'delete from cluster where project_id='tofuinfratest';' on the magnum db
[15:00:36] that will get you unblocked until the next one
[15:07:01] great thanks
[15:07:11] let me know when I can try re-running the tofu script
[15:09:22] andrewbogott: I received your cloud-announce post twice for some reason, it is duplicated in the archives as well https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/
[15:09:49] I sent it from the wrong email first and got a bounce notice
[15:09:54] maybe the bounce was a lie
[15:10:06] you can try re-running the tofu bits now
[15:10:13] thanks, re-running now
[15:11:10] re: bounce, it's weird, I see the same email address in both...
[15:11:40] I saw it twice too
[15:12:14] ¯\_(ツ)_/¯
[15:14:54] hm, sorry for the spam
[15:20:48] no worries, I was just curious to understand what went wrong
[15:38:32] andrewbogott: cluster creation has been ongoing for 16 mins, which is longer than usual
[15:39:07] you can dig down and see what's wrong with 'openstack stack list' and 'openstack stack resource list'
[15:39:12] "openstack coe cluster list" shows CREATE_IN_PROGRESS, but "openstack stack list" shows CREATE_FAILED
[15:40:19] yeah, magnum takes a while to notice
[15:40:43] ok so the difference is expected. the failing resource is kube_masters | 0b478bde-c9e5-492f-82bd-3f80f48782c7 | OS::Heat::ResourceGroup
[15:41:23] you can drill down further and probably figure out what specifically failed, might be a port attachment or a VM
[15:41:39] with 'stack resource show'?
[15:41:53] yeah, and 'resource list'
[15:43:01] Deployment exited with non-zero status code: 1
[15:43:59] ah there are nested stacks
[15:44:36] yeah, it's a deep hole
[15:45:19] "OS::Heat::SoftwareDeployment" failed
[15:45:26] in the nested stack
[15:46:26] resource show on that one gives a big dump of a shell script
[15:46:38] that apparently exited with 1
[15:47:54] so that means actually installing things on the VM failed I guess
[15:48:01] That or accessing the VM failed
[15:48:21] the last recognizable thing is a curl
[15:48:29] but I'm not sure if that failed or something else
[15:48:31] OK, so could be a firewall/network thing!
[15:48:49] btw here's a minor followup from today's outage: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1143126
[15:49:43] using -f yaml makes it more readable
[15:50:23] +1d the patch
[15:52:58] the security group for that VM (tf-infra-test-127-jte4z5zbwbgi-secgroup_kube_master-iwgebk74yj3k) doesn't have any ipv6 rules
[15:53:20] it would surprise me if magnum is trying v6 to access it, but it wouldn't surprise me /that/ much.
[15:53:26] So maybe the template needs updating to support v6
[15:53:37] oh wait, nm
[15:53:45] that VM doesn't have v6 enabled, it's on the v4 network
[15:53:52] I assume there's interconnectivity between the two networks.
[15:54:30] the curl seems to have timed out after 1 min
[15:54:45] but there's also another error in that big dump, that seems unrelated to the curl one: "KeyError: 'pem'"
[15:54:45] can you figure out where the curl was running, so we can retry it?
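The drill-down done by hand above (cluster stack → kube_masters → nested stack → failed SoftwareDeployment) can be automated for the next recurrence; a rough sketch, assuming the openstack CLI is configured for the right project, with a hypothetical stack name:

```python
#!/usr/bin/env python3
# Walk a Heat stack, including nested stacks, and print every resource that is
# not in a *_COMPLETE state together with its type. Brute-force recursion:
# it simply tries every physical resource id as a stack id and ignores failures.
import json
import subprocess

def cli(*args):
    out = subprocess.run(
        ["openstack", *args, "-f", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)

def report_failures(stack, depth=0):
    for res in cli("stack", "resource", "list", stack):
        status = res["resource_status"]
        if not status.endswith("_COMPLETE"):
            print("  " * depth + f'{res["resource_name"]} ({res["resource_type"]}): {status}')
        if res["physical_resource_id"]:
            try:
                report_failures(res["physical_resource_id"], depth + 1)
            except subprocess.CalledProcessError:
                pass  # not a nested stack, nothing to recurse into

# Hypothetical name; use the cluster's stack name or id from `openstack stack list`.
report_failures("tofuinfratest-cluster-stack")
```

Recent heatclient should also accept `openstack stack resource list -n <depth>`, which does much of this in a single call.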
[15:55:28] let me try to paste that full thing somewhere
[15:55:35] * andrewbogott cringes
[15:55:51] but there are some certificates, that might or might not be sensitive
[15:56:06] you can use this on a cloudcontrol: openstack stack resource show 7d14d567-3ce1-4d77-bf18-8a25dd733a8d master_config_deployment -f yaml
[15:57:13] server_id: 341a6608-286f-4557-9517-364473630aa3 < I guess that's where it was running
[15:58:47] Oh, it's running the curl on the...
[15:58:47] hm
[16:00:47] I don't see the timeout, what should I be searching for?
[16:01:01] unrelated, the proxy _also_ failed to create with tofu: Unable to create a new web proxy for tofuinfratest.wmcloud.org: Internal Server Error
[16:01:45] bah
[16:02:43] if you want a silver lining, all other resources were created successfully :)
[16:02:48] If you're patient, I'm interested in whether magnum is failing the same way every time
[17:03:09] welp, I've evoked a new behavior. 40 minutes in and everything is 'in progress' including heat
[17:15:48] * dcaro off
[17:15:52] cya tomorrow! o/
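One possible way to answer "where was the curl running and why did it fail" without pasting the whole resource dump: pull the deployment outputs (deploy_stdout/deploy_stderr) for the kube master server found above. A sketch, assuming the heatclient OSC plugin on the cloudcontrols provides the "software deployment" commands and that flag names match this release:

```python
#!/usr/bin/env python3
# List the software deployments attached to a server and dump their outputs,
# so the failing curl and the "KeyError: 'pem'" show up with full stderr.
import json
import subprocess

SERVER_ID = "341a6608-286f-4557-9517-364473630aa3"  # from the resource show above

deployments = json.loads(
    subprocess.run(
        ["openstack", "software", "deployment", "list",
         "--server", SERVER_ID, "-f", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
)
for dep in deployments:
    print(f'== deployment {dep["id"]} ({dep["status"]}) ==')
    # --all --long prints every output in full, including deploy_stdout,
    # deploy_stderr and deploy_status_code.
    subprocess.run(
        ["openstack", "software", "deployment", "output", "show",
         dep["id"], "--all", "--long"],
        check=True,
    )
```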