[07:57:02] taavi: hello! Want me to deploy https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1083589 or do you want more reviews?
[08:06:38] XioNoX: I'm happy to just deploy it
[08:07:29] taavi: sounds good!
[08:22:22] XioNoX: homer is showing a bunch of bgp peers being removed also, ok to commit these? https://phabricator.wikimedia.org/P70591
[08:31:47] taavi: that looks weird, let me check more
[08:31:53] * taavi assumes not
[08:44:16] ok, I see what's up
[08:49:48] topranks: hello! do you know if the bridges set up in Netbox are automated on the ganeti hosts or if they need to be done manually?
[08:51:16] XioNoX: the puppetdb import script should add them correctly once they have been created on the hosts
[08:52:36] topranks: cool! so we need to add "re-run the puppetdb import script" to the ganeti host provisioning workflow
[08:53:32] I ran it for 2038, at least all of those need it: https://netbox.wikimedia.org/virtualization/clusters/52/devices/
[08:53:36] yeah, there is the manual step of changing /etc/network/interfaces on the ganeti host after initial reimage and I think it's often not run afterwards
[08:53:56] did this cause some issue? Or relate to the k8s bgp thing?
[08:54:40] topranks: yeah it tries to remove all the BGP sessions to the core routers for the VMs in row C
[08:55:22] huh??
[08:55:25] because this only looks for an interface with a bridge https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/homer/deploy/+/refs/heads/master/plugins/wmf-netbox.py#103
[08:56:30] hmm ok didn't know we were so reliant on that
[08:56:45] we can do it better and allocate at provision stage I think
[08:57:36] short term we can just update the doc and let moritzm know
[08:57:53] longer term automate it more :)
[09:03:22] we could also make a small change in the Homer code to do an "if" on the primary_ip.connected_endpoint, and work out whether it's a bridge or physical
[09:04:09] that too yeah
[09:04:40] just having one host with the bridges stops the prompting for the BGP session removals
[09:05:34] Morning!
[09:07:08] they also have SLAAC IPs (but Netbox properly doesn't import them) https://phabricator.wikimedia.org/T265904
[09:07:10] XioNoX: yeah makes sense.
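For context, a quick way to check the SLAAC situation on one of those ganeti bridges; a minimal sketch, and the bridge name `private` is only an assumption, use the actual bridge device on the host:

  # is the kernel accepting router advertisements on the bridge?
  sysctl net.ipv6.conf.private.accept_ra
  # list IPv6 addresses on the bridge; SLAAC ones usually carry the "dynamic"
  # flag and an EUI-64 suffix (the ff:fe in the middle of the host part)
  ip -6 addr show dev private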
[09:08:13] They shouldn't have SLAAC IPs though I think, at least for the bridge devs the sysctl is set to disable accept_ra
[09:10:44] the puppetdb import shows "2620:0:860:103:d02c:c8ff:feaf:ed68/64: skipping SLAAC IP"
[09:11:13] yeah you're right
[09:11:56] I think we may need the "ip token" config in /etc/network/interfaces so that it ends up creating the same IP we have assigned statically from autoconf
[09:12:15] taavi: I pushed the firewall change to cr1-codfw, let me know how it goes with cr2 if you don't mind
[09:12:43] I'll talk to Moritz, I think he's doing ganeti2041 today, we can experiment with that and try to get it right
[09:13:34] XioNoX: sure, let me try
[09:13:45] cool
[09:16:02] homer is suuuuuper slow these days btw
[09:20:35] committed to cr2-codfw fine
[09:20:56] taavi: on core routers, yeah :( it's because of https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/homer/deploy/+/refs/heads/master/plugins/wmf-netbox.py#115
[09:22:45] once all servers are moved to their per-rack vlan, we could get rid of that code path (at least for codfw)
[09:23:30] or we could look at replacing it with a graphql query
[10:16:57] dcaro: welcome back
[10:17:10] please leave your comments on T377467 when you have a minute
[10:17:10] T377467: Decision Request - How to do the Cloud VPS VXLAN/IPv6 migration - https://phabricator.wikimedia.org/T377467
[10:23:49] that's going to take some focused time 👀
[11:10:26] * arturo brbr
[11:52:34] * dcaro lunch
[12:01:16] because of DST changes, a bunch of meetings are on weird schedules now
[12:01:38] we may want to review them
[13:01:36] * arturo brb
[13:49:40] I might take the opportunity to write the cookbook for it
[13:52:22] I don't think it is worth writing any automation at this point until the tofu-infra repo takes its final form, whatever that is
[13:52:39] the last 2 months we have been learning how to use it, shaping and refactoring as we go
[13:53:02] so whatever automation we write may be invalid soon
[13:55:26] I'll give it a try, if the automation is not very hard, it's worth more automating it than documenting it
[13:58:34] I personally quite like the process of creating a MR and getting a +1, then running "tofu apply" becomes the equivalent of running the cookbook
[13:59:01] though there are some things that we might not be able to do in tofu, which means that a cookbook might still be needed
[14:03:27] in other words, tofu ~= spicerack, creating the MR ~= pasting the cookbook arguments in IRC for a review, running tofu apply ~= running the cookbook
[14:04:05] for things that tofu supports, I would use tofu only. for things that tofu does not support, I would use cookbooks/spicerack only. just my 2c.
[14:07:35] I think I agree
[14:12:16] If anyone has opinions on https://phabricator.wikimedia.org/T360041 could they be shared? There seems to be the expectation that a small application should be expected to keep data forever, which seems kind of absurd to me, and keeping the data does introduce some issues with pii. Though I'm not really sure how to proceed given the current responses. Opinions on whether the data should be kept or removed are welcome
[14:17:18] dhinus: there are things in tofu that need more than one change (ex. create project + add people + set quotas, ...), until those are managed by tofu, a simple cookbook that creates the patches for you + waits for you to have the patches merged to do the rest of the stuff would be really useful imo (and that keeps the single point of entry for any maintenance activity in the cookbooks)
[14:18:52] in a very simple world, where every action is just one patch in the tofu repo, not using cookbooks would be doable, I don't think we are anywhere near that point
[14:19:35] (that would require us writing tofu modules for every action that we want to do, ex. having a `vps project` module that builds the openstack project, adds the users, sets the quotas, ...)
[14:26:13] there are also things that will never work with tofu (like reimaging a node, or draining a ceph node), so that also points to having the entry point for maintenance somewhere else
[14:47:15] Rook: I see why retaining results forever might have some value (re-running the same query is not always an option), but the pii issue is a big one, and IMHO justifies setting an expiration on results. I also like the idea of some kind of "export" option, if you need to store some data long-term.
[14:48:07] Would you see the export option as different from the existing download options?
[14:49:46] not sure, maybe the existing options are enough... but if krinkle or others want to submit a patch to save to phab or wiki I would not be against that
[14:50:34] I don't like the idea of retaining data within quarry, as to me quarry is not the right place for long-term storage
[14:50:37] That would be fine. Feels a little odd to use phabricator for that, but that's outside of my realm
[14:51:35] I agree with you that quarry is not the place for such. Do you think we should just override the current opinions in the ticket and start trimming old data or try to get some kind of consensus?
[14:52:18] maybe we could start with a 1-year policy to get rid of very old stuff, and slowly work towards 90 days or less?
[14:52:45] This kind of reminds me of some intro video that I watched when I started about what foundation staff are for. I think it was framed as being the parent, but was basically being the people to do unpopular things and thus be there to blame for the unpopular (but needed) things
[14:55:17] ha! tbh I have yet to find a good metaphor for the wmf/community relationship :)
[14:56:57] our technical governance model is weird and pretty much undocumented :-)
[14:57:24] It's basically the same thing that my mentor at IBM said many years ago. "Sometimes I'm just here to be someone to blame."
[14:57:56] Alrighty, let's make it a year and then see if we can't trim it back to more like 90 days in a while
[14:58:40] I've been blamed in the past for (checks notes) spending my free time working on a feature I personally wanted to use instead of something some other person thought was more important
[15:00:18] Rook: fwiw, starting from a year seems totally reasonable to me
[15:01:19] 👍
[15:01:40] And congratulations on being the recipient of such a delightfully absurd degree of blame
[15:04:01] +1 for starting with a year
[15:31:05] dcaro: regarding things that we cannot do with tofu, for example users or quotas, I believe the right thing to do is to add support for them
[15:31:14] which should be fairly easy
[15:44:56] Hello, dhinus et al. FYI there is a new spicerack release v8.15.1, I see the cloudcumin hosts are still on v8.8.0. Feel free to update them when you see fit. Let I/F know if you have any questions of course.
[15:46:03] volans: thanks, I will take care of upgrading cloudcumins!
[15:46:23] great, thanks!
[16:05:53] volans: I'm getting "ModuleNotFoundError: No module named 'conftool.extensions.dbconfig'" with the new version
[16:13:40] fixed with "apt install python3-conftool-dbctl"
[16:17:55] I also did an apt full-upgrade, I will reboot both cloudcumins to pick up the new kernel
[16:18:35] can anyone else log in to labtesthorizon? I'm getting an infinite redirect thing within the keystone sso internals
[16:20:13] taavi: same here, infinite redirec
[16:20:17] *redirect
[16:21:00] taavi: I can use labtesthorizon normally
[16:21:55] hmm. now it worked
[16:23:28] I was eventually logged in too
[16:29:57] something is up with DNS there, I'm having problems resolving the hostnames for most VMs in cloudinfra-codfw1dev :/
[16:41:07] taavi: I was playing with designate stuff for T377740 last weeb
[16:41:07] T377740: neutron: clarify why DNS extension is not enabled - https://phabricator.wikimedia.org/T377740
[16:41:11] last week*
[16:49:27] dhinus: have you ever tested doing a snapshot and restoring it in lima-kilo? maybe that's a better way than caching/rerunning ansible: set up the VM, create a snapshot, and restore the clean snapshot
[16:49:32] taavi: also, in general, besides my work, designate and rabbitmq have been misbehaving lately, so maybe it's related
[16:50:33] Raymond_Ndibe: have you tried the snapshotting also? (looking for MAC users, as I can test linux)
[16:52:00] * dcaro paged
[16:52:01] dcaro: no, never tested it, but we discussed it last week, I think somebody tried, maybe blancadesal?
[16:52:05] dcaro: no, I'm currently testing the harbor setup with the new harbor_tests container. another sync?
[16:52:39] dcaro: is the page cloudvirt1063? if yes you can ignore it
[16:52:51] yep, just recreated the silence
[16:52:55] I thought I silenced it for longer but maybe the silence expired
[16:53:04] arturo: yeah I think this is rabbit misbehaving
[16:53:04] also won't making a whole full snapshot defeat the purpose of deleting the VM in the first place?
[16:53:05] a few times it seems xd
[16:53:08] I would give it another couple weeks
[16:53:32] taavi: let me restart a few things
[16:53:36] Raymond_Ndibe: if you do the snapshot right after the first install, then you can revert to that, and using that snapshot for a couple weeks as base should be ok
[16:54:19] dhinus: ack
[16:54:34] dcaro: doesn't taking a snapshot mean the ansible won't ever run again?
[16:54:45] unless you run it manually, yes
[16:54:51] dhinus: did you just update the cloudcumin servers? I'm getting a weird error using the cookbooks
[16:55:00] https://www.irccloud.com/pastebin/aBmsvHGa/
[16:55:11] Raymond_Ndibe: and when you recreate the VM of course
[16:55:17] arturo: I did, what's the error?
[16:55:19] dhinus: keyholder needs to be rearmed
[16:55:24] taavi: For our OpenStack deployment the equivalent of the "it is always DNS" meme is very much "it is always rabbitmq"
[16:55:26] uh right, I'll do it
[16:55:41] https://www.irccloud.com/pastebin/zpYahUrj/
[16:56:16] hmm not sure where the permission error is coming from
[16:56:37] I guess it is just unarmed
[16:56:50] shouldn't puppet set that permission correctly?
[16:57:29] yes I think it's a false alarm
[16:57:43] as soon as keyholder arms the keys it should go away
[16:57:45] I assume when you arm keyholder it modifies the file
[16:57:49] doing it right now
[16:59:30] all keys armed
[16:59:54] dhinus: thanks, works now!
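For reference, the rearm step after an upgrade or reboot is roughly the following; a sketch only, the exact identities listed will differ per host:

  # on the cloudcumin host
  sudo keyholder status   # shows which key identities are armed / not armed
  sudo keyholder arm      # prompts for the passphrases and arms the keys
  sudo keyholder status   # confirm everything now shows as armed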
[16:59:54] dcaro: that sounds like a bad idea. Unless there is a way to select the things we want in the snapshot. A snapshot of the VM basically means taking the VM back to a previous checkpoint. To do the things that require us to recreate a VM in the first place, we'd need to do them without using the snapshot (checkpoint) and that defeats the purpose imo. But we can discuss that somewhere else.
[17:00:51] it's not replacing the VM creation, but making "resetting the vm" way faster (you don't have to recreate it, you can just restore a snapshot)
[17:00:59] taavi: try now?
[17:01:48] * arturo offline
[17:01:49] dcaro: I did try snapshotting, but on linux. I wasn't able to make lima's (experimental) snapshotting feature work for me, but creating an image from the diffdisk of a stopped instance using `qemu-img`, then starting another instance from this image (instead of the base debian one) worked
[17:02:58] arturo: the APIs started working but some internal names (like enc-1.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud and cloudinfra-db-02.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud) are still NXDOMAINing
[17:03:34] blancadesal: interesting, limactl snapshotting works for me, that'd be the 'base image' approach though, probably more portable, though might be slower
[17:03:47] dcaro: the page for cloudvirt1063 was still firing in victorops, I resolved it there, and set a silence in Icinga too (icinga is still paging victorops directly)
[17:04:05] dhinus: ack, thanks
[17:10:18] * dhinus offline
[18:10:21] * dcaro off
[20:55:37] taavi: I'm sorry :-( rabbitmq in codfw1dev is behaving like "60% of the time, it works every time"
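A rough sketch of the `qemu-img` "base image" approach mentioned at 17:01; the instance name and paths here are illustrative assumptions, lima keeps the per-instance disks under ~/.lima/<instance>/:

  # stop the instance so the disk is in a consistent state
  limactl stop lima-kilo
  # flatten the copy-on-write diffdisk (plus its backing basedisk) into a
  # standalone qcow2 image
  qemu-img convert -O qcow2 ~/.lima/lima-kilo/diffdisk ~/lima-kilo-base.qcow2
  # then point a new lima template's images: entry at that local file and
  # start fresh, already-provisioned instances from it instead of the base
  # debian image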