[14:13:11] andrewbogott: I'm not sure I get your comment in https://phabricator.wikimedia.org/T422801#11981516, are those fedora hosts puppetized?
[14:13:33] no
[14:13:43] those are magnum?
[14:13:46] yeah
[14:13:49] but they're just examples.
[14:14:07] I think that cumin needs to know in advance what username to use when contacting VMs and I don't know how it will know.
[14:14:13] and they live in the same dns namespace of the other hosts and don't have a common hostname prefix/suffix?
[14:14:34] yes
[14:14:38] that's bad :/
[14:14:55] well... we might be able to predict the host names most of the time
[14:15:01] Assuming that magnum doesn't change them :)
[14:15:14] I guess having something that mostly works will still be better than what we have now
[14:15:16] can we force a prefix/suffix?
[14:15:27] via magnum config I mean
[14:15:34] good question, that would help
[14:16:02] the other thing is that sometimes the distro for a given workload changes (e.g. magnum VMs will soon switch from fc to ubuntu or maybe debian)
[14:16:19] same hostname?
[14:16:22] if we can coerce them to have predictable hostnames when that happens then we're good, but I'm not 100% sure
[14:16:45] I
[14:17:21] maybe it could be easier to instead set up the root user at provisioning time?
[14:17:28] I'll read some more config docs. I think I was hung up on wanting an all-purpose solution; if we just add a bunch of special-case lines to the ssh config (and don't mind doing more in the future) then this might be possible-ish.
[14:18:55] for ssh config, does it have to be prefix/suffix or can we match something in the /middle/ of the hostname?
[14:20:32] https://man7.org/linux/man-pages/man5/ssh_config.5.html#PATTERNS
[14:21:46] to be tested but *foo*.*.eqiad1.wikimedia.cloud should potentially work
[14:22:36] ok, so we could match on *-node-* although then we'll trip over users who use the same naming scheme
[14:22:49] I guess I need to decide if I'm trying to make this work for all magnum workers or just paws+magnum
[14:23:20] node is too generic, has to be something we inject
[14:23:32] yeah, would be a lot better
[14:25:52] hm, the new driver uses a different naming scheme, examples "kube-9d4oa-5w4q6-qg7nv" and "kube-9d4oa-default-worker-n6q7c-cqwtb-bg8pt"
[14:26:01] But again, don't think that kube-* is specific enough
[14:27:46] yep
[14:27:58] how does it provision stuff? does it run cloud-init?
[14:29:42] are those VMs tagged in any way in Nova?
[14:31:32] at the moment it's using ignition and not cloud-init
[14:31:44] good question about nova tags, let's see...
[14:34:51] no special tags. But of course we can identify them based on the image name they're based off of (which I have been naming "*for-magnum" lately)
[14:35:27] There's a real maze of tasks about arbitrary worker naming, I haven't gotten to the end yet to tell what was decided :)
[14:36:53] I don't see much in the example config besides specifying a specific name
[14:37:05] but I think it's per cluster, not a global prefix
[14:37:29] for hostnames, this would likely be a cluster-api setting probably and not a magnum setting.
[14:37:47] since the cluster-api driver creates different hostnames, I assume that magnum is not specifying them
[14:37:58] sorry, this is confusing because we stand athwart two different backends right now
[14:38:23] I need to run a test and make sure that the root user is actually disabled with the latest driver -- that will take a few minutes.
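Note on the ssh_config patterns discussed between 14:20:32 and 14:26:01: per the linked man page, `*` matches zero or more characters and `?` matches exactly one. A rough way to check which hostnames a candidate pattern would catch is sketched below in Python. It handles single patterns only (no comma-separated lists or `!` negation), the helper names are made up for illustration, and the FQDNs in the sample list are hypothetical: they combine hostnames quoted in the chat with guessed project names.

```python
import re


def ssh_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate one ssh_config-style pattern into a compiled regex.

    Only the two wildcards from ssh_config(5) PATTERNS are handled:
    '*' (zero or more characters) and '?' (exactly one character).
    Every other character is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def pattern_matches(pattern: str, hostname: str) -> bool:
    """Return True if the whole hostname matches the pattern."""
    return ssh_pattern_to_regex(pattern).fullmatch(hostname) is not None


# Hypothetical FQDNs; only the short hostnames of the first two entries
# appear in the chat, the project components are guesses.
SAMPLE_HOSTS = [
    "kube-9d4oa-5w4q6-qg7nv.paws.eqiad1.wikimedia.cloud",
    "kube-9d4oa-default-worker-n6q7c-cqwtb-bg8pt.paws.eqiad1.wikimedia.cloud",
    "kube-experiments-01.someproject.eqiad1.wikimedia.cloud",
    "web-node-02.someproject.eqiad1.wikimedia.cloud",
]

if __name__ == "__main__":
    for candidate in ("kube-*.*.eqiad1.wikimedia.cloud",
                      "*-node-*.*.eqiad1.wikimedia.cloud"):
        hits = [h for h in SAMPLE_HOSTS if pattern_matches(candidate, h)]
        print(candidate, "->", hits)
```

Running it against a real host list would show quickly whether a generic pattern such as `kube-*` also catches unrelated user VMs, which is the concern raised above.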
[14:39:27] ok, the opposite approach could be to have a script that queries openstack APIs and generates the ssh config bits for those hosts with explicit hostnames, it could run every hour or so
[14:43:18] ok yeah, these worker nodes use 'debian@'
[14:44:20] a simpler path forward is to just fix this for /our/ magnum clusters: paws and quarry. Those will have predictable hostnames, and we can just treat third-party clusters as cuminless black boxes.
[14:44:31] Not ideal but still a big improvement on what we have today.
[14:50:40] can I get a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1297114?
[14:51:16] I'm still trying to lower the binlog write volume (see task), but we need to recoup some space and this seems an easy way
[14:52:02] done
[14:52:07] thanks!
[19:20:28] Amir1: should we make you a vps project to run your tiff compressor? If this is going to be a forever thing then it should probably live in prod k8s someplace but if you're also blocking the upload of uncompressed tiffs then we can run it in cloud-vps for as long as it takes and then shut down.
[19:20:56] It's also possible to grant extra-big ram quotas to toolforge tools, within reason.
[19:42:40] andrewbogott: I was actually about to create a ticket for requesting a cloud vps project for it but *gestures at the state of the world and wikimedia*
[19:43:00] one of my top priority things to do today
[19:43:28] there is a lot happening :) If you open a ticket and then bug me again in the morning we should get you set up (might not have a second +1 until then)
[19:44:03] Sure. Thanks
[21:31:39] taavi, a question for you to answer whenever: I need to add a security group rule to every Trove VM for cumin access. I can either do that by making a little timer that just polls and adds them, /or/ I can add a second management network to the Trove project, attach that and a new IP to each trove VM and add the security group to that. Trove does not implement automatic security groups for the primary interface, only for
[21:31:40] management.
[21:32:00] andrewbogott: for magnum clusters, isn't it possible that releng CI might move back and use internal clusters for their needs? and in that case, what would we do w/ cumin?
[21:32:05] I'm leaning poll-and-add because that's something I can whip up in no time and it applies retroactively. But here's your chance to say you hate that and will help with the latter instead :)
[21:32:48] bliviero: When you say 'internal clusters' you're talking about them moving /to/ cloud-vps, or away from cloud-vps?
[21:35:44] If you're asking about how my paws-specific cumin solution would help with CI clusters... the answer is that as long as we're talking about a handful of long-lived use cases in known projects, we can probably special-case each of them. It's inelegant, but I don't think there's any elegant solution for cumin to know what kind of VM a given VM is beforehand.
[21:36:02] But yeah, if we special-case known uses then that will always leave some random magnum workers unreachable.
[21:37:58] andrewbogott: moving /to/
[21:38:13] ok, then everything else I said applies :)
[21:38:46] andrewbogott: is there a way to interrogate some systems that would divulge the names of the active VMs in that "pool"? is that technically what you are doing?
[21:39:55] kind of. Cumin already has an openstack backend that aggregates things based on openstack tenant. So that could be expanded to know other things about kinds of VMs.
[21:40:35] But historically Riccardo has wanted cumin to remain a universal tool without a lot of knowledge about our particular openstack implementation. I don't think we can really keep with that and also add a bunch of special querying.
[21:41:06] But we might be able to do something generic-ish using VM tags outside of the VM name.
[21:41:34] I'm trying to avoid that (and adding distinctive naming strings) because the engine that actually creates the VMs is a bit of a black box and I'm happy with it staying that way :)
[21:43:31] Basically: Everything is possible if you're willing to hook into third-party code but then we pay the maintenance price for that for ages. I'm still hoping for a solution that uses existing configs.
[22:25:15] andrewbogott: Sorry it took long but it's here T428102
[22:25:15] T428102: Request creation of tiff compression VPS project - https://phabricator.wikimedia.org/T428102
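
Note on the 14:39:27 suggestion (a script that queries the OpenStack APIs and generates ssh config entries with explicit hostnames): a minimal sketch of the generation step follows. It assumes the server inventory has already been fetched elsewhere and is passed in as plain dicts, so no OpenStack client calls are shown. The record layout, function name, image names and project names are all hypothetical. The `debian` user, the `*for-magnum` image naming and the `eqiad1.wikimedia.cloud` domain come from the chat; the `<host>.<project>.<domain>` layout is inferred from the pattern quoted at 14:21:46.

```python
from fnmatch import fnmatch


def render_ssh_config(servers, image_glob="*for-magnum", user="debian",
                      domain="eqiad1.wikimedia.cloud"):
    """Build ssh_config stanzas for servers whose image name matches image_glob.

    `servers` is an iterable of dicts with the (assumed) keys
    'name', 'project' and 'image'. Each selected server gets an
    explicit Host entry so no hostname pattern is needed.
    """
    stanzas = []
    for server in sorted(servers, key=lambda s: (s["project"], s["name"])):
        if not fnmatch(server["image"], image_glob):
            continue  # not a magnum worker by the image-name convention
        fqdn = f'{server["name"]}.{server["project"]}.{domain}'
        stanzas.append(f"Host {fqdn}\n    User {user}\n")
    return "\n".join(stanzas)


# Hypothetical inventory; in practice this would come from an API query.
SAMPLE_SERVERS = [
    {"name": "kube-9d4oa-5w4q6-qg7nv", "project": "paws",
     "image": "debian-12-for-magnum"},
    {"name": "bastion-01", "project": "paws",
     "image": "debian-12.0-bookworm"},
]

if __name__ == "__main__":
    print(render_ssh_config(SAMPLE_SERVERS))
```

Selecting on image name instead of hostname sidesteps the prefix/suffix question, at the cost of having to re-run the script (hourly, as suggested) to pick up new workers.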