[07:47:55] that is quite nice
[07:48:13] *and having the ability to easily list the jobs per pipeline is a nice addition ( https://zuul.opendev.org/t/zuul/project/opendev.org/zuul/zuul )
[14:54:35] I went to dig into Nodepool and the drivers it offers ( https://zuul-ci.org/docs/nodepool/latest/configuration.html )
[14:55:00] and I feel there is a gap between the cloud drivers and the static/metastatic ones
[14:55:31] how so?
[14:55:40] hi James :)
[14:56:10] I was thinking again about the conversations we had about arbitrary code running on executor and worker nodes
[14:56:32] the executor offers some level of protection via Bubblewrap
[14:57:53] with the cloud drivers, the instance can be dropped at the end, ensuring any malicious usage is discarded
[14:58:08] but on a static/metastatic worker node there is no such mechanism
[14:58:19] agreed
[14:58:53] so my thought was that we could have the ssh daemon inside a vm that is dropped at the end
[14:59:07] and on the next connection to port 22 a new vm is started dynamically to answer the ssh request
[14:59:14] but that sounds like reinventing some wheel
[15:00:20] yeah, at some point, it starts to look like you're creating a cloud orchestration system.
[15:00:41] another thought I had was for Nodepool to drive a docker daemon that would spawn worker containers
[15:00:43] anyway yeah
[15:01:03] I probably do not want to pretend I can compete with AWS/Azure/GCE etc
[15:01:53] I somehow had a different model of the static driver
[15:02:31] the second thing sounds like k8s -- but i will say that it's pretty easy to run a simple standalone k8s, so if the alternative is "write a daemon that spawns containers", then i'd say you should consider just running microk8s or kind or something simple like that.
[15:02:44] (but obviously, that only works if the workloads can run in containers)
[15:03:06] most things run in containers
[15:04:08] would it be worthwhile to check in on what the previous issues with openstack were, and see if they either have been resolved in newer versions, or if perhaps they can be resolved?
[15:06:26] I guess so
[15:06:50] there was some discussion about using "Magnum" to spin up / provide a kubernetes cluster
[15:07:08] that is an unknown area to me
[15:07:30] but I think the advantage is that Nodepool would not interact with our openstack cluster
[15:07:41] (which serves multiple other projects besides CI)
[15:09:45] so yeah good point corvus, I will start digging
[15:11:14] I'll ask in #wikimedia-cloud-admin
[16:32:42] after chatting a bit with Andrew Bogott (who was around and assisted when we used Nodepool)
[16:33:58] so in short, Nodepool was triggering limits here and there (we have a bunch of tasks in our Phab, such as RabbitMQ being overloaded: https://phabricator.wikimedia.org/T170492 )
[16:34:14] ten years later, the stack is vastly different
[16:34:45] someone yesterday mentioned Magnum / OpenTofu
[16:35:15] we already use those two for some of our projects (namely Paws and Quarry), so it is known and they are even doing an upgrade this week \o/
[16:35:47] and Magnum can then be used to set up a K8s cluster which we could use for Nodepool. I guess we can do a proof of concept
[16:36:11] sounds promising!
[16:36:32] I imagine Nodepool then spins up pods/containers that listen for ssh connections coming from the Executor
[16:39:17] there's a very different code path for containers... we don't use ssh, but instead use kubectl exec. streaming logs are done with port forwarding. so the pods don't need to run ssh.
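To make the container path more concrete, here is a minimal sketch of what a Nodepool kubernetes provider could look like, following the kubernetes driver documentation linked above; the provider name, context, label, and image are placeholders, not an actual Wikimedia configuration. Nodepool would hand out pods under the label, and the executor would reach them via kubectl exec rather than ssh.

```yaml
# Hypothetical nodepool.yaml excerpt: a kubernetes provider handing out pods
# instead of VMs. All names below are placeholders.
labels:
  - name: pod-debian
    min-ready: 0

providers:
  - name: wmcs-k8s                 # e.g. a Magnum-provisioned cluster
    driver: kubernetes
    context: wmcs-k8s-context      # context from the kube config Nodepool uses
    pools:
      - name: main
        labels:
          - name: pod-debian
            type: pod
            image: docker.io/library/debian:bookworm
```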
[16:39:53] (that's a slight simplification, it's more like we use the "kubectl" connection plugin for ansible, which takes care of doing the exec)
[16:41:15] the base job playbooks are a little different because of that (because we can't "git push" to the remote node), but most of the job body can be similar.
[16:42:04] regarding network connections: the nodepool and executor pods need to be able to connect to the kubernetes api
[16:47:12] so the Executor runs an ansible playbook that has something like "K8s exec + param"?
[16:47:49] I think my mental model is starting to adjust :]
[17:36:16] dduvall and I had a quick chat. We can do some investigations via https://zuul-dev.wmcloud.org/ (a sandbox created on our OpenStack cluster with docker compose)
[17:36:16] it has a static node. So maybe we can try https://zuul-ci.org/docs/nodepool/latest/static.html#attr-providers.[static].pools.nodes.max-parallel-jobs
[17:37:15] but if I understood properly, having multiple jobs on a single static node causes a race condition: the temporary ssh key pair is added by two builds to the same ~/.ssh/authorized_keys, and eventually one gets removed
[17:37:41] (which leads me to wonder why there is a max-parallel-jobs option)
[17:37:51] but multiple unix users might do
[17:37:58] ---
[17:38:24] and maybe we don't need to use a k8s driver at the start, and static nodes are good enough
[17:38:53] I am off. It is nearly 8pm. Tomorrow is a holiday here in France, but I will attempt to attend the sync meeting nonetheless
[17:44:45] hashar: instead of using max-parallel-jobs (which is only appropriate in a very trusted environment), if you need to use the static driver, then create multiple users, and register each ip-user as its own static node. that resolves the ssh key conflict
[17:45:16] +1
[17:45:43] and the ansible commands running on the static workers are not isolated using bubblewrap, are they?
[17:45:57] the main ansible playbooks for most jobs don't need to change because they just say "shell: echo hi"; and ansible knows to "ssh" in the case of a VM or static node, or "kubectl exec" in the case of a pod.
[17:46:30] correct, not isolated. whatever happens on that node is entirely up to the job via the playbook, and zuul doesn't know or care
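As an illustration of the "one static node per ip-user" suggestion, a rough nodepool.yaml sketch; the hostname, usernames, and label are made up. Each (host, user) entry is registered as its own node, so concurrent builds never share a ~/.ssh/authorized_keys.

```yaml
# Hypothetical static provider: the same host registered once per unix user.
# Each entry is its own node, avoiding the authorized_keys race.
providers:
  - name: static-workers
    driver: static
    pools:
      - name: main
        nodes:
          - name: ci-worker-01.example.org
            labels: ci-worker
            username: zuul-worker-a
            python-path: /usr/bin/python3
          - name: ci-worker-01.example.org
            labels: ci-worker
            username: zuul-worker-b
            python-path: /usr/bin/python3
```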
[22:25:52] yeah, I was fretting today about the static nodes as well and hash.ar asked a lot of my questions already. For context, in our existing system, we have a job to set up some docker volumes and run zuul cloner, then at the end of the job we sigterm the container and wipe the workspace. So any code that's provided from users is only exercised within the docker container. Other than that, the
[22:25:54] job code is defined in our integration/config. So we have static nodes, but we throw away anything users could have touched at the end of the job (since if they spawned a process or whatever, it would be in the container's context and any files generated are ephemeral). In the new setup, with static nodes, seems like we'd be running arbitrary code from the internet on the static node itself
[22:25:56] and the only protection is the unix user/group protections, seems like it's pretty easy to taint a node. It seems like the options for us would be (a) find a way to throw away nodes either with nodepool or spinning up a k3s or something and using the k8s executor or (b) ...and I'm unsure about this one...we could exclude all the configuration items from untrusted projects (effectively:
[22:25:58] don't run untrusted ansible)
[22:31:42] that's a good summary. i think openstack or k8s are the best approaches: that's what opendev does (well, openstack at least, but k8s is equivalent here for this purpose).
[22:33:01] and i think it's a good idea to try to be similar there, since opendev and wikimedia are dealing with similar issues
[22:35:58] excluding dynamic configuration is possible, and along with the containerization, i think it could get you something equivalent to your current system. but you lose a lot of the benefits of zuul v3+, and users may be quite unhappy (and it will be more work for CI-specialists to manage those jobs)
[22:37:17] we've spent a lot of time thinking about how to make this safe for public use, so the closer we can hew to the trail already blazed, the better from a security standpoint i think.
[22:37:58] so i advocate for vms or k8s if possible
[23:00:25] that makes sense. I think we'll need a spike to investigate our options and come to a decision on the execution model that fits. I'd lean toward not fighting the system where possible :)
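For reference, option (b) maps onto the include/exclude filters that the Zuul tenant configuration offers for untrusted-projects. A hedged sketch, assuming a Gerrit source; the tenant name and the example repos are placeholders (only integration/config comes from the discussion above). With in-repo jobs excluded, untrusted repos cannot define the Ansible that runs on the nodes, at the cost of the dynamic-configuration benefits mentioned above.

```yaml
# Hypothetical Zuul tenant config excerpt: untrusted projects may not define
# jobs or other config items, so no untrusted Ansible runs on the nodes.
- tenant:
    name: wikimedia                  # placeholder tenant name
    source:
      gerrit:
        config-projects:
          - integration/config       # trusted, CI-specialist-managed job definitions
        untrusted-projects:
          - exclude:
              - job
              - project-template
              - nodeset
              - secret
            projects:
              - example/project-a    # placeholder untrusted repos
              - example/project-b
```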