[06:13:18] greetings
[06:26:22] I'm trying to access cvn-nfs-1.cvs via vm_console cookbook and it looks like the cookbook is currently broken on cloudcumin hosts
[06:26:51] https://phabricator.wikimedia.org/P83849
[06:27:28] ah yes of course, keyholder not armed
[06:29:37] {{done}}
[07:20:02] doh, my bad godog, sorry about that, did you arm it in both?
[07:20:13] volans: I did yeah, no worries
[07:20:27] totally forgot about it
[07:21:22] and missed the alert on -operations
[07:23:55] https://gerrit.wikimedia.org/r/c/operations/alerts/+/1195945/
[07:24:35] nice
[07:25:09] actually I think team: "{{$labels.team}}" is not needed
[07:25:28] i.e. team will be part of the resulting expression
[07:26:34] huh, indeed. I just copied that from some other rule :D
[07:28:56] heheh yeah I mean it is understandable
[07:29:23] I was on the fence on adding a check to ci because it isn't really an error
[07:29:36] and warnings are useless IMHO, so yeah
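For context on the exchange at [07:25:09]–[07:25:28]: labels that are already present on the result of an alerting rule's `expr` are copied onto the fired alert, so templating them again under `labels:` is redundant. Below is a minimal sketch of that point with a made-up metric and alert name, not the actual rule from the Gerrit change above:

```sh
cat > example_rule.yaml <<'EOF'
groups:
  - name: example
    rules:
      - alert: KeyholderNotArmed        # illustrative name only
        expr: keyholder_armed == 0      # hypothetical metric, assumed to already carry a `team` label
        for: 15m
        labels:
          severity: warning
          # no need for `team: "{{ $labels.team }}"` here:
          # the alert inherits `team` from the expression's output labels
        annotations:
          summary: "keyholder is not armed on {{ $labels.instance }}"
EOF
promtool check rules example_rule.yaml   # syntax/template check, ships with prometheus
```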
[07:56:41] taavi: so, based on yesterday's meeting the right place to start is in the toolforge-deploy repo duplicating the components/logging to components/tracing and adjusting -tools with -tracing more or less?
[08:01:35] volans: yep, you need to adjust https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/blob/main/toolsbeta/shared.tf?ref_type=heads#L53 to provision the s3 buckets, and then either duplicate the logging component or adjust that to deploy a new instance, both are possible
[08:01:59] great, thx, the bits I'm not really sure if they are needed or not are those in values/alloy/common.yaml.gotmpl
[08:04:52] mmh how does that single line create the 3 buckets I see in the config? tools-loki-logging-{chunks,ruler,admin} ?
[08:06:20] do we have a preference for new component vs additional instance in the logging component?
[08:19:50] morning!
[08:20:53] volans: there's a bit of indirection in loki but the individual buckets are managed in modules/shared/loki.tf
[08:21:16] i would maybe start in the logging component for now, and then move it out if it becomes too big or different compared to the other deployment
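To make the "indirection" mentioned at [08:20:53] concrete: Loki wants separate object-storage buckets for chunks, ruler and admin data, so a single entry in shared.tf typically gets expanded into three names inside the module. This is only a sketch of that naming pattern, not the actual contents of modules/shared/loki.tf, and the base name is an assumption:

```sh
cat > loki_buckets_sketch.tf <<'EOF'
variable "base_name" {
  type    = string
  default = "tools-tracing-logging"   # assumed name for the new instance
}

locals {
  # the three buckets the chat refers to: <base_name>-{chunks,ruler,admin}
  loki_bucket_suffixes = toset(["chunks", "ruler", "admin"])
}

output "bucket_names" {
  value = [for s in local.loki_bucket_suffixes : "${var.base_name}-${s}"]
}
EOF
tofu fmt loki_buckets_sketch.tf
```

The real module presumably attaches actual bucket resources to those names; the point here is just the one-line-to-three-buckets expansion.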
[08:22:52] sounds good, I'll give it a try (currently rebuilding my lima-kilo)
[08:27:40] I can test it in lima-kilo, right?
[08:28:17] yep
[08:31:23] the credentials will be the same (e.g. lokiObjectStorageAccessKey) or do I need to duplicate them too?
[08:35:29] same, as it's using the same project
[08:38:39] right, per openstack project users in ceph, you told me that :D
[08:43:00] what are you talking about?
[08:43:10] (the components/tracing)
[08:45:59] dcaro: TL;DR it was discussed in the infra meeting that it would be nicer to store the cross-tools tracing data (both nfs and network) in a loki instance to have both graphs but also an easy way to dig into the actual usage when needed
[08:46:41] it was deemed better not to use the existing loki for tools logging and hence create an additional loki instance for this (resource wise shouldn't require too much)
[08:48:24] will that duplicate all the loki components?
[08:48:45] (note that we are still figuring out how loki works itself, it's not yet fully functional)
[08:49:01] ex. https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/997
[08:49:24] T402736
[08:49:25] T402736: toolforge logs appears to suffer from intermittent latency - https://phabricator.wikimedia.org/T402736
[08:49:34] T400917
[08:49:35] T400917: [jobs-api] Allow customizing time to request Loki logs for - https://phabricator.wikimedia.org/T400917
[08:49:50] and when discussing where to put this instance (VMs in cloudVPS or within toolforge k8s) it was pointed out that we do already have the machinery to setup loki and so it would be quicker/easier to have another small instance for that
[08:50:37] only the loki instance will be within k8s toolforge, the log send part (most likely alloy) will be on the workers
[08:50:55] so setup in a different way I think, have not yet investigated that part
[08:51:17] currently we deploy alloy from within k8s on every worker
[08:57:51] ack
[08:58:35] note that the main namespace is `loki` so it will have to use a different one
[08:59:07] (there's a bunch of network policies also related to it that will have to be duplicated too, and double checked to make sure they don't collide)
[09:00:44] take also into account that there's currently a bug in the loki helm chart that prevents the single-binary deployment (the one in lima-kilo), from working out of the box, and needs an extra network policy (like the current loki has, will need duplicating + adapting to the new names)
[09:00:51] yes I noticed both, I don't think we need the alloy policies right now as those if I understand it correctly are related to the intra-k8s communication that we'll not have
[09:02:57] you'll need one to shut all communication then from within k8s at least
[09:04:38] there is also the authn/z part to figure out as loki doesn't have it and we might not be able to manage it with just network rules (allow ingestion from the host but disable from within k8s)
[09:05:29] yep, I was thinking on how to prevent others from using whichever ingress you set up for ingestion
[09:06:12] probably reverse proxy with basic auth or cert
[09:08:19] on the k8s ingress side?
[09:08:52] maybe as a sidecar, exposing only that one
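Two sketches for the points raised at [09:02:57] (shutting off in-cluster access to the new instance) and [09:06:12] (basic auth in front of ingestion). The namespace, hostname and credentials are placeholders, and a real policy would need explicit allow rules on top for whatever ingress path is actually kept:

```sh
# Default-deny ingress for the (assumed) `tracing` namespace
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: tracing              # placeholder; the existing instance lives in `loki`
spec:
  podSelector: {}                 # applies to every pod in the namespace
  policyTypes:
    - Ingress                     # no ingress rules listed => all in-cluster ingress denied
EOF

# From the host/worker side, pushing through a basic-auth reverse proxy
# to Loki's standard push endpoint:
curl -s -u tracing-ingest:REDACTED \
  -H 'Content-Type: application/json' \
  -X POST 'https://tracing.example.invalid/loki/api/v1/push' \
  --data '{"streams":[{"stream":{"job":"nfs-tracing"},"values":[["'"$(date +%s%N)"'","hello from the host"]]}]}'
```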
[09:16:27] I have a side quest(ion) re: tools nfs if that's ok, I don't want to derail the loki conversation
[09:17:56] I'm cautiously optimistic re: stuck nfs workers, however tools-nfs-3 flavor is the smallest because I mis-provisioned it. I'd like to resize the VM to give it at least 2x cpu and mem, maybe 30 min downtime total. ok to just do it or should I grab a toolforge maint window and announce it ?
[09:18:47] and my expectation is that nfs clients will recover by themselves once the nfs server comes back
[09:29:34] I'd say better to announce it, even if it's not with a whole week, but one day to the next
[09:30:00] something like 'after the last one, we still need to do a shorter intervention....'
[09:30:18] SGTM, I'll send the announcement now for tomorrow morning, thank you dcaro
[09:33:45] btw. memory usage is still higher than before the nfs move
[09:33:49] https://usercontent.irccloud-cdn.com/file/rikKkDhv/image.png
[09:35:21] I have the impression that it's the other way around though
[09:35:28] (as in that might be free memory, not used)
[09:37:44] I'm rebooting cloudgws for https://phabricator.wikimedia.org/T407110
[09:39:11] ack
[09:39:15] yep that was free memory 🤦‍♂️
[09:40:54] lol
[09:41:01] taavi: good luck
[09:41:46] well, I would be if the cookbook would work or Gerrit would be up so I could send the fix for review
[09:42:15] you can run from your laptop if you want
[09:42:52] it's not a very long-running thing, so unlikely to have network issues
[09:43:00] I could if I wanted to continuously tap my yubikey for every SSH invocation, easier to wait for gerrit to come back up
[09:43:09] xd
[09:43:09] :D
[09:44:08] homer-typing-bird.gif
[09:44:24] lol
[09:48:27] another question for collective you, I'd like to poke cinder via its CLI and I'm currently failing to feed it the correct credentials/arguments https://phabricator.wikimedia.org/T406688#11264300 I guess I'm after sth like "wmcs-cinder" that would do the correct setup similar to wmcs-openstack
[09:48:40] is there something like that or what variables should I pass ?
[09:52:01] openstack is trying to merge all the project-specific CLIs to the main openstack one, so in theory 'cinder attachment-list' should have been replaced by 'openstack volume attachment list'.. but that's empty?
[09:53:43] ah mhh ok that could be it, I was hoping to be able to get a list of currently 'reserved' attachments for that volume, but maybe not
[09:54:39] i.e. the previous message in that task has the response from cinder showing a bunch of reserved attachments which I'm not able to list otherwise
[09:55:51] but yes it'd make sense if cinder and openstack cli are basically equivalent
[09:57:44] I've created https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/92 for the buckets
[10:00:56] the failures to install the tofu provider are T405742 if you'd like a good riddle
[10:00:57] T405742: tofu-provisioning: Failed to install provider - https://phabricator.wikimedia.org/T405742
[10:01:29] dhinus has a patch for that already :-)
[10:02:15] yep and I think the package is now available
[10:02:31] so we can probably merge this one? https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/71
[10:02:37] yep, just approved it
[10:02:42] thanks!
[10:03:25] now I have to rebuild the docker image
[10:33:43] * dcaro lunch
[10:43:58] dhinus: did you manage to build it?
[10:44:38] taavi: sorry been sidetracked by another thing, will do it shortly!
[10:49:41] * taavi now rebooting codfw1dev cloudgws for real
[11:01:03] rebooting eqiad1 cloudgws
[11:02:55] taavi: https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/72/diffs
[11:03:00] I wonder if we should just use :latest?
[11:03:55] +1d
[11:05:17] godog: I have a feeling T407206 might be related to the NFS server switchover
[11:05:38] in particular, the nfs.svc.toolforge.org proxy still points to tools-nfs-2
[11:15:32] tofu-provisioning is now using the new CI image, but the problem is not entirely solved: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/jobs/645769
[11:16:25] another job worked fine, but this time it failed with tofu.wmcloud.org as well with certificate errors
[11:17:55] hm, is the new image missing `ca-certificates` somehow?
[11:20:40] ah that would explain it, maybe trixie needs an explicit apt install?
[11:20:55] but then one job succeeded? checking
[11:21:12] * taavi rebooting eqiad1 cloudwebs
[11:21:46] the job that succeeded was still using the old image
[11:21:51] * dhinus updates the image again
[11:29:42] I'm doing some checks, ca-certificates is not installed in the new image, but it should come as a curl dependency?
[11:30:30] as a strict Depends:, or as Recommends:/Suggests:?
[11:31:36] interestingly, it looks like a strict depends in "debian:trixie" but a recommends in docker-registry.wikimedia.org/trixie
[11:32:44] that's not a thing (https://packages.debian.org/trixie/libcurl4t64 says it's Recommends:)
[11:33:06] what's probably different is that the wmf images don't install recommends by default
[11:34:45] hmm maybe that's the reason, but that seems valid for the bookworm image as well
[11:34:53] so I'm not sure how it worked before
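A quick way to confirm the theory from [11:29:42]–[11:33:06]: WMF base images are built without installing Recommends:, and on trixie ca-certificates is only a Recommends: of libcurl, so it has to be declared explicitly. The commands below are a sketch; the image tag is an assumption:

```sh
# Is the package there at all?
podman run --rm docker-registry.wikimedia.org/trixie:latest \
  sh -c 'dpkg -s ca-certificates >/dev/null 2>&1 && echo present || echo missing'

# How does libcurl reference it on this image? (no network needed, dpkg
# prints the Depends:/Recommends: fields of the installed package)
podman run --rm docker-registry.wikimedia.org/trixie:latest \
  sh -c 'dpkg -s libcurl4t64 | grep -E "^(Depends|Recommends):"'

# Likely fix in the CI image's Dockerfile: install it explicitly instead of
# relying on Recommends:
#   apt-get update && apt-get install -y --no-install-recommends curl ca-certificates
```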
[11:41:48] taavi: thank you for the heads up, looking
[11:42:55] TIL nfs.svc.toolforge.org
[11:46:00] fwiw that proxy pre-dates the cloud-private network which allows cloudcontrols to talk to the nfs hosts directly, so the proxy could be removed now (T392794)
[11:46:00] T392794: maintain-dbusers: Use cloud-private to talk to NFS servers instead of proxies - https://phabricator.wikimedia.org/T392794
[11:46:47] hah, interesting ok thank you taavi
[11:47:26] taavi: to verify T407206 I suppose I can launch maintain-dbusers again on cloudcontrol1007 and observe success ?
[11:47:28] T407206: replica.my.cnf missing for tool glamspore - https://phabricator.wikimedia.org/T407206
[11:48:33] godog: it runs as a systemd service, you should see it doing things in the log
[11:49:45] Oct 14 11:44:58 cloudcontrol1007 maintain-dbusers[2931113]: INFO [root._populate_new_account:718] Wrote replica.my.cnf for tool tools.glamspore
[11:49:48] indeed
[11:49:51] you will need a similar thing for toolsbeta
[11:50:08] ack, fixing
[11:50:47] not seeing any nfs-related webproxy for toolsbeta
[11:50:51] hmm, I thought we had some monitoring for maintain-dbusers
[11:50:59] might be going to the nfs directly?
[11:51:49] iirc toolsbeta just doesn't have wiki replicas access?
[11:53:06] oh, yep
[11:53:18] or toolsdb :/
[11:54:17] godog: nothing to fix then :
[11:54:19] :)
[11:54:51] not sure if \o/ or /o\
[11:55:02] |o|
[11:55:11] xd
[11:55:13] that's better than the -o- I was going for
[11:55:30] lol
[11:57:16] there's T384591 where we started thinking on making toolsdb a toolforge-api kind of service, that should help making it reproducible in more than one env, ex. lima-kilo
[11:57:16] T384591: [dbaas,toolsdb] Add support for management of toolsdb databases within toolforge - https://phabricator.wikimedia.org/T384591
[12:05:04] taavi: https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/73
[12:05:25] ship it
[12:06:05] taavi: I tried understanding how it got installed in the old image but I'm still not sure
[12:12:09] and one more https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/74
[12:16:01] lgtm
[12:21:10] taavi: any more thoughts on https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/997 ?
[12:21:52] i have not yet had the chance to read through that much text and data
[12:22:54] summary is, even without per-tool limits, a misbehaving pod will get only 1/Nth of the rate, and will not exhaust any memory on alloy side, just take very long to catch up with the logs
[12:23:12] N being the number of pods in that worker
[12:23:36] hmm. so that would mean that the effective per-tool rate limit depends on what the other tools on that pod happen to be doing?
[12:24:31] yep, with an ensured 1/Nth at least
[12:24:47] up to full rate, if no other tools are sending logs
[12:25:11] and N is how much in practice? about 50 iirc?
[12:25:13] (similar to cpu requests/limits currently)
[12:25:22] I'd say more like 1/20th
[12:25:28] but have not checked overall
[12:26:21] and we're back to square one, Client.Timeout :( https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/jobs/645842
[12:26:41] anyway, main thing I care about is that the infrastructure is not going to be overloaded, so if you're fine dealing with the potential support load from limits that differ in practice then I think I'm okay with that
[12:29:01] there's between 20-40 pods per worker, so 1/20-40th
[12:29:31] not sure what you mean with limits that differ in practice
[12:30:52] updated the limits to that 40x instead of just 10x
[13:09:19] what's the correct way to see in lima-kilo changes I have locally in the toolforge-deploy repo?
[13:10:09] from the shell in lima-kilo
[13:10:15] you can go to ~/toolforge-deploy
[13:10:28] git fetch && git reset --hard or any changes you want
[13:10:43] then, from that dir, ./deploy.sh
[13:11:16] that's for changes directly to toolforge-deploy, if you want to test a specific MR for a specific component (ex. components-api), then there's toolforge_deploy_mr.py
[13:11:32] (most toolforge scripts and utils start with `toolforge`, so you can )
[13:12:35] you can run the functional tests with `toolforge_run_functional_tests.sh` (the full suite takes ~20min, you can filter by component, or test directly if you prefer, see `--help`)
[13:14:17] great, thanks
[13:18:48] godog: prometheus-node-pinger reports that connectivity is working on cloudceph1050; want to try pooling a drive before you go tonight?
[13:22:19] andrewbogott: yes for sure, I likely can in an hour or so
[13:22:31] great
[13:23:06] I have reworked the single/double nic mechanism a little in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1194967
[13:23:37] plus related patches in the same topic to get the 'interface' module to DTRT
[13:24:11] seems good
[13:25:27] I was assuming we'd only do the flip on reimage but your way certainly allows us to change our mind faster :)
[13:26:23] not sure I'm following re: reimage
[13:27:06] I'm working on the assumption that moving single/double nic does not require a reimage
[13:27:33] If I understand correctly: the majority of your changes are to support graceful switching back and forth between single- and double-nic on a running host
[13:27:53] which seems better than reimaging any time we switch
[13:28:23] yes indeed, roll the change out with no reimage
[13:34:13] andrewbogott: you merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1195773 before I could comment you've made the list no longer be properly sorted like the comment says
[13:34:28] bah, ok, I'll re-sort
[13:35:46] oh, actually, my addition is (relatively) sorted but the list wasn't sorted before? prometheus comes between pdns and pybal doesn't it?
[13:36:59] I do not see either pdns or pybal on that diff
[13:39:46] you're right, it's Bullseye that was always out of order
[13:40:13] anyway, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1196070
[14:10:58] andrewbogott: do you think https://gerrit.wikimedia.org/r/c/operations/puppet/+/875899 needs a cloud-announce@ post to go with it? (and sorry moritzm I lost track of that after saying I'd merge it :/)
[14:11:40] Yes, I think it's worth emailing.
[14:11:59] and maybe a callout on https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances ?
[14:13:37] agreed, I don't think it will break anyone's setup, but an explicit announcement won't hurt
[14:18:30] https://etherpad.wikimedia.org/p/cloudvps-no-agent-forwarding
[14:27:31] lgtm
[14:37:44] does ./deploy.sh in lima-kilo log somewhere something more useful than STDOUT/ERR?
[14:40:52] not really, it's helmfile underneath if that helps
[14:42:07] yeah that was clear that was helmfile :D
[14:43:09] you can try running with `bash -x`
[14:43:13] if it's the bash part
[14:43:24] if not, you might be able to tweak the script and add the debug to helmfile
[14:43:27] nah it's the helm part I see we support HELMFILE_OPTIONS
[14:43:46] trying passing --debug to helmfile
[14:43:50] +1
[14:44:12] that usually gives you the helm commands, that then you can try running that helm command yourself to get nicer errors
[14:44:39] it's clearly something in my patch, but unclear what
[14:44:53] the error is just: * timed out waiting for the condition
[14:45:18] that usually means that it's waiting for a pod to come up, or a replicaset to start everything
[14:45:26] you can try using `k9s` to check what's going on
[14:45:34] (or kubectl if you prefer plain cli)
[14:49:44] ack thx
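For the "timed out waiting for the condition" error at [14:44:53]: helm reports that when a release's pods never reach Ready within the wait timeout, so the useful detail is usually in the pod events rather than in helmfile's output. A few generic checks, with namespace and pod names as placeholders:

```sh
# anything not Running/Completed is a good first suspect
kubectl get pods --all-namespaces | grep -vE 'Running|Completed'

# the Events section at the bottom usually names the real problem
# (image pull, failed scheduling, crash loop, missing secret/configmap, ...)
kubectl -n <namespace> describe pod <stuck-pod>

# recent events across the namespace, newest last
kubectl -n <namespace> get events --sort-by=.lastTimestamp | tail -n 20

# if the container started and died, look at the previous attempt's logs
kubectl -n <namespace> logs <stuck-pod> --previous
```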
[15:56:11] * dcaro off
[16:04:26] andrewbogott: I ran out of time for today, we can put 1050 in service tomorrow tho
[16:04:37] ok
[16:11:04] taavi: one mystery is solved, using "podman image history" I verified that the old tofu-provisioning image had extra packages installed with "apt-get", so it was built with a Dockerfile slightly different from the one in git
[16:11:21] that explains why ca-certificates was present in the old image, not in the new one
[16:11:30] *but not
[16:11:33] dhinus: to me that seems to raise more questions than it solves
[16:11:53] ha well, I'm personally satisfied :)
[16:12:24] I'm back to figuring out why curl is sometimes still failing even on the new image
[16:12:31] but much less frequently than on the old one
[16:12:33] good luck :-)
[16:12:36] :)
[16:12:58] related, I found one minor issue in the pipeline https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/75
[16:13:47] that makes the formatting part run twice, but I guess that's ok if the alternative is yet another job
[16:14:16] ah you're right, didn't think of it
[16:14:41] but probably ok... otherwise I could add "tofu fmt modules/" in parallel
[16:15:49] but that would be another job yes
[16:40:15] some progress: using "--net=host" fixes the issue on all images
[16:40:26] T405742
[16:40:26] T405742: tofu-provisioning: Failed to install provider - https://phabricator.wikimedia.org/T405742
[16:41:02] I'll stop here for today :)
[16:41:08] * dhinus off
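Two checks matching the investigation at [16:11:04] and [16:40:15]: `podman image history` shows the layer commands an image was actually built with (which is how the Dockerfile drift was spotted), and running the failing fetch with and without `--net=host` separates a TLS/trust problem from a container-networking one. `<ci-image>` is a placeholder for the tofu-provisioning CI image reference:

```sh
# full, untruncated layer history: compare against the Dockerfile in git
podman image history --no-trunc <ci-image>

# same request, default network vs. host network namespace
podman run --rm <ci-image> curl -sSfI https://tofu.wmcloud.org/ \
  || echo "failed with the default network"
podman run --rm --net=host <ci-image> curl -sSfI https://tofu.wmcloud.org/ \
  && echo "ok with --net=host"
```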
[16:55:58] komla, akosiaris, I've created a wiki page for the 2025 cloud-vps project purge: https://wikitech.wikimedia.org/wiki/News/2025_Cloud_VPS_Purge#Projects_and_Instances
[16:56:22] akosiaris, please edit with the wording you want up top. komla, want me to send the announcement email or are you happy to take that on?
[16:57:28] andrewbogott: should that include a warning about any bullseye VMs needing updates like with previous os deprecation iterations?
[16:58:09] hmmm I guess bullseye is being deprecated isn't it
[16:58:10] So probably
[16:58:15] I'll update the dump
[16:58:46] actually, huh, how did all those warnings about buster get in the older page?
[17:09:31] you ran some script and then they appeared there :D
[17:10:02] huh