[07:49:12] greetings
[08:29:38] morning!
[08:51:06] hola
[08:59:56] hello
[09:18:53] dcaro: would components/infra-tracing/ and namespace: infra-tracing-loki work?
[09:19:34] that sounds great yep
[09:21:05] ok, sending patches with the new name
[09:47:50] FYI I'm going ahead with single nic for cloudcephosd1049 shortly https://gerrit.wikimedia.org/r/c/operations/puppet/+/1203384?usp=dashboard
[10:00:17] godog: ack 👍, the other one went smooth right?
[10:00:23] dcaro: yes correct
[10:00:50] ok {{done}}
[10:07:06] also went as expected, i.e. no impact afaics
[10:07:19] nice!
[10:07:50] 💃
[10:08:53] yeah! good times \o/
[10:33:15] quick review for the name change of loki tracing: https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-admission/-/merge_requests/32
[10:33:43] +1
[10:33:51] <3 thx
[10:34:39] and then I need to deploy it like last time correct?
[10:34:49] cookbook wmcs.toolforge.component.deploy --cluster-name toolsbeta --component ingress-admission
[10:35:13] yes
[10:35:18] and merge it after it is deployed
[10:35:22] (that always confuses me)
[10:38:30] volans: you merge that first, that will create an MR in toolforge-deploy bumping the version of ingress-admission chart, then you use the cookbook to deploy
[10:39:05] ok, that's the bit I forgot, thx
[10:39:10] there's a diagram in https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy
[10:51:11] ok, re-read the readme, and the cookbook knows how to pick the right MR based on tagging/
[10:51:14] ?
[10:51:25] yep
[10:51:30] and reports back with the test results
[10:51:35] (in the MR)
[10:51:53] and if there are more than one bump chart MRs?
[10:52:13] it's based on the branch name, for each component it uses bump_
[10:52:42] ok
[10:53:01] if you merge more than one mr in a component, the toolforge-deploy branch will have a "bumping from 0.x.y to 0.x.y+2", it will accumulate all the version upgrades in one branch
[10:53:13] ack
[10:53:42] mmmh why is it bumping also jobs-api and logs-api?
[10:55:42] it should not, you mean that the version table shows differences on those too?
[10:55:55] for toolsbeta, yes
[10:55:59] | jobs-api | chart | jobs-api | jobs-api-0.0.454-20251117174820-34492113 | toolforge-deploy has jobs-api-0.0.452-20251110182401-c5c5c1c0 |
[10:56:02] | logs-api | chart | logs-api | logs-api-0.0.7-20251117174451-6f8660fc | toolforge-deploy has logs-api-0.0.6-20251103173901-dee61950 |
[10:56:16] sorry, last line wrong paste
[10:56:16] | jobs-api | chart | jobs-api | jobs-api-0.0.454-20251117174820-34492113 | toolforge-deploy has jobs-api-0.0.452-20251110182401-c5c5c1c0 |
[10:56:29] ah no correct,
[10:56:46] it seems that toolforge-deploy clone it used has an old version of jobs-api, looking
[10:57:24] the version seems correct in the gitlab code https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/main/components/jobs-api/values/toolsbeta.yaml?ref_type=heads
[10:57:34] maybe the local clone it used is not updated correctly
[10:57:38] looking
[10:57:59] is it possible that bump has not been deployed to toolsbeta?
[10:59:18] what does it say about registry-admission?
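A minimal sketch of the comparison behind that version table, assuming shell access on a bastion or control node with helm and a toolforge-deploy clone available; the component name, release name, environment, and clone path here are illustrative assumptions, not taken from the chat:

```bash
#!/usr/bin/env bash
# Compare what helm reports as installed with what the toolforge-deploy clone
# pins for a given environment (all names below are placeholders).
COMPONENT=jobs-api
ENV=toolsbeta
CLONE=~/toolforge-deploy

# Chart version actually running in the cluster, according to helm
helm list -A -o json | jq -r ".[] | select(.name == \"${COMPONENT}\") | .chart"

# Chart version pinned in the clone for this environment
grep chartVersion "${CLONE}/components/${COMPONENT}/values/${ENV}.yaml"
```

If the two differ, either the bump MR has not been deployed yet or the local clone is stale, which is exactly the ambiguity being chased in the conversation above.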
[11:00:59] volans: it deployed the right version `ingress-admission-0.0.72-20251118104433-d892c480`, from the version table
[11:01:14] ingress-admission-0.0.72-20251118104433-d892c480 <- ingress-admission-0.0.71-20251117175543-3e629bb9
[11:01:27] yes but I was not expecting the diff for the other two
[11:01:51] results in ingress-admission-0.0.71-20251117175543-3e629bb9
[11:01:57] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1082#note_175578
[11:02:17] * volans arguing with his clipboard today
[11:03:13] yep
[11:03:16] looking into that
[11:03:35] the one on the left is the installed version, the one on the right is the version in the clone of toolforge-deploy it's using to deploy
[11:06:55] we should stop using my user for this xd
[11:07:14] eheheh indeed
[11:07:29] toolsbeta looks ok to me :/
[11:07:40] https://www.irccloud.com/pastebin/xGswqBqd/
[11:07:46] but it's easier to blame everything on you this way :-P
[11:10:14] dcaro: if I run utils/toolforge_get_versions.sh on the tools-bastion-14 I get the same diff for the 3 components
[11:10:35] what user are you using?
[11:10:51] depends on the user?
[11:11:01] no, but depends on the clone of the repo, might not be up to date
[11:11:16] it uses the clone of toolforge-deploy repo to check the version that's there
[11:11:26] and compares it with the one given by helm/apt
[11:11:30] ahhh ok
[11:12:13] for example my user has a repo that's outdated now
[11:12:15] https://www.irccloud.com/pastebin/Scdciv2M/
[11:12:41] with a fresh clone I get:
[11:12:42] | ingress-nginx-gen2 | chart | ingress-nginx-gen2 | ingress-nginx-4.11.5 | toolforge-deploy has ingress-nginx-4.13.3 |
[11:12:42] but the cookbook should checkout the branch first, getting the latest revision (well, the one from that branch)
[11:13:31] toolsbeta or tools?
[11:13:38] tools-bastion-14
[11:13:41] tolls
[11:13:42] *tools
[11:13:43] hmm
[11:13:48] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/main/components/ingress-nginx/values/tools.yaml?ref_type=heads
[11:14:07] shows `chartVersion: 4.11.5`, maybe it's getting the version from the wrong values file?
[11:14:19] what's the path to your clone?
[11:14:28] /home/volans/toolforge-deploy
[11:14:52] I get the same yep
[11:15:00] https://www.irccloud.com/pastebin/HiHjc4pF/
[11:15:02] looking
[11:15:37] it's using local.yaml
[11:15:37] +++ grep chartVersion /home/dcaro/toolforge-deploy/components/jobs-emailer/values/local.yaml
[11:15:43] that's not correct :/
[11:16:40] to be fair, we have never had a different version in different envs
[11:17:31] ok, so the diff for ingress-nginx-gen2 is expected and an issue of the way utils/toolforge_get_versions.sh does the diff, nothing to worry about?
[11:18:29] yep
[11:18:34] do you think it's safe for me to proceed to deploy ingress-admission to tools?
[11:18:42] yep 👍
[11:18:51] thanks, fingers crossed
[11:18:53] I'll fix the version script to handle different versions in different envs
[11:22:27] the fix https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1083
[11:22:57] * dcaro lunch
[11:23:05] I'm online, ping me if you need any help
[11:23:32] tested and +1ed
[11:24:47] ouch cookbook failed with helm failing with: Error: UPGRADE FAILED: context deadline exceeded
[11:26:35] did I break something? (grafana doesn't seem to scream)
[11:26:44] should I just retry?
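For context, a hedged sketch of how one might inspect what state the release was left in after an "UPGRADE FAILED: context deadline exceeded", assuming helm and kubectl access from a control node; the release and namespace names are guesses based on the component name, not confirmed in the chat:

```bash
# Which revision is marked deployed vs failed; the previously deployed one
# should normally still be the live one
helm history ingress-admission -n ingress-admission

# Overall state of the release as helm sees it
helm status ingress-admission -n ingress-admission

# Any pods left stuck mid-rollout?
kubectl get pods -n ingress-admission
```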
[11:27:10] usually helm will just keep the old version running if it fails to upgrade
[11:27:33] * volans running utils/toolforge_get_versions.sh to be sure
[11:27:34] "kubectl get events" might have some info, or I would browse with k9s from a control host
[11:27:48] dhinus: on tools? from which host?
[11:28:17] tools-k8s-control-7.tools.eqiad1.wikimedia.cloud
[11:32:16] That's for the ingress-admission?
[11:33:44] yep
[11:33:56] there's a new pod that is stuck in ContainerCreating
[11:34:01] the old pods are still running
[11:34:13] I'm not seeing any errors though, not sure why it's stuck
[11:34:59] see https://phabricator.wikimedia.org/P85362
[11:35:02] and comments below
[11:37:16] seems stuck at
[11:37:16] Normal Pulling 17m kubelet Pulling image "tools-harbor.wmcloud.org/toolforge/ingress-admission:image-0.0.72-20251118104433-d892c480"
[11:37:32] from describe pod
[11:39:09] in this case it's safer to kill the creating pod and let k8s retry OR retry the whole cookbook?
[11:39:24] in case this was some sort of network blip that put it in a stuck state
[11:39:27] I think it's safe to kill it, but I would check if we can pull that image
[11:39:38] dhinus: on toolsbeta it worked fine
[11:39:53] toolsbeta pulls from toolsbeta-harbor, I think
[11:40:07] ah ok
[11:40:07] or maybe not? not sure :D
[11:40:22] I assumed we were deploying the same image from the same repo
[11:40:29] s/repo/registry/
[11:40:32] not build twice
[11:40:44] I'm pulling it locally
[11:40:49] it seems to be slow locally as well
[11:41:03] "podman pull tools-harbor.wmcloud.org/toolforge/ingress-admission:image-0.0.72-20251118104433-d892c480"
[11:41:10] it's getting stuck :/
[11:41:31] Lost connection to orc for a bit
[11:41:52] ok so first question first, assuming we fix the registry, and k8s will pull the image, will it break something given that the cookbook has exited?
[11:42:11] It should not
[11:42:13] does the cookbook or helm do any critical step after the deploy?
[11:42:26] Not for most components
[11:42:44] ok
[11:42:51] Some might need redeployment or manual running of the steps that failed
[11:42:52] the old version is also not pulling, I think there's some issue with harbor
[11:43:18] (ex. Yekton migrating cards, Loki doing some post-hooks)
[11:43:27] *tekton
[11:43:43] *crds
[11:43:58] XD, autocorrect is shifty
[11:47:29] The object storage quota is pretty high, but not full
[11:47:30] https://grafana.wikimedia.org/d/7120b794-4638-49f5-bccd-9716efc60f24/wmcs-object-storage-quotas?orgId=1&from=now-30m&to=now&timezone=utc
[11:47:41] (the object one specifically)
[11:52:08] I never saw this issue before. should we try restarting harbor?
[11:53:34] back on my laptop, just pulled that image without issues
[11:53:36] https://www.irccloud.com/pastebin/ecDdsMBA/
[11:54:08] it seems it finished pulling the image on k8s too?
[11:54:22] eys
[11:54:23] yes
[11:54:24] https://www.irccloud.com/pastebin/E2utNq3p/
[11:54:30] retrying on my laptop...
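If the registry had stayed broken, the "kill the stuck pod and let k8s retry" option discussed above would roughly look like the sketch below, assuming the pod belongs to a Deployment that will recreate it; the namespace and pod name are placeholders:

```bash
# Find the pod stuck in ContainerCreating (placeholder namespace)
kubectl get pods -n ingress-admission

# Deleting it lets the Deployment/ReplicaSet schedule a fresh pod, which
# retries the image pull from scratch
kubectl delete pod -n ingress-admission <stuck-pod-name>

# Watch the retry and any pull errors
kubectl get events -n ingress-admission --sort-by=.lastTimestamp | tail
```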
[11:54:49] the pods are all up
[11:54:50] yes now it worked
[11:54:51] checking version
[11:55:17] :/ I feel that flaky errors are worse than non-flaky ones
[11:55:51] volans: you can try redeploying if you want, it should not really do much, but it will run the functional tests
[11:56:05] ack, doing
[11:56:55] there was definitely something weird happening
[11:57:02] indeed
[12:05:37] deploy completed: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1082#note_175606
[12:05:53] seems all good, merging the patch
[12:12:31] actually, re-looking at the flow it needs a final approval
[12:12:36] so once approved I'll merge
[12:14:47] that last approval is probably not needed anymore
[12:15:22] we did not have functional tests before, so double checking by someone was pretty critical, now unless there's a complex flow to test, should be ok to self-merge
[12:17:30] btw, is the source of the flow diagram shared somewhere in case we need to edit it?
[12:17:57] it's in the docs/ dir in the repo, along with the png
[12:17:58] found it
[12:18:11] (or whatever image format)
[12:18:22] yep png :)
[12:18:35] ack
[12:20:03] * dhinus lunch
[12:23:44] this looks like a cron somewhere
[12:23:47] https://usercontent.irccloud-cdn.com/file/kOJXV2wi/image.png
[12:24:00] the stuttering there, and testlabs getting out of quota
[12:24:53] yes
[12:26:59] hmm... might be the script that updates the stats
[12:28:25] yep, I think it should be creating a new file then move it to the final path
[12:37:31] this should be the fix for that I think, testing it
[12:37:31] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1206866
[12:39:11] the file can be big?
[12:39:37] if not and can stay in memory you can also just store the data in memory and then prom_file.write_text(content)
[12:39:42] no, it's kinda small
[12:39:53] that's an option yep
[12:40:02] but either works if that's the issue
[12:45:49] now you can check tools hitting the rate limit with `root@tools-k8s-haproxy-8:~# tail -f /var/log/haproxy/haproxy.log | grep -E 'tool-rate=[0-9]{3,}'` :)
[12:46:03] (none right now I think)
[12:47:01] let's put that in some wiki page :D
[12:50:19] * volans lunch
[12:51:26] dcaro: did you make sure toolviews still parses the log files correctly?
[12:51:40] nope
[12:51:57] looking
[12:54:34] is the cloudrabbit2*-dev setup functional or WIP? I tried to install the rabbitmq-server security update on 2001, but the postinst triggers a restart of rabbitmq-server.service which doesn't complete. it's unrelated to the rabbitmq update itself, it also happens with the previous deb if I downgrade
[13:00:38] I think they should be in use yes
[13:01:30] the config is using rabbitmq02.codfw1dev.wikimediacloud.org, that points to
[13:01:31] https://www.irccloud.com/pastebin/yTIeUETT/
[13:02:04] codfw installation has been breaking a bit lately though, taavi and andrewbogot.t were looking into it
[13:02:13] (afaik)
[13:03:46] in cloudrabbit2002-dev the service is up and running though
[13:03:49] https://www.irccloud.com/pastebin/Iw7YcF4B/
[13:06:17] taavi: toolviews looks ok to me
[13:06:21] https://www.irccloud.com/pastebin/mnCjaR6n/
[13:10:31] 2002 hasn't been updated yet, but on 2001 it's now also reproducible with having it reverted back to +deb13u1
[13:10:58] as in, the restart does not finish?
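A rough sketch of the first checks for a rabbitmq-server restart that never completes, assuming a stock rabbitmq-server install on the cloudrabbit hosts; nothing here is specific to this incident and the commands are generic:

```bash
# Is systemd still waiting on the start job, and what did the unit log recently?
systemctl status rabbitmq-server
journalctl -u rabbitmq-server --since "1 hour ago"

# If the node comes up at all, does it manage to (re)join the cluster?
rabbitmqctl cluster_status
```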
[13:15:27] I think rabbit might be misbehaving there :/
[13:15:41] might need full cluster restart
[13:16:25] gtg for an errand, I can try to check when I come back if nobody gets to it sooner
[13:38:02] yeah, I can see in systemctl that it attempts to start it, but nothing really proceeds until it eventually hits some timeout
[13:38:21] while e.g. the rabbitmq proc on 2002 took six seconds the last time it got started
[14:21:46] dcaro: is it fine if I quickly reboot tools-harbor-2 to pick up the MTU fix?
[14:36:33] taavi: should be ok,
[14:39:21] done
[14:39:34] I had to manually start the services in docker-compose, which is not ideal
[14:43:10] andrewbogott: do you still need these instances? abogott-test-instance.account-creation-assistance.eqiad1.wikimedia.cloud abogott-testvm.wikicommunityhealth.eqiad1.wikimedia.cloud
[14:43:56] taavi: nope, please delete
[14:44:12] doing, thanks
[14:44:58] hmm, I thought puppet would be starting them :/
[14:45:12] (not sure if it was a systemd unit, or directly puppet though)
[14:56:17] moritzm, dcaro, are you still wrestling with puppet?
[14:56:23] I mean, puppet + rabbit?
[14:57:55] I'm not currently actively wrestling, but if you know how to make rabbitmq on 2001 start, that would be fantastic
[14:58:03] something seems to make it stall
[14:58:43] I will try a full purge and reinstall, if you have not done that already
[15:04:00] andrewbogott: I did not get a chance to get to it yet, +1 for reinstall
[15:04:40] the purge+install seems to have worked but of course that wiped out the cluster config so now I'm re-clustering the other nodes
[15:13:34] moritzm: as unlikely as this sounds, I think rabbit was unwilling to cluster across the two different versions. With all three nodes running the updated package I was able to re-cluster and restart things.
[15:13:58] I don't think this really teaches us anything for the next point release since this really shouldn't happen, but lmk next time we need an upgrade and I'll follow along.
[15:19:37] wow, seriously? even the minor Debian micro patch releases matter
[15:19:52] * dcaro :nervouslaugh:
[15:20:00] the only difference between +deb13u1 and +deb13u2 is a single patch applied
[15:20:13] thanks for sorting, TIL
[15:20:19] It shouldn't; it's possible something else is going on.
[15:21:13] I just know that 2002-dev was refusing to cluster with 2001-dev until I did a full reinstall on both. It really could've been anything.
[15:21:21] But I suspect the 'refusing to cluster' thing was why the restart was hanging.
[15:27:46] this one should be ready to be merged/deployed and actually needed before I could test anything on toolsbeta
[15:27:53] https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/92
[15:27:58] if anyone has time for a review
[15:29:01] lgtm
[15:31:03] the README is not totally clear on how to deploy that
[15:49:46] it seems the haproxy is down, looking
[15:53:13] dcaro: I set a status in the main channel. Anything folks can help with?
[15:53:28] thanks, it seems haproxy is still getting traffic though
[15:53:39] bd808: can you confirm you can't reach some tool?
[15:54:02] https://sal.toolforge.org/ and https://bash.toolforge.org/ both timed out for me
[15:54:32] seems like the ProbeDown alert for haproxy has been flapping this afternoon
[15:55:19] `curl https://bash.toolforge.org/` from dev.toolforge.org is looking like it will timeout.
[15:55:26] we are hitting the frontend session limits?
https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/infra-k8s-haproxy?var-interval=30s&orgId=1&from=now-30h&to=now&timezone=browser&var-cluster=P8433460076D33992&var-host=tools-k8s-haproxy-8&var-backend=$__all&var-frontend=$__all&var-server=$__all&var-code=$__all&refresh=5m&viewPanel=panel-45
[15:57:19] yep, looks like it
[15:57:27] resources-wise the VM itself is not hitting any limits, so we can safely increment that
[15:57:45] yep, agree
[15:58:02] so let's start on that, and then look at the traffic doing that. making a patch
[15:58:10] 👍
[16:00:14] I noticed that the number of sessions did increase yesterday when there was more incoming traffic, then went down, but it took a while to do so
[16:00:24] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1206902/
[16:01:08] are you running PCC?
[16:01:17] I can, one second
[16:01:52] looks good from what I can see but pcc is always nicer to have for confirmation
[16:03:00] https://puppet-compiler.wmflabs.org/output/1206902/7638/
[16:03:24] +1ed
[16:03:59] taavi, we're in the toolforge meeting talking about you
[16:04:13] e.g. "I think taavi merged it already"
[16:04:27] ah, I completely forgot that the meeting is happening now
[16:04:42] (well, by 'completely' I mean I remembered it 10 minutes ago)
[16:05:19] new limit is live
[16:05:35] let's see what else breaks down the stack now :D
[16:37:41] follow-up: T410421
[16:37:41] T410421: Add paging alert if Toolforge HAProxy connection limit is reached - https://phabricator.wikimedia.org/T410421
[16:37:51] andrewbogott: changed the graph using the values, I can still clearly see when there were issues so I think it's ok, wdyt? https://grafana-rw.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&from=now-12h&to=now&timezone=utc&var-cluster_datasource=P8433460076D33992&var-cluster=tools&viewPanel=panel-46
[16:38:29] I'm pretty sure it's better although the scale is crazy with those spikes :)
[16:39:30] yep, you can still select whichever code you are interested (or hide the one with big spikes)
[16:56:33] volans: added the grep command in the wiki here https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#HAProxy_%28external_service_access%29
[16:56:44] feel free to rephrase/add/remove :)
[16:56:47] (wiki style)
[16:56:55] thanks!
[16:57:26] looks good
[17:45:23] * dhinus off
[18:09:06] * dcaro off
[18:09:09] cya tomorrow!
[23:52:25] Cloudvirt1071 just crashed and rebooted, which produced some toolforge emails. I've moved the haproxy node off of 1071 in case it crashes again.
[23:53:04] also: T410470
[23:53:04] T410470: cloudvirt1071 crash - https://phabricator.wikimedia.org/T410470
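As a closing note on the session-limit incident, a hedged sketch of checking live session counts against the configured limit directly on the haproxy node, assuming the runtime admin socket sits at the Debian default path (that path, and having socat installed, are assumptions, not taken from the chat):

```bash
# CurrConns vs Maxconn from the haproxy runtime API
echo "show info" | socat stdio /run/haproxy/admin.sock | grep -Ei '^(maxconn|currconns):'
```

If CurrConns is sitting at or near Maxconn, new frontend sessions queue or time out, which matches the symptoms seen above before the limit was raised.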