[07:21:30] greetings
[07:33:06] yesterday T407586 and I had a good chat and we're finally getting somewhere
[07:33:07] T407586: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586
[07:35:33] tl;dr: it looks like on newer kernels the "optimal i/o size" from drives attached to that controller gets misreported, which in turn makes the raid10's reported "optimal i/o size" about 4GB, which confuses lvm into creating a big metadata area based on that size, which in turn confuses grub into allocating 4GB when scanning for lvm
[07:57:40] I'd like to roll ingress-nginx forward again in tools: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1085
[08:00:57] taavi: +1, AIUI should be pretty quick to verify if things are working as expected (?)
[08:01:11] yeah
[08:16:48] ack
[08:28:29] T410470 affected the active tools haproxy node, so we got what I think was the first actual test of the new keepalived failover config there (which worked exactly as intended)
[08:28:29] T410470: cloudvirt1071 crash - https://phabricator.wikimedia.org/T410470
[08:36:48] very nice
[08:38:32] \o/
[08:57:36] for T343885, which changes some metric and label names, I'd like to merge the prometheus config (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1203427) and alert updates (https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/46) at the same time, and then go around updating dashboards
[08:57:37] T343885: [promethus,haproxy] Move to haproxy internal metrics from haproxy_exporter - https://phabricator.wikimedia.org/T343885
[09:01:28] okok, thanks for the heads up
[09:03:15] I'll wait until then to tackle T410421, then
[09:03:16] T410421: Add paging alert if Toolforge HAProxy connection limit is reached - https://phabricator.wikimedia.org/T410421
[09:26:17] paws down alert triggered, looking
[09:27:15] 3 nodes are marked as not ready: 1, 2 and 4
[09:27:49] "Kubelet stopped posting node status"
[09:29:18] I'll reboot one and see if it comes up, and I'll try to ssh to another
[09:31:02] that seemed to work, I'll reboot another, and try to debug the third
[09:31:11] +!
[09:31:12] +1
[09:31:51] ssh seems to be hanging for node-4 :/ (that was my shoot at debugging it)
[09:31:57] *shot
[09:32:24] I'll reboot it, and try to do some debugging post-reboot
[09:32:59] fyi, I'm rebooting with `root@cloudcumin1001:~# cookbook wmcs.vps.instance.force_reboot --project paws --cluster-name eqiad1 --vm-name paws-127c-uwce57bvcgrt-node-4`
[09:33:07] node 2 up and running
[11:11:51] * taavi still looking for reviews for the prometheus exporter change patches linked above
[11:22:48] thanks dhinus. I apologize in advance for any possible false alerts that are about to happen
[11:22:56] ack
[11:31:09] sigh, the elastic boxes are old enough that the metrics are different
[11:32:38] https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/50
[11:37:34] I did not understand that you wanted a review, sorry
[11:38:18] yeah, I should have been clearer about that :/
[11:39:06] +1d that one
[12:56:00] ok, I think I was at least able to work around trixie failing to boot on r450: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1207150
[12:57:11] looking at https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts made me think: should we also move the toolsdb alerts that are currently stored in mariadb on metricsinfra-controller-2 there?
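For reference on the misreporting described at 07:35:33 (and worked around in the 12:56:00 patch), the "optimal i/o size" values the kernel reports can be read straight out of sysfs. This is a minimal sketch, not part of the actual fix; it assumes a Linux host with the standard sysfs layout, and the 64 MiB "suspicious" threshold is an arbitrary illustration, not anything taken from the task:

```
#!/usr/bin/env python3
"""Print the I/O size hints the kernel reports for each block device."""
from pathlib import Path

SUSPICIOUS = 64 * 1024 * 1024  # arbitrary illustration threshold (64 MiB)

for dev in sorted(Path("/sys/block").iterdir()):
    queue = dev / "queue"
    try:
        optimal = int((queue / "optimal_io_size").read_text())
        minimum = int((queue / "minimum_io_size").read_text())
    except (FileNotFoundError, ValueError):
        continue  # device without the usual queue attributes
    flag = "  <-- suspiciously large" if optimal >= SUSPICIOUS else ""
    print(f"{dev.name}: optimal_io_size={optimal} minimum_io_size={minimum}{flag}")
```

If the raid10 device shows a value in the GiB range while the member drives look sane, that would match the chain of effects described above.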
[12:57:46] SGTM
[12:58:01] probably best to have all toolforge rules in one place, yeah
[12:58:27] godog: neat. that feels like something to report upstream as well
[12:59:04] taavi: indeed, I'm glad I got something reportable as a bug to Debian now
[13:00:59] godog: awesome job!
[13:03:18] thank you <3 took the better part of yesterday and I got some new knowledge on how all of this is supposed to work
[13:04:43] taavi: I created T410505
[13:04:43] T410505: Move all alerts to the toolforge/alerts git repo - https://phabricator.wikimedia.org/T410505
[13:11:56] there's also one for toolsbeta, but I think we don't have a toolsbeta prometheus, right?
[13:12:06] so that one will have to stay in mariadb
[13:13:52] ah, we do have one actually! https://prometheus.svc.beta.toolforge.org/
[13:14:28] I guess the same toolforge/alerts rules are deployed to both tools and toolsbeta
[13:14:52] yes, unless specifically configured not to in the file (https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/#filtering-instances)
[13:15:01] great, thanks!
[13:50:25] godog: congrats, that was a big rabbit hole xd
[13:51:00] dcaro: lol thank you, indeed a big rabbit hole
[14:29:38] * godog errand
[14:33:26] godog: this is great work, thank you for diving back in!
[14:39:15] Is someone already on top of the object storage quota thing?
[14:44:35] taavi: I've reimaged most of the cloudvirts and now in syslog I'm seeing a lot of
[14:44:39] "dropped over-mtu packet: 1500 > 1450"
[14:44:50] I don't see that in syslog on the remaining bookworm hosts.
[14:45:03] I'm worried this means that my reimaging has somehow undone some of your mtu adjustments
[14:45:38] hmmm no, it's per-VM, some have 1500 mtu and some 1450 mtu
[14:45:47] so maybe this was expected and the trixie logs are just noisier
[14:45:56] that seems possible
[14:46:08] does that error message have any other detail?
[14:46:50] not really. It's just a wave of
[14:46:52] https://www.irccloud.com/pastebin/ThFb5Hiq/
[14:46:59] cloudvirt1065 is an example, if you want to look
[14:47:07] well you do have the interface name there
[14:47:16] yes :)
[14:47:37] which can be matched to a neutron port iirc, so pretty easy to see if stop-starting that instance makes those warnings go away
[14:47:45] so this is just another case of 'we need to reset those VMs'
[14:49:27] I'd treat it as that unless proven otherwise
[14:50:22] great. I only noticed because of T410470, where a warning like that was a server's last gasp before crashing. But I didn't think it was related to the crash anyway.
[14:50:22] T410470: cloudvirt1071 crash - https://phabricator.wikimedia.org/T410470
[15:03:24] so, I think I'm ready to deploy to toolsbeta the new infra-tracing stuff (and implicitly the logging and registry-admission components too)
[15:03:59] I wonder if it would be ok though to have it deployed just on toolsbeta for a day before messing with tools, or whether once deploying it's important to go all the way in a shorter timeframe
[15:04:21] if nothing depends on it I think it's ok
[15:04:37] yeah, I think that's fine as it's not deployed at all to tools yet
[15:04:39] nobody is working on it either (so no risk of someone deploying a different version of it)
[15:05:42] ack, dcaro you were suggesting to deploy it manually, do I have to reverse engineer the cookbook or do we have the manual steps listed somewhere?
[15:06:26] why manually?
[15:06:44] do we have a loki ingestion dashboard on grafana to see that I'm not breaking the existing logging instance?
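The "dropped over-mtu packet" warnings discussed around 14:44–14:50 include the tap interface name, so a rough per-interface tally shows which instances are affected. This sketch assumes the messages land in /var/log/syslog and roughly match the line quoted at 14:44:39; the exact line layout in the pastebin is not verified here:

```
#!/usr/bin/env python3
"""Count "dropped over-mtu packet" warnings per tap interface on a cloudvirt."""
import re
from collections import Counter

LOG = "/var/log/syslog"  # assumed location on the cloudvirt
# Assumed shape: "... tapXXXXXXXX-XX: dropped over-mtu packet: 1500 > 1450"
PATTERN = re.compile(r"(tap[0-9a-f-]+).*dropped over-mtu packet: (\d+) > (\d+)")

counts = Counter()
with open(LOG, errors="replace") as fh:
    for line in fh:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1  # tally per tap interface

for iface, hits in counts.most_common():
    print(f"{iface}\t{hits}")
```

Per the 14:47:37 comment, each interface name can then be matched back to a neutron port, and from there to the instance that may need a stop/start.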
[15:06:56] a quick search didn't find it
[15:06:56] manually as in running `./deploy.sh infra-tracing` on the control node
[15:07:26] you can use the cookbook if you want, it will try to run all the tests unless you skip them (iirc with `--skip-tests`), you'll have to pass the branch name
[15:07:38] and use the component `infra-tracing`
[15:08:01] you'll need to deploy the ingress-admission first if it's not there yet though
[15:08:12] yes
[15:08:17] that's deployed
[15:09:20] * volans always gets confused between ingress-admission and registry-admission
[15:10:13] ingress-admission I think I already deployed the other day, the registry-admission change is part of the same MR
[15:10:30] and needs to be deployed first
[15:11:28] but maybe it's safer if someone wants to pair with me for this? :)
[15:11:37] sure
[15:16:04] if you want/have time, let me know when is a good time :)
[15:24:27] we can do it in the checking meeting in 5 min
[15:25:07] sure, although I have no meeting in my calendar, care to send me the meet link please? :)
[15:27:46] done :)
[15:28:49] hx
[15:28:51] thx
[16:05:35] taavi: when you talk about stop/starting a VM, are you doing that with virsh or with the openstack APIs, or some other way? And did you/we decide if we want to just do that to every VM with a mismatched MTU or only as needed?
[16:05:54] cookbook wmcs.vps.instance.stop_start
[16:06:20] restarting instances can be disruptive, so so far I've only done that on VMs where I know it's been safe
[16:06:27] I don't think we made a decision on doing that globally
[16:09:04] ok!
[16:09:18] * andrewbogott will look to see what that cookbook does exactly
[16:09:39] it is very complicated
[16:09:47] cold start?
[16:10:07] self.openstack_api.server_stop(self.vm_name)
[16:10:07] self.openstack_api.server_start(self.vm_name)
[16:10:11] :D
[16:27:53] taavi, still interested in examples of VMs that won't migrate (https://phabricator.wikimedia.org/T408543#11377820)? I have about a dozen, a surprising number
[16:59:31] andrewbogott: I guess you didn't see my reimage, did you?
[16:59:47] ah dang, did I step on your toes?
[16:59:54] I just wanted to see it work for myself :)
[17:00:22] not sure, did the cookbook fail due to a locking error?
[17:00:34] but anyways sure, the reimage I launched is finishing
[17:00:37] no, but it was confused by the icinga state so I aborted it
[17:01:05] So I will leave it alone, let me know when it finishes?
[17:01:14] ok will do
[17:02:02] godog: also, I self-merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1207202
[17:02:31] ack, yes I had the same patch out too lolz
[17:03:39] i thought you'd gone for the day, otherwise I would be coordinating better :)
[17:05:05] no worries, I am in fact about to go shortly
[17:05:19] andrewbogott: all yours, feel free to reimage
[17:11:43] * dhinus off
[17:44:24] * volans off
[18:39:39] bd808: do you think there is something we should do about T409493?
[18:39:39] T409493: Toolforge interwiki link handling no longer strips URL-encoding before redirecting when it previously did, breaking existing on-wiki links - https://phabricator.wikimedia.org/T409493
[18:46:15] * dcaro off
[19:19:16] taavi: I don't really know what there is to do except maybe a cloud+wikitech-l announce that there was a bug that let query strings pass through and that functionality won't be returning?
[19:19:38] I don't think there really is a safe way to put the mis-feature back
[19:21:03] We could try to do something like scan for an encoded `?` and url decode from there to the end of the string, but that would not be guaranteed to produce functional URLs
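A sketch of the idea floated at 19:21:03: find the first percent-encoded `?` and URL-decode from there to the end of the string. This is an illustration of why it is not guaranteed to produce functional URLs, not a proposed patch; the example inputs are made up:

```
#!/usr/bin/env python3
"""Illustrate decoding a path from the first encoded '?' onwards."""
from urllib.parse import unquote


def maybe_restore_query(path: str) -> str:
    """Decode everything from the first percent-encoded '?' to the end."""
    marker = path.lower().find("%3f")
    if marker == -1:
        return path
    # Anything the tool legitimately wanted to keep encoded after this point
    # (a literal "?", "&" or "%" in a value) gets decoded too, which is the
    # ambiguity that makes this unsafe in the general case.
    return path[:marker] + unquote(path[marker:])


for sample in (
    "some-tool/page%3Ftitle%3DFoo%26action%3Dhistory",  # works as hoped
    "some-tool/a-literal-%3F-in-a-path-segment",        # silently mangled
):
    print(sample, "->", maybe_restore_query(sample))
```

The second example is the caveat in a nutshell: there is no way to tell an encoded query string apart from an encoded character that was meant to stay in the path.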