[09:15:48] I think maintain-kubeusers is running just fine, I just checked it
[09:15:53] the problem might be with prometheus
[09:56:26] morning! what's up with tools-redis-7?
[09:59:47] nothing apparently
[09:59:55] I also checked the VM, and seems up and running
[10:00:02] all the exporters are producing data
[10:00:20] there is however an error in the redis-exporter
[10:00:20] prometheus-redis-exporter[194603]: time="2025-01-31T09:27:16Z" level=error msg="Couldn't set client name, err: ERR unknown command `CLIENT`, with args beginning with: `SETNAME`, `redis_exporter`, "
[10:00:28] I don't know what that is about
[10:02:37] hmmm that sounded familiar, and a phab search returns: T366471
[10:02:38] T366471: [toolforge] [redis] Prometheus exporter logging errors - https://phabricator.wikimedia.org/T366471
[10:03:22] so that error message has been present for a long time
[10:03:36] but the metrics were working
[10:12:02] I have created T385262
[10:12:02] T385262: toolforge: alertmanager reports maintain-kubeusers as down, but it isn't - https://phabricator.wikimedia.org/T385262
[10:31:52] left a comment there, there is something weird in the metric data
[10:36:56] the maintain-kubeusers metric is even different
[10:39:24] the alert is NOT firing in https://prometheus.svc.toolforge.org/tools/alerts?search=maintain
[10:39:30] but it is firing in https://alerts.wikimedia.org/
[10:39:52] weird!
[10:40:27] oh maybe it's the fact we have two prom servers
[10:40:32] and they are reporting different things?
[11:01:55] yes it looks like the two servers are out of sync
[11:02:05] now, how do we fix them? :D
[11:02:28] * dhinus tries restarting the systemctl unit
[11:04:04] "Replaying WAL, this may take a while"
[11:05:52] the alert is gone
[11:06:20] * dhinus continues to be surprised that "turning it off and on again" is indeed the fix for most software issues
[11:12:47] the redis alert is still firing, but that's on a separate prom (metricsinfra), I'll do the same restart
[11:15:29] excellent
[11:15:41] maybe create a ticket for the papertrail, in case we see this happening again
[11:16:13] or just rename Logged
[11:16:21] sorry, rename T385262
[11:16:22] T385262: toolforge: alertmanager reports maintain-kubeusers as down, but it isn't - https://phabricator.wikimedia.org/T385262
[11:18:43] yep I'll rename that one
[11:18:54] the restart did not fix it for metricsinfra :/
[11:20:16] ok it just took a little longer
[11:23:48] hmm the alert is gone, but the metric is still out of sync
[11:32:47] I restarted the prometheus unit on the second host, and the alert is firing again
[11:33:02] I guess it depends on which prom server it's querying
[11:33:09] right
[11:33:25] but for the tools-prom servers, restarting one fixed the issue and the metrics are in sync
[11:33:38] for the metricsinfra-prom, the restart was not enough
[11:34:01] I don't remember what is the setup. I wonder if alertmanager should just hit one of the prom servers? and the other acts as a standby?
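A minimal sketch of the cross-check discussed above, assuming the Prometheus replicas expose the standard HTTP query API; the hostnames, port, job label, and systemd unit name are placeholders rather than the real cluster values:

```bash
#!/usr/bin/env bash
# Sketch: compare what each Prometheus replica reports for the
# maintain-kubeusers "up" metric, then restart the unit on the stale host.
set -euo pipefail

QUERY='up{job="maintain-kubeusers"}'   # assumed job label

for host in tools-prometheus-6 tools-prometheus-7; do   # hypothetical hostnames
    echo "== ${host} =="
    # Instant query against the Prometheus HTTP API on each replica.
    # 9090 is the default port; a multi-instance setup may listen elsewhere.
    curl -sG "http://${host}:9090/api/v1/query" \
        --data-urlencode "query=${QUERY}" \
        | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'
done

# If one replica is out of sync, restart its Prometheus unit and watch the
# WAL replay in the journal (unit name is an assumption):
# ssh tools-prometheus-7 'sudo systemctl restart prometheus@tools && sudo journalctl -fu prometheus@tools'
```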
[11:34:16] I'm not sure
[11:34:34] there's also a thanos process on those ones
[11:35:00] I'll try rebooting the entire VM metricsinfra-prometheus-3
[11:39:28] rebooted and the metric is still wrong :/
[12:07:23] dhinus: I think what I'm finding with the live-migration failures is that T383583 is much more widespread than I initially thought, affecting basically all VMs (at least when it comes to live migration)
[12:07:23] T383583: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583
[12:07:44] Doing a cold migration seems to resolve things, I haven't yet found a way to fix it without the VMs rebooting.
[12:08:25] Do you have time to think about this? A starting place is https://www.reddit.com/r/openstack/comments/11ynmy0/best_process_for_replacing_ceph_monitors_in/
[13:51:57] * andrewbogott going to take a break, run errands, and see if a new idea appears
[13:57:39] andrewbogott: if the libvirt definition needs changing, I don't see how that can be done without a reboot :-(
[13:58:14] as far as I know, libvirt doesn't support editing the XML file. Well, you can edit it, but it will only take effect on the next VM boot
[13:58:58] arturo: yeah, that's what all my tests show
[13:59:06] So we need to schedule more reboots.
[13:59:19] I mean, 99% of users don't care about reboots anyway
[13:59:33] yeah
[13:59:42] I guess my question is if this can be done on a self-service
[13:59:47] self-service fashion
[14:00:03] or we need the migration to trigger the XML being generated again?
[14:00:37] good question. I don't think an in-place reboot does it, a 'hard reboot' might but that requires horizon access...
[14:00:45] I need to test all these scenarios
[14:01:25] for toolforge workers, I guess we can do anytime, no?
[14:02:02] yep
[14:15:33] hard reboot works!
[14:15:38] at least, for sample size of 1
[14:20:08] great
[14:51:54] I created T385288
[14:51:54] T385288: Changing the IPs of cloudcephmons should not require VM reboots - https://phabricator.wikimedia.org/T385288
[14:52:48] if you have any ideas on how we could fix that, you can leave a comment there for future reference
[14:54:15] dhinus: commented
[15:02:13] thx dhinus
[15:06:58] thank you both for the comments!
[15:21:31] proposed announce email:
[15:21:34] https://www.irccloud.com/pastebin/JLTmotXF/
[15:25:58] dhinus or arturo, quick proofread of ^ ?
[15:26:08] sure
[15:26:59] andrewbogott: LGTM, thanks
[15:27:31] LGTM, nit: there's a double [1] in the last line
[15:27:42] yep, I caught that and fixed it before I sent
[15:27:48] no doubt introducing a different typo instead
[15:27:52] what a mess
[15:27:53] easy fix though!
[15:28:38] OK, /now/ I am going to go get food and run my errands, unless either of you has anything else I can do for you before your weekend begins.
[15:29:30] when I get back I plan to hack the toolforge worker reboot cookbook to do hard reboot rather than soft, and take care of those worker nodes.
[15:31:10] 👍
[15:34:44] andrewbogott: sounds good, thanks and have a nice weekend!
[15:50:20] it seems like LLMs know a lot about Cloud VPS vs Toolforge: T385064
[15:50:21] T385064: Assess opportunity to migrate from WMFR-OVH server to WMF Toolforge or WMF Cloud VPS - https://phabricator.wikimedia.org/T385064
[16:01:43] LOL!
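A rough sketch of the hard-reboot approach described above (tested here with a sample size of one); the instance ID, domain name, and XML layout are assumptions, and the virsh check is commented out because it runs on the hypervisor rather than a client host:

```bash
#!/usr/bin/env bash
# Sketch: inspect which ceph monitor addresses a VM's libvirt definition was
# built with, then hard-reboot it so nova regenerates the domain XML.
set -euo pipefail

INSTANCE_ID="$1"   # nova instance UUID or name (hypothetical argument)

# On the hypervisor: the RBD disk sources in the domain XML list the monitor
# hosts the VM is actually using; they only change when the XML is rebuilt.
# virsh dumpxml "i-000abcde" | grep -A5 "protocol='rbd'" | grep '<host '

# From a host with admin OpenStack credentials: a hard reboot recreates the
# libvirt definition, which should pick up the current cloudcephmon IPs.
openstack server reboot --hard --wait "${INSTANCE_ID}"
```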
[16:11:02] i untagged our projects, if they're using ChatGPT surely that's all correct and they don't need our slow and manual human help :/
[16:14:20] sounds fair, I'm waiting for chatgpt to suggest asking for help in this channel :)
[16:23:14] yeah... that task is not my favorite
[16:24:25] taavi: if you're still around, do you know how to fix T385262?
[16:24:25] T385262: alertmanager reports maintain-kubeusers and tools-redis-7 as down, but they are up - https://phabricator.wikimedia.org/T385262
[16:33:31] dhinus: what happens when you curl the node-exporter port on the redis node from the prometheus node that's seeing it as down?
[16:33:47] (not in front of a laptop so can't be very helpful atm)
[16:34:05] np, I can try doing some curls for you, but also it can wait until later
[16:34:52] that'd tell if it's a networking issue or a prometheus issue
[16:36:38] networking issue apparently, I cannot ping the redis node from metricsinfra-prom-3, but I can from prom-2
[16:37:10] that's your problem then :-)
[16:38:27] fair enough :D
[17:04:58] andrewbogott: T385291 _may_ contain a reproducible DNS issue
[17:04:59] T385291: DNS resolver not working on Toolforge when loading PHP script via browser - https://phabricator.wikimedia.org/T385291
[17:26:33] * arturo offline
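A minimal sketch of the connectivity check suggested above, run from each metricsinfra prometheus host against the redis node; the FQDN and short hostnames are assumptions based on the chat, and 9100 is the default node-exporter port:

```bash
#!/usr/bin/env bash
# Sketch: from each prometheus host, ping the redis node and try to scrape
# its node-exporter, to tell a networking problem from a prometheus one.
set -euo pipefail

TARGET="tools-redis-7.tools.eqiad1.wikimedia.cloud"   # assumed FQDN

for prom in metricsinfra-prometheus-2 metricsinfra-prometheus-3; do
    echo "== from ${prom} =="
    ssh "${prom}" "ping -c1 -W2 ${TARGET} >/dev/null && echo 'ping ok' || echo 'ping FAILED'"
    # A scrape timeout combined with a working ping would point at a firewall
    # rule rather than routing.
    ssh "${prom}" "curl -s --max-time 5 -o /dev/null -w 'node-exporter HTTP %{http_code}\n' http://${TARGET}:9100/metrics || echo 'scrape FAILED'"
done
```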