[01:27:25] andrewbogott: T380531 -- it looks like horizon is missing some templates for the interfaces tab output
[01:27:26] T380531: /project/instances/{instance id}/?tab=instance_details__interfaces returns HTTP 500; Interfaces tab never finishes loading - https://phabricator.wikimedia.org/T380531
[01:29:02] Also, the "openstack horizon eqiad1" dashboard in logstash is busted.
[01:50:37] bd808: does that break the whole dang instance panel or only show up when you click on that specific tab?
[01:50:50] just that tab
[01:51:27] to the end user there is just an infinite spinner
[01:51:55] oh good :) That tracks, I added in some network bells+whistles in anticipation of the upcoming ipv6 network addition, probably missed a piece.
[01:51:59] Thanks for the bug
[01:52:03] you have to open your browser's network traffic debug stuff to see the actual error
[01:53:44] figures :(
[08:41:25] investigating alert about no writes to galera
[08:43:58] cloudcontrol1005 | wsrep_last_committed | 1070411385 |
[08:44:08] cloudcontrol1006 | wsrep_last_committed | 1070411666 |
[08:44:19] cloudcontrol1007 | wsrep_last_committed | 1070411666 |
[08:46:47] I created T387828
[08:46:48] T387828: openstack galera no recent writes 2025-03-04 - https://phabricator.wikimedia.org/T387828
[08:57:18] morning
[08:58:34] o/
[09:06:57] I think I will need to force-reboot cloudcontrol1005, as it seems mariadb won't stop even when requesting a reboot
[09:14:28] okok, that's all the alerts around?
[09:15:07] yeah
[09:15:14] ack
[09:17:15] dcaro: I don't have access to pws at the moment
[09:17:28] could you please force-reboot cloudcontrol1005 from the mgmt console?
[09:17:34] ack
[09:17:47] ssh cumin
[09:18:19] sudo install_console cloudcontrol1005.mgmt.eqiad.wmnet
[09:18:28] then
[09:18:28] serveraction powercycle
[09:18:45] you may want to do `console com2` first to see where it is stuck
[09:18:54] (it should be mariadb)
[09:19:47] it just says shutdown in progress
[09:19:55] (systemctl trying to stop the service)
[09:20:04] for 2h
[09:20:06] yeah
[09:20:15] [ *** ] Job mariadb.service/stop running (2…h 49min 51s): Shutdown in progress
[09:20:19] hit it with the powercycle
[09:20:48] hopefully there is no underlying disk problem or similar
[09:22:04] what's the command?
[09:23:54] there you go `serveraction powercycle`
[09:25:28] booting up now
[09:26:23] great, thanks
[09:26:52] you should have ssh now I think
[09:29:50] I think something is borked
[09:29:55] Mar 04 09:29:30 cloudcontrol1005 designate-sink[2291]: 2025-03-04 09:29:30.830 2291 ERROR oslo.messaging._drivers.impl_rabbit [None req-2a71569a-b372-466c-81b7-0709c9c4a882 - - - - - -] Connection failed: [Errno -3] Temporary failure in name resolution (retrying in 7.0 seconds): socket.gaierror: [Errno -3] Temporary failure in name resolution
[09:30:19] https://www.irccloud.com/pastebin/OkvBt2pk/
[09:31:01] https://www.irccloud.com/pastebin/bbzLsRHY/
[09:31:13] cloudcontrol1006 has the same nameserver and it works
[09:31:33] mmmm
[09:31:51] the reboot cookbook (was still running) has not yet detected the server being back online
[09:31:56] it seems to be missing the gateway
[09:32:15] wait no
[09:32:22] https://www.irccloud.com/pastebin/O52aCEFW/
[09:32:30] and 1005
[09:32:35] https://www.irccloud.com/pastebin/cGhBNcx5/
[09:33:12] that looks ok, right?
[09:33:25] why do the two servers have different nameservers?
[09:33:55] they have the same, 10.3.0.1, different gateway
[09:34:05] 1005 can't ping its gateway
[09:34:08] https://www.irccloud.com/pastebin/7TL6icBd/
[09:34:45] side note, I'm back to having pws access
[09:34:54] awesome :)
[09:35:13] oops, conflicting console access
[09:35:18] I think we are in the same session, I'll let you do it
[09:35:24] ok, thanks!
[09:35:29] yeah, I can take it from here
[09:35:32] thanks for the assistance
[09:36:08] something wrong with the network may explain the galera problem
[09:41:59] topranks, XioNoX is it concerning that a cloudswitch port is marked as down? https://usercontent.irccloud-cdn.com/file/49dlCqhG/image.png
[09:42:56] arturo: on which device?
[09:43:01] depends on what it’s supposed to be connected to
[09:43:14] cloudsw1-c8-eqiad
[09:43:15] but if it's marked down and it's not supposed to be down, then yes :)
[09:43:38] arturo: we don’t have any switch links at 1G so that’s incorrect somehow
[09:43:39] it is supposed to be UP, there is a server connected, expected to be serving traffic :-P
[09:43:54] sorry reading it wrong
[09:43:58] Last flapped : 2025-03-04 09:23:59 UTC (00:19:29 ago)
[09:44:14] yeah if the port just went down it’s a worry
[09:44:38] looks like the port went down, usually a cable issue, or the host it's connected to went down, or (less likely) a NIC issue
[09:45:27] XioNoX: so we had a database alert on the host, I could ssh to it. Then I rebooted, and upon reboot the server can't reach the gateway.
That's the timeline
[09:46:12] looks like it went down, then up, then back down
[09:46:31] could be useful to know the host/OS status from IDRAC
[09:46:42] the OS is up and running
[09:46:58] ok, then you can file a DCops task for investigation
[09:47:00] this is cloudcontrol1005.eqiad.wmnet
[09:47:05] ok, thanks
[09:47:20] I'd bet on a loose cable
[09:47:33] great, thanks for the assistance
[11:38:18] * dcaro lunch
[12:50:24] dcaro: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1124421
[13:37:18] tools-legacy-redirector looks very unstable in the past week
[13:37:33] not sure what's causing that
[13:39:08] I'll try rebooting the VM
[13:55:15] dhinus: T385908 <-- this is the ticket we have been using so far
[13:55:15] T385908: toolforge-legacy-redirector: constant failed probes by prometheus - https://phabricator.wikimedia.org/T385908
[13:57:39] arturo: thanks!
[14:04:27] np
[14:04:30] * arturo food
[14:32:11] have you ever seen cadvisor failing to start in lima-kilo with "Failed to create a manager: could not detect clock speed from output"?
[14:32:41] I'm not sure if it's connected to the upgrade or not
[14:40:31] that does not ring any bells for me
[14:41:31] might be mac related https://github.com/google/cadvisor/issues/2237
[14:55:45] interesting, I wonder if it ever worked on my machine :D
[14:56:53] there's no clock speed at all in /proc/cpuinfo inside the vm
[14:59:41] probably this is the issue: https://github.com/google/cadvisor/issues/2237#issuecomment-1401923321
[15:03:53] I could try using the ARM binaries for kind and node images, but it looks like it only affects cadvisor and everything else is working fine
[15:21:36] we don't usually check metrics in lima-kilo yet, should be ok I think
[16:00:13] dhinus: oh! I missed the xkcd link you passed (saw it right when clicking close xd), can you reshare?
[16:00:19] ha sure https://xkcd.com/1205/
[16:00:26] (it was still in my clipboard :D)
[16:01:06] on a different topic, my restart of toolforge-legacy-redirector did not help, the probes are still failing
[16:03:28] yeah :-( I think at this point we have tested all the easy solutions, and none of them have worked
[16:30:54] * arturo offline
[17:16:19] I'm adding a silence for the tools-legacy-redirector alerts because they are flapping like crazy
[17:57:13] there's the https extra redirection we can try to changi