[09:09:13] * dhinus paged: NodeDown cloudvirt1063
[09:09:33] I'll have a look in a minute
[09:22:07] there are a few more non-paging alerts: bastion-eqiad1-03 is down, tools-k8s-worker-nfs-66 is down
[09:31:05] the same host failed a few months ago: T368093
[09:31:06] T368093: reapply thermal paste to processors in cloudvirt1063 - https://phabricator.wikimedia.org/T368093
[09:33:10] mgmt interface is saying "Server power status: OFF"
[09:33:43] I could try restarting it but I don't trust it to stay on, I will take it out of service instead
[09:40:04] I'm not finding docs for how to do it, the cookbook "wmcs.openstack.cloudvirt.drain" might or might not work, I'll try
[09:43:31] the cookbook is not working. I'll try to find out how to do it manually from a cloudcontrol
[09:49:09] I did "openstack aggregate add host maintenance cloudvirt1063" and "openstack aggregate remove host ceph cloudvirt1063", hoping it was the right thing
[10:06:03] I found https://docs.openstack.org/nova/2024.1/admin/evacuate.html#evacuate-all-instances
[10:06:23] which suggests running "nova host-evacuate FAILED_HOST"
[10:13:44] it's not working, it's failing to authenticate
[10:17:21] ok I just had to add --os-username, --os-password and other params as specified in https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Openstack_cli#command-line_flags
[10:28:46] the vms were moved to other cloudvirts, but they seem to be in state "SHUTOFF"
[10:30:16] I'm restarting them manually with "openstack server start "
[10:30:43] (they were all in status=ACTIVE as shown by https://phabricator.wikimedia.org/T375223#10165616)
[10:48:24] they're all back up, details in T375223
[10:48:24] T375223: NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T375223
[10:50:50] I will wait until monday to discuss what to do next, likely opening a ticket with DCops
[11:42:51] dhinus: I merged in T375323 to that, I'm not sure why you need 2 automated tasks tbh
[11:42:52] T375323: NodeDownForLong Node cloudvirt1063 has been down for long. - https://phabricator.wikimedia.org/T375323
[11:43:06] NodeDown and NodeDownForLong seem redundant
[13:46:48] RhinosF1: thanks, I agree they're redundant, I'll check if one can be removed
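
Note: since the log says no docs were found for taking a failed cloudvirt out of service by hand, the commands quoted above can be collected in order as a dry-run sketch. This is only a recap of what was typed in the channel, not a vetted procedure. The function name `manual_drain_commands` and its parameters are hypothetical, the server IDs are placeholders, and the authentication flags that `nova` needed (--os-username, --os-password and others) are deliberately left out because the log does not list them in full. Nothing is executed; the commands are only printed.

```python
import shlex


def manual_drain_commands(failed_host, server_ids,
                          from_aggregate="ceph", to_aggregate="maintenance"):
    """Return the commands quoted in the log, in order, as argument lists.

    Hypothetical helper: it only builds the command lines, it runs nothing.
    """
    cmds = [
        # [09:49:09] move the host between aggregates so nothing new lands on it
        ["openstack", "aggregate", "add", "host", to_aggregate, failed_host],
        ["openstack", "aggregate", "remove", "host", from_aggregate, failed_host],
        # [10:06:23] evacuate all instances from the failed host
        # (in the log this also needed explicit --os-* credential flags)
        ["nova", "host-evacuate", failed_host],
    ]
    # [10:30:16] evacuated VMs came back as SHUTOFF and were started one by one
    cmds += [["openstack", "server", "start", sid] for sid in server_ids]
    return cmds


if __name__ == "__main__":
    # "<server-id>" is a placeholder, not a real instance ID
    for cmd in manual_drain_commands("cloudvirt1063", ["<server-id>"]):
        print(shlex.join(cmd))
```

Keeping the commands as argument lists rather than one shell string means they could later be passed to `subprocess.run` without re-quoting, should anyone want to turn this into an actual script.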