[08:11:52] investigating this alert
[08:11:53] Service virt.cloudgw.eqiad1.wikimediacloud.org:0 has failed probes
[08:12:46] oh, went away
[08:15:04] that was T375362
[08:15:04] T375362: ProbeDown virt.cloudgw.eqiad1.wikimediacloud.org:0 failed when probed by icmp_virt_cloudgw_eqiad1_wikimediacloud_org_ip4 from codfw. Availability is 50%. - https://phabricator.wikimedia.org/T375362
[08:21:46] ack
[08:48:32] morning. cloudvirt1063 failed during the weekend, I was paged and put it out of service (T375223)
[08:48:32] T375223: NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T375223
[08:48:42] thanks for handling that btw
[08:48:55] I'm not sure if I should try restarting it, or just ask dcops to look at it
[08:49:34] is there anything in the logs?
[08:49:50] well when I looked it was in state shutdown
[08:50:12] and I haven't tried restarting it yet. I guess it might be worth trying to see if it comes up and have a look at the logs
[08:50:34] I will try that
[08:51:27] unless we have clear logs that tell us what was the issue, I think we should double check with dcops see if they have ideas
[08:51:56] can you log in to alerts.wikimedia.org? it fails with "Unauthorized" for me
[08:52:00] the last time the error was T368007
[08:52:00] T368007: NodeDown (cloudvirt1063) - https://phabricator.wikimedia.org/T368007
[08:52:08] (logs were found in syslog)
[08:52:43] I'm logged in, let me try logging out (if I find the button)
[08:53:25] I logged in ok
[08:54:04] hmmm same error with Grafana, if I try to log in. maybe I'm hitting a different server?
[08:54:34] idm.wikimedia.org is also showing "500 Internal Server Error"
[08:55:09] idp is working fine and shows me "Log In Successful"
[09:06:33] I don't see errors on idm.w.o, logged out and in without problem
[09:06:45] ok now it's working
[09:06:56] no idea what happened
[09:14:47] it's annoying that the alert keeps firing even if the node is now in "maintenance", it would be nice if it changed from "page" to "warning". also we have 3 alerts for the same thing (one page, one critical, one warning :P)
[09:14:56] I will open a task later to simplify that
[09:15:36] I'll make a coffee then I'll try to restart the node and check the logs
[09:41:13] that is in case it fires for a long time, or a short time right? (as in, it starts as a warning, if it keeps failing it 'escalates' as page)
[09:47:43] dcaro: something like that, down(warn)->downforlong(critical)->down(page), but I want to double check the definitions
[09:47:52] meanwhile the server is booting
[09:47:57] let's see if I can ssh
[09:49:10] I'm in
[09:51:54] similar error log to the previous time
[09:52:15] yep, probably dcops can help there
[09:53:56] I've added a silence filtering just the host to catch any new alerts that pop up and such
[09:54:07] thanks
[09:54:19] (should clear if there's no new alerts in ~1day)
[09:55:36] I'm thinking if there is a way to disable alerts automatically for hosts that are in "maintenance"
[09:56:08] I also wonder how production handles similar situations (a node going down in a cluster)
[10:00:46] hmm do you know why T375223 was created by faultfinder on Sep 19th, but the actual crash seems to be on Sep 21st?
[10:00:47] T375223: NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T375223
[10:06:15] I opened a dcops task T375372
[10:06:16] T375372: hw troubleshooting: server failure for cloudvirt1063.eqiad.wmnet - https://phabricator.wikimedia.org/T375372
[10:09:04] ipmi-sel shows a "Thermal Trip" event, confirming a.ndrew's theory it's overheating
[10:21:31] ack
[10:21:33] * dcaro lunch
[15:54:30] * arturo offline
[17:43:32] * dcaro off