[08:11:52] <arturo>	 investigating this alert
[08:11:53] <arturo>	 Service virt.cloudgw.eqiad1.wikimediacloud.org:0 has failed probes 
[08:12:46] <arturo>	 oh, went away
[08:15:04] <arturo>	 that was T375362
[08:15:04] <stashbot>	 T375362: ProbeDown  virt.cloudgw.eqiad1.wikimediacloud.org:0 failed when probed by icmp_virt_cloudgw_eqiad1_wikimediacloud_org_ip4 from codfw. Availability is 50%. - https://phabricator.wikimedia.org/T375362
[08:21:46] <dcaro>	 ack
[08:48:32] <dhinus>	 morning. cloudvirt1063 failed during the weekend, I was paged and put it out of service (T375223)
[08:48:32] <stashbot>	 T375223: NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T375223
[08:48:42] <dcaro>	 thanks for handling that btw
[08:48:55] <dhinus>	 I'm not sure if I should try restarting it, or just ask dcops to look at it
[08:49:34] <dcaro>	 is there anything in the logs?
[08:49:50] <dhinus>	 well when I looked it was in state shutdown
[08:50:12] <dhinus>	 and I haven't tried restarting it yet. I guess it might be worth trying to see if it comes up and have a look at the logs
[08:50:34] <dhinus>	 I will try that
[08:51:27] <dcaro>	 unless we have clear logs that tell us what the issue was, I think we should double check with dcops to see if they have ideas
[08:51:56] <dhinus>	 can you log in to alerts.wikimedia.org? it fails with "Unauthorized" for me
[08:52:00] <dcaro>	 the last time the error was T368007
[08:52:00] <stashbot>	 T368007: NodeDown (cloudvirt1063) - https://phabricator.wikimedia.org/T368007
[08:52:08] <dcaro>	 (logs were found in syslog)
[08:52:43] <dcaro>	 I'm logged in, let me try logging out (if I find the button)
[08:53:25] <dcaro>	 I logged in ok
[08:54:04] <dhinus>	 hmmm same error with Grafana, if I try to log in. maybe I'm hitting a different server?
[08:54:34] <dhinus>	 idm.wikimedia.org is also showing "500 Internal Server Error"
[08:55:09] <dhinus>	 idp is working fine and shows me "Log In Successful"
[09:06:33] <dcaro>	 I don't see errors on idm.w.o, logged out and in without problem
[09:06:45] <dhinus>	 ok now it's working
[09:06:56] <dhinus>	 no idea what happened
[09:14:47] <dhinus>	 it's annoying that the alert keeps firing even if the node is now in "maintenance", it would be nice if it changed from "page" to "warning". also we have 3 alerts for the same thing (one page, one critical, one warning :P)
[09:14:56] <dhinus>	 I will open a task later to simplify that
[09:15:36] <dhinus>	 I'll make a coffee then I'll try to restart the node and check the logs
[09:41:13] <dcaro>	 that is in case it fires for a long time, or a short time right? (as in, it starts as a warning, if it keeps failing it 'escalates' as page)
[09:47:43] <dhinus>	 dcaro: something like that, down(warn)->downforlong(critical)->down(page), but I want to double check the definitions
[09:47:52] <dhinus>	 meanwhile the server is booting
[09:47:57] <dhinus>	 let's see if I can ssh
[09:49:10] <dhinus>	 I'm in
[09:51:54] <dhinus>	 similar error log to the previous time
[09:52:15] <dcaro>	 yep, probably dcops can help there
[09:53:56] <dcaro>	 I've added a silence filtering just the host to catch any new alerts that pop up and such
[09:54:07] <dhinus>	 thanks
[09:54:19] <dcaro>	 (should clear if there are no new alerts in ~1 day)
[09:55:36] <dhinus>	 I'm wondering if there is a way to disable alerts automatically for hosts that are in "maintenance"
[09:56:08] <dhinus>	 I also wonder how production handles similar situations (a node going down in a cluster)
[10:00:46] <dhinus>	 hmm do you know why T375223 was created by faultfinder on Sep 19th, but the actual crash seems to be on Sep 21st?
[10:00:47] <stashbot>	 T375223: NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T375223
[10:06:15] <dhinus>	 I opened a dcops task T375372
[10:06:16] <stashbot>	 T375372: hw troubleshooting: server failure for cloudvirt1063.eqiad.wmnet - https://phabricator.wikimedia.org/T375372
[10:09:04] <dhinus>	 ipmi-sel shows a "Thermal Trip" event, confirming a.ndrew's theory it's overheating
[10:21:31] <dcaro>	 ack
[10:21:33] * dcaro lunch
[15:54:30] * arturo offline
[17:43:32] * dcaro off