[09:38:49] godog: I have been trying to make sense of the prometheus/alertmanagers stuff that you have been mentioning in the last few days. I created this ticket: T374599
[09:38:49] T374599: cloud: prometheus: investigate weirdness with metrics and alertmanager - https://phabricator.wikimedia.org/T374599
[09:39:02] godog: I would appreciate your input/assistance
[09:41:08] arturo: for sure, checking
[09:41:42] thanks
[09:51:52] so far I have more questions than answers :( both prometheus1005 and prometheus1006 report CephClusterInUnknown as not firing, yet didn't send the recovery according to the logs dashboard
[09:52:04] https://logstash.wikimedia.org/goto/c498d70834ca930e2b6561e1e73b5be9
[09:52:54] for example: why is this happening? unclear at this time
[09:54:03] godog: I'm glad that you see the weirdness too
[09:56:01] heheh wish I could say the same tbh
[10:15:22] arturo: I'll keep investigating and I can't find a smoking gun yet
[10:15:44] godog: thanks, I'll keep an eye in the ticket. Really appreciated
[10:16:37] sure np
[10:43:35] !log wlm-it-visual "Not reachable since 3AM looking at Grafana. Soft reboot from Horizon"
[10:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wlm-it-visual/SAL
[10:51:16] !log admin merging change to keystone wmf hooks https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071230 (T374020)
[10:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:51:21] T374020: openstack: instrument VXLAN-based flat network - https://phabricator.wikimedia.org/T374020