[06:17:39] anyone knows where the "wmf - metamonitoring - thanos - notified - vip is now DOWN" pag.e comes from? Is this critical? I'm currently reading docs in wikitech about metamonitoring and deadmanswitch
[06:20:20] jelto: I'm going to check ... yesterday we had a false positive due to a silence in Karma/Alertmanager.
[06:20:37] great thanks
[06:23:14] there is a silence 0fcc6f25-3881-4073-a0c5-cdec04290212 for DeadManSwitch which was created 17 hours ago during the wikikube upgrade. maybe this silence has to be removed again?
[06:26:17] no jelto, will silence everything except the deadman’s switch alerts ... checking
[06:26:36] I mean 0fcc6f25-3881-4073-a0c5-cdec04290212 will silence everything except the deadman’s switch alerts
[06:28:57] Ah you are right, I missed the !=
[06:29:21] So it's probably just the gunicorn/python application crashing?
[06:35:13] I think this is a bad behavior of HetrixTools, which in some cases doesn’t seem to respect the configured timing.
[06:54:32] I added the stack trace to https://phabricator.wikimedia.org/T397003 just in case :)
[06:57:59] jelto: ack, thanks. Anyway, I can confirm that in case of a timeout (due to the exception), HetrixTools does not wait for the retries...
[06:59:57] that seems a bit sad
[07:03:25] I agree Emperor ... I'll run some tests today (changing the routing key to avoid further noise) to see if other parameters can prevent this behavior.
[07:03:53] thanks :)
[13:28:56] jelto: Emperor, sorry for the noise this morning. The false positive was caused by an issue that was (incorrectly) assumed to be already handled by the existing configuration, combined with HetrixTools' default behavior of not respecting the retries setting in case of a timeout. I submitted a patch to replace Gunicorn with uWSGI and I’m actively working on the HetrixTools
[13:28:58] parameters to avoid such false alerts.
[13:33:18] thanks for the update
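
Note on the `!=` point raised at 06:26-06:28: a rough sketch of why a silence with a negative matcher covers everything except the dead man's switch. The log does not show the silence's actual matchers, so the label name `alertname`, the exact matcher, and the example alert `SomeOtherAlert` are assumptions for illustration only. The helper `silence_applies` is hypothetical and only mimics the rule that a silence applies when all of its matchers hold, with a missing label treated as an empty string.

```python
# Hypothetical model of equality / negative-equality silence matchers.
# Not the real Alertmanager code; regex matchers (=~, !~) are left out.

def silence_applies(labels, matchers):
    """Return True if every (name, op, value) matcher holds for the labels."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label counts as empty
        if op == "=" and actual != value:
            return False
        if op == "!=" and actual == value:
            return False
    return True

# Assumed shape of the silence discussed above: "everything but DeadManSwitch".
matchers = [("alertname", "!=", "DeadManSwitch")]

dead_man = {"alertname": "DeadManSwitch"}
other = {"alertname": "SomeOtherAlert"}  # hypothetical alert name

print(silence_applies(dead_man, matchers))  # the switch itself is not silenced
print(silence_applies(other, matchers))     # any other alert is silenced
```

Under that assumption the silence leaves the dead man's switch firing, which fits the conclusion in the log that the silence was not the cause of the page.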