[07:08:35] FIRING: ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[07:08:47] ^^ it's me
[07:13:35] RESOLVED: [2x] ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[08:45:40] godog: just to be sure, is Monitoring::Host an exported resource like Nagios_host?
[09:37:45] volans: yes, I ran the same puppetdb query
[09:38:22] volans: https://phabricator.wikimedia.org/P77810
[09:38:45] doesn't need to be exported afaics but you get the idea
[09:40:38] I don't doubt the query works, I was wondering if the fact that nagios_host is exported was giving us some additional certainty
[09:40:55] probably not but wanted to confirm
[09:49:36] my understanding is the same yeah, since effectively what we're checking for is whether puppet has run on the host and thus the resources are in puppetdb
[09:50:33] but that's not what we do
[09:50:59] we check that puppet has compiled the manifest so that exported resources are in puppetdb and will be picked up by icinga
[09:51:49] so that we can run the downtime cookbook with --force-puppet and be sure that all the new checks will be there so that they can be silenced before triggering
[09:52:19] so if we remove icinga from the picture the whole workflow will change, not just the puppetdb query
[09:52:43] while, as long as we still have icinga, I assume there will be a Nagios_host resource, as any icinga check will need to live within a host definition
[09:53:09] s/we check that puppet has compiled/we check that puppetserver has compiled/ (to be clear)
[09:53:26] I see, I get what you mean now
[09:56:26] given that, I think we can pause on removing nagios_host and switch when we have the alertmanager-only silencing in place, also because that doesn't really need to wait (i.e. we can create the silences whenever)
[10:00:08] another bit we do now is that after the reimage we check/wait for icinga optimal and, if optimal, remove the downtime. It would be nice to be able to do the same on the alertmanager side
[10:02:48] indeed, I think we'll have to rework the semantics of "host in optimal state" to mean "we are not seeing alerts related to the host within a time period", which we can do with a prometheus query on the ALERTS meta-metric
[10:03:25] sure
[10:04:50] I've reworded https://phabricator.wikimedia.org/T395449 with details from this discussion btw
[10:05:31] great, thx
[10:06:16] I'll also work on the "wait for optimal" bits
[10:11:13] it might also be enough if we can query alertmanager to check if there is any alert currently firing (including/excluding silenced ones based on a flag) for a given host/instance
[10:42:33] that's true yeah
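
For reference, the puppetdb check discussed above (the actual query is in P77810 and not reproduced here) boils down to asking whether an exported Nagios_host resource for the host is already in PuppetDB, i.e. whether puppetserver has compiled a catalog for it. A minimal sketch in Python, assuming a hypothetical PuppetDB endpoint and the standard /pdb/query/v4/resources API; the query the cookbook really runs may differ:

```python
import json

import requests

# Hypothetical PuppetDB endpoint; the real one lives behind the production puppetdb hosts.
PUPPETDB_URL = "https://puppetdb.example.org/pdb/query/v4/resources"


def host_catalog_in_puppetdb(certname: str) -> bool:
    """Return True if PuppetDB holds an exported Nagios_host resource for the host,
    meaning puppetserver has compiled its catalog and icinga will pick up its checks."""
    query = [
        "and",
        ["=", "type", "Nagios_host"],
        ["=", "exported", True],
        ["=", "certname", certname],
    ]
    resp = requests.get(PUPPETDB_URL, params={"query": json.dumps(query)}, timeout=10)
    resp.raise_for_status()
    return len(resp.json()) > 0
```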
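The reworked "host in optimal state" check mentioned at 10:02:48 could be expressed as a query on the ALERTS meta-metric. A rough sketch, assuming a hypothetical Thanos/Prometheus query endpoint and that host-scoped alerts carry an instance label; the exact label matching and lookback window would need tuning:

```python
import requests

# Hypothetical Thanos/Prometheus query endpoint.
PROMETHEUS_URL = "https://thanos-query.example.org/api/v1/query"


def host_alert_free(instance: str, window: str = "30m") -> bool:
    """Return True if the ALERTS meta-metric shows no firing alert for the
    given instance over the lookback window."""
    promql = (
        f'max_over_time(ALERTS{{instance=~"{instance}(:[0-9]+)?", '
        f'alertstate="firing"}}[{window}])'
    )
    resp = requests.get(PROMETHEUS_URL, params={"query": promql}, timeout=10)
    resp.raise_for_status()
    # An empty result set means no alert with that instance label fired in the window.
    return not resp.json()["data"]["result"]
```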
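For the variant at 10:11:13, querying alertmanager directly for alerts currently firing on a given host, the Alertmanager v2 API already supports filtering by label matcher and by silenced state. A sketch under the same assumptions (hypothetical endpoint, host identified via the instance label):

```python
import requests

# Hypothetical Alertmanager endpoint exposing the v2 API.
ALERTMANAGER_URL = "https://alertmanager.example.org/api/v2/alerts"


def firing_alerts_for_host(instance: str, include_silenced: bool = False) -> list:
    """Return the currently active alerts whose instance label matches the host,
    optionally including silenced ones."""
    params = {
        "active": "true",
        "silenced": "true" if include_silenced else "false",
        "inhibited": "false",
        "filter": f'instance=~"{instance}(:[0-9]+)?"',
    }
    resp = requests.get(ALERTMANAGER_URL, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()
```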