[07:08:35] FIRING: ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[07:08:47] ^^ it's me
[07:13:35] RESOLVED: [2x] ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[08:45:40] godog: just to be sure, is Monitoring::Host an exported resource like Nagios_host?
[09:37:45] volans: yes, I ran the same puppetdb query
[09:38:22] volans: https://phabricator.wikimedia.org/P77810
[09:38:45] doesn't need to be exported afaics but you get the idea
[09:40:38] I don't doubt the query works, I was wondering if the fact that nagios_host is exported was giving us some additional certainty
[09:40:55] probably not but wanted to confirm
[09:49:36] my understanding is the same yeah, since effectively what we're checking for is whether puppet has run on the host and thus the resources are in puppetdb
[09:50:33] but that's not what we do
[09:50:59] we check that puppet has compiled the manifest so that exported resources are in puppetdb and will be picked up by icinga
[09:51:49] so that we can run the downtime cookbook with --force-puppet and be sure that all the new checks will be there so that they can be silenced before triggering
[09:52:19] so if we remove icinga from the picture the whole workflow will change, not just the puppetdb query
[09:52:43] while, as long as we still have icinga, I assume there will be a Nagios_host resource, as any icinga check will need to live within a host definition
[09:53:09] s/we check that puppet has compiled/we check that puppetserver has compiled/ (to be clear)
[09:53:26] I see, I get what you mean now
[09:56:26] given that, I think we can pause on removing nagios_host and switch when we have the alertmanager-only silencing in place, also because that doesn't really need to wait (i.e. we can create the silences whenever)
[10:00:08] another bit we do now is that after the reimage we check/wait for icinga optimal and, if optimal, remove the downtime. It would be nice to be able to do the same on the alertmanager side
[10:02:48] indeed, I think we'll have to rework the semantics of "host in optimal state" to mean "we are not seeing alerts related to the host within a time period", which we can do with a prometheus query on the ALERTS meta-metric
[10:03:25] sure
[10:04:50] I've reworded https://phabricator.wikimedia.org/T395449 with details from this discussion btw
[10:05:31] great, thx
[10:06:16] I'll also work on the "wait for optimal" bits
[10:11:13] it might also be enough if we can query alertmanager to check if there is any alert currently firing (including/excluding silenced ones based on a flag) for a given host/instance
[10:42:33] that's true yeah
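
For reference, the puppetdb check discussed above (the actual query is in P77810 and not reproduced here) boils down to asking whether an exported Nagios_host resource for the host is already in PuppetDB, i.e. whether puppetserver has compiled a catalog for it. A minimal sketch in Python, assuming a hypothetical PuppetDB endpoint and the standard /pdb/query/v4/resources API; the query the cookbook really runs may differ:

```python
import json

import requests

# Hypothetical PuppetDB endpoint; the real one lives behind the production puppetdb hosts.
PUPPETDB_URL = "https://puppetdb.example.org/pdb/query/v4/resources"


def host_catalog_in_puppetdb(certname: str) -> bool:
    """Return True if PuppetDB holds an exported Nagios_host resource for the host,
    meaning puppetserver has compiled its catalog and icinga will pick up its checks."""
    query = [
        "and",
        ["=", "type", "Nagios_host"],
        ["=", "exported", True],
        ["=", "certname", certname],
    ]
    resp = requests.get(PUPPETDB_URL, params={"query": json.dumps(query)}, timeout=10)
    resp.raise_for_status()
    return len(resp.json()) > 0
```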
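The reworked "host in optimal state" check mentioned at 10:02:48 could be expressed as a query on the ALERTS meta-metric. A rough sketch, assuming a hypothetical Thanos/Prometheus query endpoint and that host-scoped alerts carry an instance label; the exact label matching and lookback window would need tuning:

```python
import requests

# Hypothetical Thanos/Prometheus query endpoint.
PROMETHEUS_URL = "https://thanos-query.example.org/api/v1/query"


def host_alert_free(instance: str, window: str = "30m") -> bool:
    """Return True if the ALERTS meta-metric shows no firing alert for the
    given instance over the lookback window."""
    promql = (
        f'max_over_time(ALERTS{{instance=~"{instance}(:[0-9]+)?", '
        f'alertstate="firing"}}[{window}])'
    )
    resp = requests.get(PROMETHEUS_URL, params={"query": promql}, timeout=10)
    resp.raise_for_status()
    # An empty result set means no alert with that instance label fired in the window.
    return not resp.json()["data"]["result"]
```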
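For the variant at 10:11:13, querying alertmanager directly for alerts currently firing on a given host, the Alertmanager v2 API already supports filtering by label matcher and by silenced state. A sketch under the same assumptions (hypothetical endpoint, host identified via the instance label):

```python
import requests

# Hypothetical Alertmanager endpoint exposing the v2 API.
ALERTMANAGER_URL = "https://alertmanager.example.org/api/v2/alerts"


def firing_alerts_for_host(instance: str, include_silenced: bool = False) -> list:
    """Return the currently active alerts whose instance label matches the host,
    optionally including silenced ones."""
    params = {
        "active": "true",
        "silenced": "true" if include_silenced else "false",
        "inhibited": "false",
        "filter": f'instance=~"{instance}(:[0-9]+)?"',
    }
    resp = requests.get(ALERTMANAGER_URL, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()
```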