[03:17:41] FIRING: [4x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[06:35:53] ack, thank you Cole I'll take a look
[07:17:41] FIRING: [4x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[09:59:45] godog: re. Grafana LDAP sync issue - I am not sure what you have changed (I haven't been renamed, have I?), but I can log in now
[09:59:53] thanks :)
[10:00:56] Southparkfan: sure no problem! yeah any user rename right now crashes the script, I've opened T374190 to followup and fix
[10:00:57] T374190: grafana-ldap-users-sync breaks on renamed users - https://phabricator.wikimedia.org/T374190
[10:01:14] ah sorry, *any* rename breaks the script
[10:01:26] wrongly assumed you meant *my account* was renamed haha
[10:01:56] hehe yeah that's right any rename
[10:39:03] Prometheus question - do I understand correctly that it can handle an HTTP redirect (303 presumably) from one /metrics endpoint to another?
[10:40:10] context: Ceph has a Prometheus exporter that listens on port 9283 and hostname:9283/metrics is the expected endpoint for Prometheus to talk to.
[10:40:38] But: in any given cluster there is more than one mgr host that might be serving the metrics, but only one will be active at once.
[10:41:49] Currently the standby servers will either return nothing or an error code (you can pick the error code); or if you visit just http://hostname:9283/ then you get a small HTML page with a "Metrics" link pointing at the active endpoint.
[10:42:55] ...but it would be possible (I think!) to have the standby /metrics endpoints instead return an HTTP redirect to the active endpoint. AIUI Prometheus can be told to honour these (follow_redirects: true), but I wanted to check this was actually a sensible thing to do before I start patching Ceph ;-)
[10:51:09] what's upstream recommendation in this case? and what are other ceph users doing in the same situation?
[10:51:42] the redirect standby -> active will presumably lead to prometheus effectively scraping the active metrics twice I think
[10:52:03] assuming prometheus is configured to scrape all ceph hostnames individually
[10:52:39] Emperor: ^
[10:57:14] I guess that was my other question - is it better to just point prometheus at all the mgrs and have some of them return nothing, or point it at one and have that redirect it appropriately?
[10:58:03] [Ceph can run its own Prometheus storage & dashboard out of containers, but we're not doing that because we already have suitable infrastructure :) ]
[10:58:23] godog: Emperor: We are in the same boat with cephosd100[1-5], except we have 4 standby mgr servers and one active. It strikes me that it might be OK just to scrape all five.
[10:58:32] generally the former yeah, prometheus knows about all hostnames and scrapes them
[10:58:51] some might not return metrics and that's fine
[10:59:38] Great! That makes things simple for us, I think.
[11:00:03] Cool, I'll do that rather than faff with redirects.
[11:00:08] Thanks godog :)
[11:00:16] sure np! thanks for reaching out
[11:02:41] FIRING: [4x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[11:03:18] Awesome. Thanks also godog. Happy Friday.
[11:03:58] cheers btullis, happy friday too
[11:12:41] RESOLVED: [4x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[21:25:55] FIRING: SystemdUnitFailed: opensearch_2@production-elk7-codfw.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:35:25] RESOLVED: SystemdUnitFailed: opensearch_2@production-elk7-codfw.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
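
[editor's note] For reference, a minimal sketch of the "scrape all the mgrs" approach settled on in the 10:57-11:00 exchange, using the cephosd100[1-5] hostnames and port 9283 mentioned above. This is an illustration only, not the actual production scrape configuration; job name and layout are assumptions.

    # Scrape every Ceph mgr directly; standbys may return no metrics (or an
    # error) while inactive, which Prometheus tolerates as a failed scrape.
    scrape_configs:
      - job_name: ceph
        metrics_path: /metrics      # the exporter's endpoint on each mgr
        follow_redirects: true      # Prometheus default; only relevant if the
                                    # standby mgrs were patched to redirect to
                                    # the active one, which was not pursued
        static_configs:
          - targets:
              - cephosd1001:9283
              - cephosd1002:9283
              - cephosd1003:9283
              - cephosd1004:9283
              - cephosd1005:9283

With this layout only the active mgr produces samples, and the standbys simply show up as empty or failed scrapes, matching godog's "some might not return metrics and that's fine".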