[01:49:48] FIRING: PuppetFailure: Puppet has failed on alert1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:49:48] FIRING: PuppetFailure: Puppet has failed on alert1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:49:48] FIRING: PuppetFailure: Puppet has failed on alert1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:52:53] o/ how can I do something like graphite highestMax with prometheus&grafana? topk returns more series than expected and I'm not seeing something like topk_max available
[12:09:32] dcausse: highestMax returns a scalar?
[12:48:13] akosiaris: it filters the series with top X but looking at the max value for each in the period (https://graphite.readthedocs.io/en/latest/functions.html#graphite.render.functions.highestMax)
[12:49:21] ended up using topk(10, ...) anyways with prometheus, it draws more than 10 lines but it's OK for me
[12:50:23] if there's a way to limit to the actual top 10 looking at the max over the time period I'm interested
[12:58:47] it draws more than 10? that hasn't been my experience. Got a link to dashboard handy?
[13:25:29] akosiaris: sure, for instance: https://grafana-rw.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?forceLogin&orgId=1&editPanel=88
[13:25:55] seems to take the top 10 for every data point and then accumulate the series
[13:32:20] Ah, I see your point. You expected only 10 types at the end but you see 14. So, this first rates, then sums by type, ending up having N time series vectors (N=24 in this case to be precise) and then just picks the highest 10 of them per interval. But what gets in the top 10 per interval isn't the same across the entire timespan so you get 14 out of the 24.
[13:49:48] FIRING: PuppetFailure: Puppet has failed on alert1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:04:13] I can't find a way to not have it behave like that. It does make sense overall that if you want top N per a big timespan you get >= N because there is bound to be some anomaly in the data somewhere
[14:04:49] the only thing I can think is if you don't have to graph a time series but just an instant (e.g. with the stat panel)
[14:05:22] you would probably have to use a subquery
[14:08:13] checked https://stackoverflow.com/questions/38783424/prometheus-topk-returns-more-results-than-expected but I barely understand the few workarounds
[14:13:18] ah, there is https://prometheus.io/blog/2021/02/18/introducing-the-@-modifier/
[14:13:21] but that's not what you want, I think
[14:13:56] the `topk_max` that VictoriaMetrics has looks nice, but we aren't running that ofc
[14:14:03] cdanis: I was looking at https://www.robustperception.io/graph-top-n-time-series-in-grafana/ but that's too obscure for me :/
[14:14:30] interesting
[14:15:46] dcausse: do you want the K with the highest mean over the time window, or the highest maximum instant value, or ?
[14:16:22] I want the k with the highest max in the period
[14:17:07] I could go with the highest mean tho
[14:26:20] dcausse: I added a var https://grafana-rw.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?forceLogin&orgId=1&editview=templating&editIndex=2
[14:27:05] you can reference it in a query like {type=~"$max_qps_types"}
[14:27:59] cdanis: thanks! will try this now
[14:28:22] there's some guesses at what you wanted in that query :)
[14:29:48] RESOLVED: PuppetFailure: Puppet has failed on alert1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:31:00] also, it will likely be slow over long time ranges, some of that is possible to fix and some of that is intrinsic
[14:37:52] cdanis: seems like the var is set with the first value of the array by default, checked multi-value but only one is selected, I have to display the var on the dashboard and explicitly select all of them by hand
[14:38:01] uh hm
[14:38:49] dcausse: ok try refreshing
[14:39:03] https://i.imgur.com/zVBQx9x.png I think this is what was needed
[14:40:16] same, what seems to work is: display the var, explicitly select "all" then hide again
[14:40:25] heh
[14:40:27] hmm
[14:41:19] I guess just leave it displayed then
[14:42:45] cdanis: sure, thanks for the workaround! <3
[14:47:32] this @ end() syntax starting with 2.25.0 is exactly what I was thinking on how to do it, but alas
[14:47:36] it's pretty neat
[14:48:25] I actually like the original behavior and I think usually the issue is that people set K way too high and also don't display any percentiles or other aggregates 😇
[14:48:59] wait, this was introduced at 2.25 and we run 2.48
[14:49:02] wait, this is just https://wikitech.wikimedia.org/wiki/User:CDanis_(WMF)/Use_more_heatmaps
[14:52:22] akosiaris: could not get @ end() to work tho, it selected an instant but if you find a way that might be a nice work-around too
[14:58:37] @ modifier is disabled by default and can be enabled using the flag --enable-feature=promql-at-modifier
[14:58:41] we need that
[14:58:49] from what I see, we don't pass it
[14:59:31] I wonder whether they stop guarding it behind a feature flag at some point
[15:00:14] strange... I got a query doing something different with it
[15:09:18] it works but does not solve my topk issue, all the values displayed are the last one, e.g. topk(10, histogram_quantile(0.5, avg(rate(mediawiki_CirrusSearch_request_time_seconds_bucket{search_cluster="eqiad"}[5m] @ end())) by (type, le))*1000)
[15:10:16] dcausse: yeah, it's a solution to a different problem
[15:11:13] yes...
[16:07:01] actually found an ugly workaround with @ end(): sum by (type) (rate(mediawiki_CirrusSearch_request_time_seconds_count{search_cluster="eqiad"}[5m])) + on (type) (topk(10, sum by (type) (rate(mediawiki_CirrusSearch_request_time_seconds_count{search_cluster="eqiad"}[5m] @ end())))*0)
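
Note on the topk behaviour discussed above: a minimal sketch in plain Python (not PromQL) of why a per-step top K over a time range can draw more than K series, and how ranking by the maximum over the whole window keeps exactly K. The series names and values are made up for illustration, and `topk_per_step` and `topk_by_max` are hypothetical helpers, not Prometheus or Grafana functions.

```python
from typing import Dict, List, Set

# Each series is a list of values, one per evaluation step (assumed equal length).
Series = Dict[str, List[float]]


def topk_per_step(series: Series, k: int) -> Set[str]:
    """Names that reach the top k at ANY step, i.e. what a graph ends up drawing
    when top-k is evaluated independently at every step of the range."""
    n_steps = len(next(iter(series.values())))
    shown: Set[str] = set()
    for i in range(n_steps):
        ranked = sorted(series, key=lambda name: series[name][i], reverse=True)
        shown.update(ranked[:k])
    return shown


def topk_by_max(series: Series, k: int) -> Set[str]:
    """Names of the k series with the highest maximum over the whole window
    (the highestMax-style selection asked for in the log)."""
    ranked = sorted(series, key=lambda name: max(series[name]), reverse=True)
    return set(ranked[:k])


# Made-up data: four series, three steps, and the leaders change at every step.
data: Series = {
    "a": [9.0, 1.0, 1.0],
    "b": [8.0, 8.0, 2.0],
    "c": [1.0, 9.0, 3.0],
    "d": [2.0, 2.0, 7.0],
}

per_step = topk_per_step(data, 2)
by_max = topk_by_max(data, 2)

print(sorted(per_step))  # more than 2 names, because the top 2 differs per step
print(sorted(by_max))    # exactly 2 names
```

With K=2 the per-step selection accumulates all four series, while the whole-window ranking keeps two. The same effect explains seeing 14 of 24 types with topk(10, ...) in the dashboard above. Note that the final workaround in the log ranks by the rate at the end of the range, not by the maximum over it.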