[18:46:22] anyone fluent in grafana formulas?
[18:46:37] rpki_requests_total{status="unreachable"} + rpki_requests_total{status="unreachable"} works
[18:46:56] rpki_requests_total{status="invalid"} + rpki_requests_total{status="invalid"} works
[18:47:19] but rpki_requests_total{status="invalid"} + rpki_requests_total{status="unreachable"} returns "N/A"
[18:48:58] you likely want something like rpki_requests_total{status="invalid"} + ignoring(status) rpki_requests_total{status="unreachable"}
[18:49:19] or sum(rpki_requests_total{status~="^invalid|unreachable$"})
[18:49:39] err s/~=/=~/
[18:49:58] more details on why this is so at https://prometheus.io/docs/prometheus/latest/querying/operators/#vector-matching
[18:50:38] thanks! will read!
[18:50:40] oh, you could _also_ write scalar(rpki_requests_total{status="unreachable"}) + scalar(rpki_requests_total{status="invalid"}), but I wouldn't
[18:51:15] (it is brittle if more than one thing is ever exporting a rpki_requests_total metric)
[18:52:48] I don't understand the scalar() thing
[18:52:58] and the doc doesn't help me
[18:53:02] https://prometheus.io/docs/prometheus/latest/querying/functions/#scalar
[18:53:23] so rpki_requests_total{status="invalid"} actually evaluates to a vector
[18:53:53] it will be something like: rpki_requests_total{status="invalid", instance="foobar:9001", [maybe some other labels]} = 12345
[18:54:10] if there is exactly one value, scalar will mush that vector into just 12345
[18:54:27] which + will then just add normally, instead of doing the vector label matching described in the first link
[18:54:32] hm, interesting
[18:54:37] I think that's what I was looking for
[18:54:47] thanks!
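(editor's sketch) The label matching discussed above can be modelled in a few lines of Python. This is a toy model, not how Prometheus is implemented: the helper names `match_key`, `add` and `scalar` are made up here, and the `unreachable` value of 5 is an assumed sample. It only aims to show why the plain `+` comes back empty (which Grafana renders as "N/A") while `ignoring(status)` and `scalar()` produce a number.

```python
# Toy model of PromQL one-to-one vector matching. Hypothetical helpers,
# not Prometheus code. An instant vector is a list of (labels, value) pairs;
# the metric name is left out for brevity.

def match_key(labels, ignoring=()):
    """Label set used to pair up samples, minus any ignored labels."""
    return frozenset((k, v) for k, v in labels.items() if k not in ignoring)

def add(lhs, rhs, ignoring=()):
    """Add samples whose match keys are identical; drop everything else.
    (Real PromQL raises an error on duplicate keys; this toy does not.)"""
    right = {match_key(labels, ignoring): value for labels, value in rhs}
    out = []
    for labels, value in lhs:
        key = match_key(labels, ignoring)
        if key in right:
            out.append((dict(key), value + right[key]))
    return out

def scalar(vector):
    """Single-element vector -> plain number, anything else -> NaN."""
    return vector[0][1] if len(vector) == 1 else float("nan")

invalid = [({"status": "invalid", "instance": "foobar:9001"}, 12345)]
# The value 5 below is an assumed sample, not taken from the conversation.
unreachable = [({"status": "unreachable", "instance": "foobar:9001"}, 5)]

plain = add(invalid, unreachable)                          # status differs
ignored = add(invalid, unreachable, ignoring=("status",))  # status ignored
as_scalars = scalar(invalid) + scalar(unreachable)         # no matching at all

print(plain, ignored, as_scalars)
```

With the plain `+`, the two sides never share a full label set, so nothing pairs up. Ignoring `status` leaves only `instance` to match on, and `scalar()` sidesteps matching entirely.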
[18:55:10] if we ever ran multiple rpki exporters, though, we'd have rpki_requests_total{status="invalid", instance="exporter1:9001"}=12345 and also rpki_requests_total{status="invalid", instance="exporter2:9001"}=54321
[18:55:13] and scalar() would break
[18:55:47] i'd use the sum() invocation above, it's the shortest to write :)
[19:00:18] prom's documentation could use more examples
[19:00:43] like it'd be nice if https://prometheus.io/docs/prometheus/latest/querying/basics/ had some sample data, queries, and results
[19:01:26] cdanis: at the end of the day, I'm trying to get a % of a specific label out of all the labels
[19:02:20] eg. % of status="invalid" out of rpki_requests_total{} status X/Y/Z
[19:02:34] not sure what's the cleanest way to do it
[19:08:05] for that I think you want something like: sum(rpki_requests_total{status="invalid"}) / ignoring(status) sum(rpki_requests_total)
[19:08:40] I mean what you *actually* want probably also involves taking a rate() over the past few minutes or so
[19:08:52] since what I gave will give you absolute-value-over-all-time
[19:09:05] yeah, it's the sum part that I was struggling with
[19:09:22] if we added proper recording rules -- which we should! -- then you could just write something very close to the example at https://prometheus.io/docs/prometheus/latest/querying/operators/
[19:09:26] method_code:http_errors:rate5m{code="500"} / ignoring(code) method:http_requests:rate5m
[19:09:54] it would be neat if there was some codegen we had to make such things simple (cc godog)
[19:11:02] :)
[19:13:05] the rules of the style like that -- aggregated_labels:metric:function_and_timespan, e.g. method_code:http_errors:rate5m is a 5-minute rate of http errors aggregated by request method and response code -- are documented in https://prometheus.io/docs/practices/rules/
[19:14:11] anyway, this is something we need to make simpler for ourselves going forward!
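(editor's sketch) A rough Python model of the two points above: `scalar()` stops working once two exporters report the same status, while summing first still gives a usable percentage. The helpers `select`, `total` and `scalar` are hypothetical, and every `valid` sample value is an assumption chosen so the arithmetic comes out round; only the two `invalid` values appear in the conversation.

```python
import math

# Toy instant vector for rpki_requests_total with two exporters.
# Only the two "invalid" values come from the chat; the "valid" rows
# (label value and numbers) are assumptions for illustration.
samples = [
    ({"status": "invalid", "instance": "exporter1:9001"}, 12345.0),
    ({"status": "invalid", "instance": "exporter2:9001"}, 54321.0),
    ({"status": "valid", "instance": "exporter1:9001"}, 300000.0),
    ({"status": "valid", "instance": "exporter2:9001"}, 299994.0),
]

def select(vector, **matchers):
    """Keep samples whose labels equal every given matcher (like {status="..."})."""
    return [(labels, value) for labels, value in vector
            if all(labels.get(k) == v for k, v in matchers.items())]

def total(vector):
    """Model of sum(...) with no 'by' clause: one number, no labels left."""
    return sum(value for _, value in vector)

def scalar(vector):
    """Single-element vector -> number, anything else -> NaN."""
    return vector[0][1] if len(vector) == 1 else float("nan")

invalid = select(samples, status="invalid")

# Two matching series, so scalar() has no single value to return.
broken = scalar(invalid)

# Summing first collapses the exporters, so the ratio is well defined.
invalid_percent = 100 * total(invalid) / total(samples)

print(math.isnan(broken), invalid_percent)
```

As noted in the chat, this is a ratio of all-time counters; a rate over a recent window is probably closer to what a dashboard wants.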
[19:15:08] if we're going to have more and more little services and more stuff in prometheus then it should be as easy as we can make it
[19:17:20] cdanis: https://grafana.wikimedia.org/d/UwUa77GZk/rpki?refresh=5m&orgId=1&from=now-30m&to=now
[19:17:37] I ended up using the "single value" widget with that calculation
[19:17:48] otherwise the % is too low for the graph
[19:18:42] haha, it took one prometheus 10 minutes longer to start scraping, just puppet run skew I bet
[19:18:56] what does '1:100' refer to in the graph title?
[19:19:53] cdanis: the webrequests are sampled
[19:19:57] oh okay sure
[19:20:42] so this is good enough for this use case, I could go into a snag you'd immediately hit if there was more than one rpki_exporter; I can also not if you'd rather just move on with your life ;)
[19:22:06] cdanis: yeah, it's temporary and will not have another exporter, that's why I kept it simple
[19:25:02] thanks for the help!
[19:26:25] (since I can't resist it's that you'd have to do: sum by (status)(rate(...[5m])) and btw always do rate then sum https://www.robustperception.io/rate-then-sum-never-sum-then-rate )
[19:26:30] anyway np!
[19:28:18] one last nit: I love enabling sparklines on singlestats, and have done so in this case
[19:28:25] feel free to turn off if you don't like
[19:39:52] I didn't know it was a thing :)
[19:39:55] looks great
[19:40:07] I also put the number in green so it looks safe
[19:44:51] lgtm :)
[19:45:09] (that kind of thing is a good thing to do in general)
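(editor's sketch) The "rate then sum" advice can be illustrated with a small Python model of counter resets. This is a simplification under stated assumptions: the `increase` and `rate` helpers are hypothetical, the sample values and the 100-second window are invented, and real Prometheus additionally extrapolates to the window edges. The point is only that a restart of one exporter looks like a reset of the whole summed series when you sum first.

```python
# Toy model of why rate() should be applied per series before sum().
# All numbers are invented; real rate() also extrapolates to window edges.

WINDOW_SECONDS = 100  # assumed window covering the three samples below

def increase(series):
    """Counter increase with reset handling: a drop means the counter
    restarted from zero, so the new value counts as the delta."""
    total = 0
    for prev, cur in zip(series, series[1:]):
        total += cur - prev if cur >= prev else cur
    return total

def rate(series):
    return increase(series) / WINDOW_SECONDS

# Two hypothetical exporters; exporter2 restarts before the last sample.
exporter1 = [100, 160, 220]
exporter2 = [500, 560, 20]

# rate then sum: each series handles its own reset correctly.
rate_then_sum = rate(exporter1) + rate(exporter2)

# sum then rate: the restart makes the summed series drop, which is
# treated as a reset of the whole sum and inflates the result.
summed = [a + b for a, b in zip(exporter1, exporter2)]
sum_then_rate = rate(summed)

print(rate_then_sum, sum_then_rate)
```

In this model the per-series increases are 120 and 80, while the summed series reports 360, so the sum-first variant overstates the traffic after a single restart.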