[07:58:01] <_joe_> chaomodus: yeah we're advocating using go mod nowadays, to vendor all deps
[07:58:09] <_joe_> in the repository
[13:27:47] <_joe_> I am working on a mtail horror for T226815, and I had a question about prometheus metrics naming
[13:27:57] ask away
[13:28:09] <_joe_> should I use http_request_total as the guidelines seem to indicate
[13:28:14] <_joe_> or namespace it?
[13:28:25] <_joe_> like httpd_http_request_total
[13:29:11] hm, okay so, where do you expect to use this mtail library? just appservers?
[13:30:07] <_joe_> yes
[13:30:29] <_joe_> else I wouldn't have asked :P
[13:30:51] I'm leaning towards httpd_ as a prefix, it describes what is being monitored
[13:31:12] the prefix also matches what we've done for other similar services, e.g. varnish
[13:31:18] and I think we might want a metric label that describes what the httpd in question is for
[13:31:19] so +1
[13:31:41] maybe, to start with, just the cluster of the machine?
[13:32:33] <_joe_> uh isn't that already attached to all metrics?
[13:32:41] <_joe_> also
[13:32:57] depends on how the job is configured by puppet, typically yes cluster is there already
[13:33:18] <_joe_> I'm not sure I get what is being done here: https://github.com/wikimedia/puppet/blob/production/modules/mtail/files/programs/haproxy.mtail#L10-L84
[13:33:28] oh yeah godog is right; I was thinking of the metrics exported by mtail on the central syslog hosts, which are different for reasons
[13:33:39] <_joe_> I mean I know, histograms are not supported by our version of mtail
[13:33:43] <_joe_> so we're simulating one
[13:34:02] oh my
[13:34:05] <_joe_> but this does something different than what a histogram (in maths at least) does
[13:34:14] <_joe_> if a request takes 1 ms
[13:34:21] <_joe_> it will be counted in all buckets
[13:34:32] yeah I think that is just wrong
[13:34:36] <_joe_> I don't think this is the intended behaviour of histograms right?
[13:34:38] <_joe_> ok
[13:34:40] <_joe_> thanks :D
[13:35:10] <_joe_> I felt like I misunderstood everything about histograms in prometheus
[13:35:36] if you added a 'stop' per case there I think it would work as intended
[13:35:43] <_joe_> yeah
[13:35:49] <_joe_> or well, no?
[13:35:52] er no, 'stop' ends the whole process
[13:35:55] <_joe_> stop stops the whole process
[13:36:02] we should just upgrade mtail :P
[13:36:10] <_joe_> hold your horses
[13:36:20] <_joe_> remember the issue I linked yesterday?
[13:36:26] <_joe_> we will need to rethink testing
[13:36:49] <_joe_> --one-shot is being removed
[13:36:54] <_joe_> anyways
[13:37:03] <_joe_> for now I'll work around it with a lot of verbosity
[13:37:28] prometheus histogram buckets count requests that took <= the amount of the bucket
[13:37:34] so that seems correct to me
[13:37:59] <_joe_> godog: sigh seriously? that makes no sense mathematically or statistically
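A minimal Python sketch of the cumulative bucket behaviour described above at 13:37:28, i.e. one observation incrementing every le bucket whose bound it fits under, which is also what the haproxy.mtail program emulates. The metric name and bucket bounds are invented for illustration, not taken from the actual program.

import math

# Bucket upper bounds in seconds, i.e. the le="..." labels, plus the implicit +Inf bucket.
BUCKET_BOUNDS = [0.005, 0.05, 0.5, 1.0, 5.0, math.inf]

def observe(counts, seconds):
    """Count one observation in every cumulative bucket it fits under."""
    for le in BUCKET_BOUNDS:
        if seconds <= le:
            counts[le] += 1

counts = {le: 0 for le in BUCKET_BOUNDS}
for duration in (0.001, 0.3, 2.0):   # a 1 ms request ends up in every bucket
    observe(counts, duration)
for le, n in counts.items():
    label = "+Inf" if math.isinf(le) else le
    print(f'request_duration_seconds_bucket{{le="{label}"}} {n}')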
[13:38:18] good thing we're not doing math but operations :P
[13:38:20] <_joe_> seriously why even have a +inf bucket then
[13:38:32] that's all requests I think, I'm double checking
[13:39:20] yeah I think so, I'm looking at https://prometheus.io/docs/concepts/metric_types/#histogram
[13:40:17] <_joe_> godog: not according to what I'm reading https://prometheus.io/docs/practices/histograms/ here
[13:41:17] <_joe_> godog: so basically haproxy is trying to mimic _bucket{le=""}
[13:41:28] <_joe_> with <basename>_bucket{""}
[13:41:48] <_joe_> which is very different from what histograms do
[13:42:09] <_joe_> so ok, we're trying to reproduce the UI, somewhat
[13:43:59] I'm not sure, which UI?
[13:45:02] <_joe_> the query one, but sorry I'm in a call now
[13:46:23] ok, anyways I'm fairly sure what haproxy and varnishbackendtiming are doing is correct
[14:02:15] <_joe_> yeah, I am looking at what prometheus does, and basically they chose to optimize storage for the most common query they receive
[14:04:25] <_joe_> they're doing cumulative binning, which - even assuming no sampling or binning artifacts - is perfect to calculate quantiles, and bad for calculating anything else
[14:05:28] <_joe_> I was fooled by the query syntax
[14:10:11] yeah afaik that's the main (only?) use case for the histograms in this context
[14:36:46] <_joe_> I'm not sure, given how buckets are created, how this https://grafana.wikimedia.org/d/0fj55kRZz/thumbor?panelId=35&fullscreen&edit&orgId=1 can have values like "1.18s"
[14:37:12] <_joe_> I'm pretty sure I don't want to look at how histogram_quantile calculates that
[14:39:32] <_joe_> "The histogram_quantile() function interpolates quantile values by assuming a linear distribution within a bucket."
[14:39:40] * _joe_ cries in a corner
[14:40:20] it is not that bad an approximation if your buckets are at all sensible
[14:40:35] <_joe_> 1 - big assumption
[14:40:53] <_joe_> 2 - that is something you can only evaluate as long as you know how your data are distributed
[14:41:16] <_joe_> say your data are distributed along a binomial distribution
[14:41:32] <_joe_> for at least some of your buckets you're going to make gross errors
[14:41:41] <_joe_> but I know we don't care about precision that much
[14:41:44] <_joe_> this is not science
[14:42:09] i think in most cases it will be pretty obvious looking at heatmaps when your bucketing is off
[14:42:34] i'm not an actual scientist though ;)
[14:43:18] <_joe_> don't get me wrong, most things we track usually somewhat follow a maxwell distribution, and we're interested in the long tails anyways
[14:43:26] <_joe_> so it's ok, I guess
[14:43:48] performance folks might be interested in those caveats though, I was thinking
[14:45:04] <_joe_> so, in the case of thumbor, I'd say we might want to rethink the bucketing
[14:45:39] <_joe_> and for instance, for mediawiki it would make sense to have different buckets for different entrypoints in the future
[14:46:34] <_joe_> because in fact, endpoints are so different from one another that they should be treated separately.
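A rough Python approximation of the interpolation the histogram_quantile() docs describe at 14:39:32: find the bucket the requested rank falls into and assume observations are spread linearly inside it. This is a simplified sketch under that assumption; the real PromQL function treats the lowest and +Inf buckets somewhat differently.

import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]          # the +Inf bucket holds all observations
    rank = q * total
    lower_bound, count_below = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if rank <= cumulative:
            if math.isinf(upper_bound):
                return lower_bound  # cannot interpolate into the +Inf bucket
            in_bucket = cumulative - count_below
            # Assume observations are spread linearly inside this bucket.
            return lower_bound + (upper_bound - lower_bound) * (rank - count_below) / in_bucket
        lower_bound, count_below = upper_bound, cumulative
    return lower_bound

# Example: 40 requests under 0.5 s, 40 more under 1 s, 20 slower than 1 s.
print(histogram_quantile(0.75, [(0.5, 40), (1.0, 80), (math.inf, 100)]))  # ~0.94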
[14:46:52] yeah that should be possible, on different metric names though, with the same metric name I think we'll get confused
[14:47:28] <_joe_> anyways, I don't know how much value there is in knowing that the 75th percentile of responses was 1.24 s instead of 1.20 s though
[14:47:52] <_joe_> but it means sometimes we reason on things that could be graphing artifacts
[14:48:37] yeah I think that pretty much applies to every number we're looking at on grafana, it isn't an analytics or billing platform
[14:48:57] <_joe_> yeah but, lemme make an example
[14:49:45] <_joe_> say a lot of our requests take between 1.7 and 1.95 seconds, so that the bucket under 2 seconds contains 70% of requests
[14:51:31] <_joe_> It's easy, if your next bucket is larger, say from 2 to 5 seconds
[14:51:40] <_joe_> to grossly overestimate the 75th percentile
[14:51:50] <_joe_> in a significant way
[14:52:11] <_joe_> just because the distribution shifted 0.2 seconds, you could see a bump of 1 second or so
[14:52:44] <_joe_> ok I'll stop my auto-nerd-sniping about statistics here
[22:05:55] Puppet/Icinga CR for anyone who feels like reviewing it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/520643
[22:15:26] lol I left comments on the Python script but then realized you just copied it from elsewhere
[22:16:16] haha yeah saw your update
[22:16:26] the puppet part looked fine to me fwiw
[22:17:33] thanks!
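Putting numbers on the bucketing example from 14:49, assuming 100 requests in total; the counts and bucket bounds are illustrative, not real thumbor or mediawiki data.

# 70 of 100 requests finish just under 2 s; the next bucket spans 2-5 s.
# The true 75th percentile is ~1.95 s, but linear interpolation inside the
# 2-5 s bucket places it much higher.
total = 100
count_le_2s = 70     # cumulative count of the le="2" bucket
count_le_5s = 100    # cumulative count of the le="5" bucket
rank = 0.75 * total  # the 75th observation falls in the 2-5 s bucket
estimate = 2.0 + (5.0 - 2.0) * (rank - count_le_2s) / (count_le_5s - count_le_2s)
print(estimate)  # 2.5 -> overestimates the real ~1.95 s p75 by about half a second
# Had the le="2" bucket held 76 requests instead, the estimate would land at or
# just under 2 s: a ~0.2 s shift in real latencies moves the graphed p75 by ~0.5 s.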