[09:28:28] tomorrow's pad is up!
[14:06:59] I got asked (indirectly) to create a prometheus exporter for bacula, and I would like to have a discussion, if possible, on how to best organize that- not time sensitive (before I start to build anything)
[14:08:08] what stats does the bacula master provide already?
[14:08:31] no metrics at all
[14:08:48] my question is to gather ideas on how to translate info into metrics that make sense
[14:09:06] e.g. a label per job
[14:09:20] or some other organization
[14:09:29] how many bacula jobs do we have, approx?
[14:09:34] 92
[14:09:54] oh okay, i was worried that it was thousands
[14:10:21] so I just have a lot of questions on how to best organize that
[14:10:57] also whether I should cache because queries could take a long time, or just create a wrapper around the db, metric organization, etc.
[14:11:28] how long is a long time? :)
[14:11:45] currently icinga takes around 16 seconds
[14:12:04] I could do it much faster, but I would rely on internal db structure, which I wanted to avoid
[14:12:28] (the "api" [sic] is per job)
[14:13:05] basically, I have a bunch of information, but not in a time series
[14:13:25] so I am wondering how to "convert" it into one
[14:15:22] while info on bacula is mostly event-based: X happened at Y time and the result was Z
[14:16:47] from what you said it seems that periodically dumping the metrics would be best, e.g. in a plaintext file like we do for smart; it wouldn't be as general-purpose as an exporter but it'd work
[14:17:06] on the events I'd say a good start would be to export the unix timestamp of the last success of a job
[14:17:27] the unix epoch rather, being the metric value
[14:17:58] "last success of a job" is ok, and I will want that
[14:18:28] but the thing I got asked to have first is "backup success rate"
[14:19:35] ah, yeah in that case using 0 or 1 as values and "job success" as the metric would work
[14:20:03] and a counter that gets increased, for tried, aborted, errored?
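A minimal sketch of the textfile approach discussed above (periodically dumping per-job metrics to a plaintext file in the Prometheus text exposition format, like the smart metrics): the metric names, the `render_metrics` helper, and the job data are illustrative assumptions, not actual Bacula or production code.

```python
# Sketch: emit a last-success timestamp gauge and a last-run-status gauge
# per Bacula job, in the Prometheus text exposition format, suitable for
# the node_exporter textfile collector. Job data here is hypothetical.

def render_metrics(jobs):
    """jobs: list of dicts with 'name', 'last_success_ts', 'last_run_ok'."""
    lines = [
        "# HELP bacula_job_last_success_timestamp_seconds Unix time of last successful run.",
        "# TYPE bacula_job_last_success_timestamp_seconds gauge",
    ]
    for job in jobs:
        lines.append(
            'bacula_job_last_success_timestamp_seconds{job="%s"} %d'
            % (job["name"], job["last_success_ts"])
        )
    lines.append("# HELP bacula_job_success Whether the last run succeeded (1) or not (0).")
    lines.append("# TYPE bacula_job_success gauge")
    for job in jobs:
        lines.append(
            'bacula_job_success{job="%s"} %d'
            % (job["name"], 1 if job["last_run_ok"] else 0)
        )
    return "\n".join(lines) + "\n"

# Hypothetical job data, as it might be gathered from bconsole or the db:
jobs = [
    {"name": "db1001", "last_success_ts": 1571200000, "last_run_ok": True},
    {"name": "web2002", "last_success_ts": 1571100000, "last_run_ok": False},
]
print(render_metrics(jobs))
```

In practice the rendered text would be written to a `.prom` file atomically (write to a temp file, then rename) so the collector never scrapes a half-written file.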
[14:20:11] e.g. bacula_job_success{job="..."} 0/1
[14:20:28] oh, I don't need to count, true
[14:20:31] yeah i think a last-successful-timestamp gauge, and a last-run-status gauge
[14:20:38] prometheus can do it for me
[14:21:27] wait, but how do I reconcile the "job execution rate" vs the "exporter execution rate"
[14:22:03] e.g. in 1 minute 0 jobs could have run, or 5
[14:22:05] if you wanted an actual rate of job success/failure you'd need counters
[14:22:58] indeed
[14:23:23] I will give it a think, think of concrete metrics that would be interesting
[14:23:30] and then ask for feedback
[14:23:34] another route, if this can be easily extracted from bacula logs, is mtail
[14:23:36] i can also imagine something like
[14:23:37] thanks for the suggestions
[14:23:41] a "bytes stored" gauge
[14:23:43] for capacity planning
[14:23:57] maybe the existing filesystem metrics are enough for that, i don't know
[14:23:58] yeah, that is the easy part :-)
[14:24:16] because that is already "aggregated"
[14:24:21] herron: what do you mean?
[14:25:34] oh, I see: logs -> metrics
[14:25:37] we use mtail in some places to gather metrics from log events, where a native exporter isn't available or being used.
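To sketch the counter idea raised above: cumulative per-job counters by terminal status decouple the job execution rate from the scrape rate, because Prometheus's `rate()` works on the monotonic totals no matter how many runs happened between scrapes. The status values and metric name below are assumptions for illustration, not Bacula's actual terminology.

```python
# Sketch: cumulative run counters by (job, status), rendered as a single
# Prometheus counter metric with labels. Statuses here ("ok", "error",
# "aborted") are hypothetical placeholders for Bacula's real job states.
from collections import Counter

def count_runs(history):
    """history: iterable of (job, status) tuples, e.g. from the job log/db."""
    counts = Counter()
    for job, status in history:
        counts[(job, status)] += 1
    return counts

def render_counters(counts):
    lines = [
        "# HELP bacula_job_runs_total Job runs by terminal status.",
        "# TYPE bacula_job_runs_total counter",
    ]
    for (job, status), n in sorted(counts.items()):
        lines.append('bacula_job_runs_total{job="%s",status="%s"} %d' % (job, status, n))
    return "\n".join(lines) + "\n"

history = [("db1001", "ok"), ("db1001", "error"), ("db1001", "ok"), ("web2002", "aborted")]
print(render_counters(count_runs(history)))
```

With that in place, a failure rate could be queried with something like `rate(bacula_job_runs_total{status="error"}[1d])` (illustrative PromQL), sidestepping the 0-or-5-jobs-per-minute reconciliation problem.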
[14:25:43] sadly bacula doesn't have parsable logs
[14:25:49] it is a pull kind of thing
[14:25:54] gotcha
[14:26:07] it has logs, but they are terrible for machine analysis
[14:26:19] *automatic
[14:26:25] a useful metric could be the time a job needs to run, to highlight issues if a job starts taking much longer or shorter; it could even be combined with the size metric to potentially notify of anomalies
[14:26:55] volans: indeed, I wanted that, but I think bacula doesn't have those on the cli
[14:27:05] so I may have to go to the db for those
[14:29:03] the problem is that backup success rate is really not that important- the important thing is to have fresh backups- if one has to be retried, that is still good
[14:29:28] and backups will just fail for the silliest reasons (network, host maintenance, etc.)
[14:29:57] I think last-success-timestamp is the most important thing
[14:30:08] yeah, that is what I check on icinga
[14:30:14] the rates at which jobs are failing or succeeding don't tell you much I think
[14:30:20] the "freshness" of the last good backup with non 0 bytes
[14:30:23] yeah
[14:30:39] but sadly, I also need the success rate because bureaucracy :-)
[14:30:46] jynus: that reminds me of sth I forgot to mention during the review of the bacula icinga alert: using bconsole's 'llist' instead of 'list', as the former has more information
[14:30:59] ah, I may have missed that
[14:31:21] I think I didn't put too much effort in because for the bacula check I only needed some minimum things
[14:31:33] for metrics we can store much more stuff
[14:31:52] indeed
[14:32:11] for the check I just needed "do we have a good backup"?
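The freshness criterion described above ("last good backup with non 0 bytes") can be sketched as a small staleness check; the field names (`job`, `end_ts`, `ok`, `bytes`) are hypothetical placeholders, not Bacula's actual schema.

```python
# Sketch: flag jobs whose most recent good, non-zero-byte backup is older
# than a threshold -- the "fresh backups" property discussed above.
import time

def stale_jobs(runs, max_age_seconds, now=None):
    """runs: list of dicts with 'job', 'end_ts', 'ok', 'bytes'.
    Returns job names with no good non-empty backup within max_age_seconds."""
    now = time.time() if now is None else now
    last_good = {}
    for r in runs:
        # A run only counts as "good" if it succeeded AND stored bytes.
        if r["ok"] and r["bytes"] > 0:
            last_good[r["job"]] = max(last_good.get(r["job"], 0), r["end_ts"])
    all_jobs = {r["job"] for r in runs}
    return sorted(j for j in all_jobs if now - last_good.get(j, 0) > max_age_seconds)

runs = [
    {"job": "db1001", "end_ts": 990_000, "ok": True, "bytes": 1234},
    {"job": "web2002", "end_ts": 995_000, "ok": True, "bytes": 0},  # empty: not "good"
    {"job": "web2002", "end_ts": 800_000, "ok": True, "bytes": 999},
]
print(stale_jobs(runs, max_age_seconds=100_000, now=1_000_000))  # → ['web2002']
```

The same logic works directly in PromQL if the last-good-backup timestamp is exported as a gauge (roughly `time() - bacula_job_last_success_timestamp_seconds > threshold`, names assumed as above).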
[14:33:06] in any case, I am not sure I will be able to use the cli efficiently
[14:33:14] and may have to go to the db
[14:34:09] we'll see, I will start some high level proposal and will ask for feedback
[14:34:26] I have not yet fully finished either the migration or the check
[14:34:35] thanks to all for the ideas
[14:36:16] sounds good -- you're welcome jynus !