[09:28:28] tomorrow's pad is up!
[14:06:59] I got asked (indirectly) to create a prometheus exporter for bacula, and I would like to have a discussion, if possible, on how to best organize that- not time sensitive (before I start to build anything)
[14:08:08] what stats does the bacula master provide already?
[14:08:31] no metrics at all
[14:08:48] my question is to gather ideas on how to translate info into metrics that make sense
[14:09:06] e.g. a label per job
[14:09:20] or some other organization
[14:09:29] how many bacula jobs do we have, approx?
[14:09:34] 92
[14:09:54] oh okay, i was worried that it was thousands
[14:10:21] so I just have a lot of questions on how to best organize that
[14:10:57] also whether I should cache because queries could take a long time, or just create a wrapper around the db, metric organization, etc.
[14:11:28] how long is a long time? :)
[14:11:45] currently icinga takes around 16 seconds
[14:12:04] I could do it much faster, but I would rely on internal db structure, which I wanted to avoid
[14:12:28] (the "api" [sic] is per job)
[14:13:05] basically, I have a bunch of information, but not in a time series
[14:13:25] so I am wondering how to "convert" it into one
[14:15:22] while info on bacula is mostly event-based: X happened at Y time and the result was Z
[14:16:47] from what you said it seems that periodically dumping the metrics would be best, e.g. in a plaintext file like we do for smart; it wouldn't be as general-purpose as an exporter but it'd work
[14:17:06] on the events I'd say a good start would be to export the unix timestamp of the last success of a job
[14:17:27] the unix epoch rather, being the metric value
[14:17:58] "last success of a job" is ok, and I will want that
[14:18:28] but the thing I got asked to have first is "backup success rate"
[14:19:35] ah, yeah in that case using 0 or 1 as values and "job success" as the metric would work
[14:20:03] and a counter that gets increased, for tried, aborted, errored?
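A minimal sketch of the textfile approach discussed above (periodically dumping per-job metrics to a plaintext file in the Prometheus text exposition format, like the smart metrics): the metric names, the `render_metrics` helper, and the job data are illustrative assumptions, not actual Bacula or production code.

```python
# Sketch: emit a last-success timestamp gauge and a last-run-status gauge
# per Bacula job, in the Prometheus text exposition format, suitable for
# the node_exporter textfile collector. Job data here is hypothetical.

def render_metrics(jobs):
    """jobs: list of dicts with 'name', 'last_success_ts', 'last_run_ok'."""
    lines = [
        "# HELP bacula_job_last_success_timestamp_seconds Unix time of last successful run.",
        "# TYPE bacula_job_last_success_timestamp_seconds gauge",
    ]
    for job in jobs:
        lines.append(
            'bacula_job_last_success_timestamp_seconds{job="%s"} %d'
            % (job["name"], job["last_success_ts"])
        )
    lines.append("# HELP bacula_job_success Whether the last run succeeded (1) or not (0).")
    lines.append("# TYPE bacula_job_success gauge")
    for job in jobs:
        lines.append(
            'bacula_job_success{job="%s"} %d'
            % (job["name"], 1 if job["last_run_ok"] else 0)
        )
    return "\n".join(lines) + "\n"

# Hypothetical job data, as it might be gathered from bconsole or the db:
jobs = [
    {"name": "db1001", "last_success_ts": 1571200000, "last_run_ok": True},
    {"name": "web2002", "last_success_ts": 1571100000, "last_run_ok": False},
]
print(render_metrics(jobs))
```

In practice the rendered text would be written to a `.prom` file atomically (write to a temp file, then rename) so the collector never scrapes a half-written file.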
[14:20:11] e.g. bacula_job_success{job="..."} 0/1
[14:20:28] oh, I don't need to count, true
[14:20:31] yeah i think a last-successful-timestamp gauge, and a last-run-status gauge
[14:20:38] prometheus can do it for me
[14:21:27] wait, but how do I reconcile the "job execution rate" vs the "exporter execution rate"
[14:22:03] e.g. in 1 minute 0 jobs could have run, or 5
[14:22:05] if you wanted an actual rate of job success/failure you'd need counters
[14:22:58] indeed
[14:23:23] I will give it a think, think of concrete metrics that would be interesting
[14:23:30] and then ask for feedback
[14:23:34] another route, if this can be easily extracted from bacula logs, is mtail
[14:23:36] i can also imagine something like
[14:23:37] thanks for the suggestions
[14:23:41] a "bytes stored" gauge
[14:23:43] for capacity planning
[14:23:57] maybe the existing filesystem metrics are enough for that, i don't know
[14:23:58] yeah, that is the easy part :-)
[14:24:16] because that is already "aggregated"
[14:24:21] herron: what do you mean?
[14:25:34] oh, I see: logs -> metrics
[14:25:37] we use mtail in some places to gather metrics from log events, where a native exporter isn't available or being used.
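To sketch the counter idea raised above: cumulative per-job counters by terminal status decouple the job execution rate from the scrape rate, because Prometheus's `rate()` works on the monotonic totals no matter how many runs happened between scrapes. The status values and metric name below are assumptions for illustration, not Bacula's actual terminology.

```python
# Sketch: cumulative run counters by (job, status), rendered as a single
# Prometheus counter metric with labels. Statuses here ("ok", "error",
# "aborted") are hypothetical placeholders for Bacula's real job states.
from collections import Counter

def count_runs(history):
    """history: iterable of (job, status) tuples, e.g. from the job log/db."""
    counts = Counter()
    for job, status in history:
        counts[(job, status)] += 1
    return counts

def render_counters(counts):
    lines = [
        "# HELP bacula_job_runs_total Job runs by terminal status.",
        "# TYPE bacula_job_runs_total counter",
    ]
    for (job, status), n in sorted(counts.items()):
        lines.append('bacula_job_runs_total{job="%s",status="%s"} %d' % (job, status, n))
    return "\n".join(lines) + "\n"

history = [("db1001", "ok"), ("db1001", "error"), ("db1001", "ok"), ("web2002", "aborted")]
print(render_counters(count_runs(history)))
```

With that in place, a failure rate could be queried with something like `rate(bacula_job_runs_total{status="error"}[1d])` (illustrative PromQL), sidestepping the 0-or-5-jobs-per-minute reconciliation problem.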
[14:25:43] sadly bacula doesn't have parsable logs
[14:25:49] it is a pull kind of thing
[14:25:54] gotcha
[14:26:07] it has logs, but they are terrible for machine analysis
[14:26:19] *automatic
[14:26:25] a useful metric could be the time a job needs to run, to highlight issues if a job starts taking much longer or shorter; it could even be combined with the size metric to potentially notify of anomalies
[14:26:55] volans: indeed, I wanted that, but I think bacula doesn't have those on the cli
[14:27:05] so I may have to go to the db for those
[14:29:03] the problem is that backup success rate is really not that important- the important thing is to have fresh backups- if one has to be retried, that is still good
[14:29:28] and backups will just fail for the silliest reasons (network, host maintenance, etc.)
[14:29:57] I think last-success-timestamp is the most important thing
[14:30:08] yeah, that is what I check on icinga
[14:30:14] the rates at which jobs are failing or succeeding don't tell you much I think
[14:30:20] the "freshness" of the last good backup with non 0 bytes
[14:30:23] yeah
[14:30:39] but sadly, I also need the success rate because bureaucracy :-)
[14:30:46] jynus: that reminds me of sth I forgot to mention during the review of the bacula icinga alert: using bconsole's 'llist' instead of 'list', as the former has more information
[14:30:59] ah, I may have missed that
[14:31:21] I think I didn't put too much effort in because for the bacula check I only needed some minimum things
[14:31:33] for metrics we can store much more stuff
[14:31:52] indeed
[14:32:11] for the check I just needed "do we have a good backup"?
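The freshness criterion described above ("last good backup with non 0 bytes") can be sketched as a small staleness check; the field names (`job`, `end_ts`, `ok`, `bytes`) are hypothetical placeholders, not Bacula's actual schema.

```python
# Sketch: flag jobs whose most recent good, non-zero-byte backup is older
# than a threshold -- the "fresh backups" property discussed above.
import time

def stale_jobs(runs, max_age_seconds, now=None):
    """runs: list of dicts with 'job', 'end_ts', 'ok', 'bytes'.
    Returns job names with no good non-empty backup within max_age_seconds."""
    now = time.time() if now is None else now
    last_good = {}
    for r in runs:
        # A run only counts as "good" if it succeeded AND stored bytes.
        if r["ok"] and r["bytes"] > 0:
            last_good[r["job"]] = max(last_good.get(r["job"], 0), r["end_ts"])
    all_jobs = {r["job"] for r in runs}
    return sorted(j for j in all_jobs if now - last_good.get(j, 0) > max_age_seconds)

runs = [
    {"job": "db1001", "end_ts": 990_000, "ok": True, "bytes": 1234},
    {"job": "web2002", "end_ts": 995_000, "ok": True, "bytes": 0},  # empty: not "good"
    {"job": "web2002", "end_ts": 800_000, "ok": True, "bytes": 999},
]
print(stale_jobs(runs, max_age_seconds=100_000, now=1_000_000))  # → ['web2002']
```

The same logic works directly in PromQL if the last-good-backup timestamp is exported as a gauge (roughly `time() - bacula_job_last_success_timestamp_seconds > threshold`, names assumed as above).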
[14:33:06] in any case, I am not sure I will be able to use the cli efficiently
[14:33:14] and may have to go to the db
[14:34:09] we'll see, I will start some high level proposal and will ask for feedback
[14:34:26] I have not yet fully finished either the migration or the check
[14:34:35] thanks to all for the ideas
[14:36:16] sounds good -- you're welcome jynus !