Add Prometheus /metrics support #420

Open
MrModest opened this issue Aug 18, 2024 · 16 comments
Labels
enhancement New feature or request help wanted Extra attention is needed p2

Comments

@MrModest

MrModest commented Aug 18, 2024

Thank you for the app! The WebUI looks nice and straightforward! I like it!

Is your feature request related to a problem? Please describe.
Even though the app provides some stats in the WebUI itself, it would be nice to be able to fetch metrics and configure custom dashboards in Grafana.

For example, I'd love to create a dashboard (based on these metrics) that shows a list of snapshots/repositories with info like last timestamp, original backup size, size after deduplication and compression, saved space ratio.

Or a time series that shows the duration of executed backups or the growth in backup size.

@MrModest MrModest added the enhancement New feature or request label Aug 18, 2024
@garethgeorge
Owner

Hey, I think Prometheus support is definitely something that should be on my roadmap.

Looking into it a bit, the major metric types are listed here: https://prometheus.io/docs/concepts/metric_types/#metric-types

I think that it'd make sense for me to export counters for each repo with names e.g.

  • repo_<repo ID>_snapshot_count
  • repo_<repo ID>_size
  • etc

And similar for each plan i.e.

  • plan_<plan ID>_snapshot_count
  • plan_<plan ID>_error_count

etc.

@garethgeorge garethgeorge added the help wanted Extra attention is needed label Aug 21, 2024
@garethgeorge
Owner

Started work on Prometheus metrics in

https://github.com/garethgeorge/backrest/pull/459/files

Added metrics:

	commonDims := []string{"repo_id", "plan_id"}

	registry := &Registry{
		reg: prometheus.NewRegistry(),
		backupBytesProcessed: prometheus.NewSummaryVec(prometheus.SummaryOpts{
			Name: "backrest_backup_bytes_processed",
			Help: "The total number of bytes processed during a backup",
		}, commonDims),
		backupBytesAdded: prometheus.NewSummaryVec(prometheus.SummaryOpts{
			Name: "backrest_backup_bytes_added",
			Help: "The total number of bytes added during a backup",
		}, commonDims),
		backupFileWarnings: prometheus.NewSummaryVec(prometheus.SummaryOpts{
			Name: "backrest_backup_file_warnings",
			Help: "The total number of file warnings during a backup",
		}, commonDims),
		tasksDuration: prometheus.NewSummaryVec(prometheus.SummaryOpts{
			Name: "backrest_tasks_duration_secs",
			Help: "The duration of a task in seconds",
		}, append(slices.Clone(commonDims), "task_type")),
		tasksRun: prometheus.NewCounterVec(prometheus.CounterOpts{
			Name: "backrest_tasks_run_total",
			Help: "The total number of tasks run",
		}, append(slices.Clone(commonDims), "task_type", "status")),
		tasksErrors: prometheus.NewCounterVec(prometheus.CounterOpts{
			Name: "backrest_tasks_errors_total",
			Help: "The total number of tasks that errored",
		}, append(slices.Clone(commonDims), "task_type")),
	}

These have a number of dimensions, typically "repo_id" and "plan_id" at least, but also "task_type" and "status" for the task level metrics.
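
To make the dimensions concrete, here is a minimal stdlib-only sketch of how one labelled sample looks in the Prometheus text exposition style. It does not use the Prometheus client library; `formatSample` is a hypothetical helper, the `task_type` and `status` values and the count are made up for illustration, and the sorted label order is an assumption chosen only to keep the output stable.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatSample renders one sample as name{label="value",...} value.
// Labels are sorted by name so the output is deterministic.
// Note: %q uses Go string quoting, which is only a close match for
// Prometheus label escaping on simple ASCII values like the ones below.
func formatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	// Hypothetical label values and count, for illustration only.
	fmt.Println(formatSample("backrest_tasks_run_total", map[string]string{
		"repo_id":   "main__local",
		"plan_id":   "main__local__daily",
		"task_type": "backup",
		"status":    "success",
	}, 14))
}
```

Each distinct combination of label values becomes its own series, so a dashboard can filter or group by `repo_id` or `plan_id` without the ID being baked into the metric name.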

I've not actually set up Prometheus on any of my machines before, so I'm interested to hear from anyone with experience building dashboards: how will this be to work with, and is this a good setup?

@MrModest
Author

MrModest commented Sep 8, 2024

In my work, we usually just use a library called micrometer (in Java), so can't tell much about it, unfortunately 😅

But if you build and push a container with a test version to the registry, I can try to build a dashboard in grafana and share my experience :)

Btw, I don't see a metric for the compression ratio or is it calculated via "bytes_added" and "bytes_processed"?
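
If the ratio were to be derived from those two metrics, one plausible calculation is `1 - bytes_added / bytes_processed`. The sketch below is only an illustration under that assumption: `savedRatio` is a hypothetical helper, the input values are made up, and the thread does not confirm whether "bytes added" is measured before or after compression.

```go
package main

import "fmt"

// savedRatio returns the fraction of processed bytes that did not have to be
// stored, i.e. 1 - added/processed. It returns 0 when nothing was processed.
func savedRatio(bytesProcessed, bytesAdded float64) float64 {
	if bytesProcessed <= 0 {
		return 0
	}
	return 1 - bytesAdded/bytesProcessed
}

func main() {
	// Hypothetical values: 8 GiB scanned, 512 MiB newly written.
	processed := 8.0 * 1024 * 1024 * 1024
	added := 512.0 * 1024 * 1024
	fmt.Printf("saved ratio: %.4f\n", savedRatio(processed, added))
}
```

With these made-up inputs the ratio is 0.9375, meaning roughly 94% of the scanned data was already present in the repository or otherwise not written.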

Also, judging by the metric names, it looks like all metrics are at the level of an individual backup (snapshot?). Or am I misreading them, and they are all at the repository level?

@garethgeorge
Owner

Hey, nothing added yet for repo level stats -- that's something that can definitely be expanded on, but because of how restic works, stats are only computed infrequently at the moment (each time a prune runs, stats are computed if it's been 30 days since the last stats check).

At the moment all of the metrics are exported at the plan level. I'll probably need to spend some time setting up prometheus and actually prototyping some dashboards to get a sense of what this will look like.

@garethgeorge
Owner

The CI system provides preview builds, e.g. with Prometheus support: https://github.com/garethgeorge/backrest/actions/runs/10754697192. They aren't dockerized, though, so local testing means either a direct install OR swapping the binary in the image with a short Dockerfile!

@garethgeorge
Owner

garethgeorge commented Sep 13, 2024

Initial Prometheus metrics support went out in 1.5.0. Docs aren't written up yet; this is largely a preview, as I may rename or redefine a few of these. Definitions can be found in PR #459 and may change in the next release.

Note: normal authentication applies to the /metrics endpoint so you'll want to disable auth to use this feature.

@MrModest
Author

MrModest commented Sep 21, 2024

Sorry for the late reply. Sometimes it's very hard to find free time :D
And thank you for the test version on Docker Hub.

I played with the test version, and here are my findings.

  • The sizes of the folders being backed up:
    • /home - 605.4 MiB
    • /mnt/pools/fast/apps-data - 6.1 GiB
    • /mnt/pools/slow/backups/db_dumps - 3.1 GiB
    • So, the total is 6.1 + 3.1 + (605.4/1024) ~ 9.79 GiB
  • The size of my local repo is 9.2 GiB.

(All size measurements made with ncdu v1.15.1)

Some info from the restic CLI from inside the container:

backrest:/# restic-0.17.0 snapshots -r /repos/main
repository e4fc16be opened (version 2, compression level auto)
ID        Time                 Host        Tags                                           Paths                                    Size
--------------------------------------------------------------------------------------------------------------------------------------------
7764c799  2024-08-18 19:25:38  backrest    plan:main__local__daily,created-by:HomeServer  /hostfs/home                             2.113 GiB
                                                                                          /hostfs/mnt/pools/fast/apps-data
                                                                                          /hostfs/mnt/pools/slow/backups/db_dumps

0bc578c9  2024-08-31 09:32:18  backrest    plan:main__local__daily,created-by:HomeServer  /hostfs/home                             1.460 GiB
                                                                                          /hostfs/mnt/pools/fast/apps-data
                                                                                          /hostfs/mnt/pools/slow/backups/db_dumps

03837cf1  2024-09-01 11:58:19  backrest    plan:main__local__daily,created-by:HomeServer  /hostfs/home                             1.748 GiB
                                                                                          /hostfs/mnt/pools/fast/apps-data
                                                                                          /hostfs/mnt/pools/slow/backups/db_dumps

bccbf5ea  2024-09-08 02:00:01  backrest    plan:main__local__daily,created-by:HomeServer  /hostfs/home                             3.943 GiB
                                                                                          /hostfs/mnt/pools/fast/apps-data
                                                                                          /hostfs/mnt/pools/slow/backups/db_dumps

f5a95a68  2024-09-12 02:00:01  backrest    plan:main__local__daily,created-by:HomeServer  /hostfs/home                             5.509 GiB
                                                                                          /hostfs/mnt/pools/fast/apps-data
                                                                                          /hostfs/mnt/pools/slow/backups/db_dumps

59a55178  2024-09-13 02:00:01  backrest    plan:main__local__daily,created-by:HomeServer  /hostfs/home                             5.907 GiB
                                                                                          /hostfs/mnt/pools/fast/apps-data
                                                                                          /hostfs/mnt/pools/slow/backups/db_dumps

cbd8d84c  2024-09-14 02:00:01  backrest    plan:main__local__daily,created-by:HomeServer  /hostfs/home                             6.316 GiB
                                                                                          /hostfs/mnt/pools/fast/apps-data
                                                                                          /hostfs/mnt/pools/slow/backups/db_dumps

8ddf324e  2024-09-15 02:00:01  backrest    plan:main__local__daily,created-by:HomeServer  /hostfs/home                             6.707 GiB
                                                                                          /hostfs/mnt/pools/fast/apps-data
                                                                                          /hostfs/mnt/pools/slow/backups/db_dumps

8e115dcb  2024-09-16 02:00:01  backrest    plan:main__local__daily,created-by:HomeServer  /hostfs/home                             7.110 GiB
                                                                                          /hostfs/mnt/pools/fast/apps-data
                                                                                          /hostfs/mnt/pools/slow/backups/db_dumps

ad4fbbc9  2024-09-17 02:00:01  backrest    plan:main__local__daily,created-by:HomeServer  /hostfs/home                             7.508 GiB
                                                                                          /hostfs/mnt/pools/fast/apps-data
                                                                                          /hostfs/mnt/pools/slow/backups/db_dumps

158fa862  2024-09-18 02:00:01  backrest    plan:main__local__daily,created-by:HomeServer  /hostfs/home                             7.903 GiB
                                                                                          /hostfs/mnt/pools/fast/apps-data
                                                                                          /hostfs/mnt/pools/slow/backups/db_dumps

f57dac83  2024-09-19 02:00:01  backrest    plan:main__local__daily,created-by:HomeServer  /hostfs/home                             8.299 GiB
                                                                                          /hostfs/mnt/pools/fast/apps-data
                                                                                          /hostfs/mnt/pools/slow/backups/db_dumps

4b990bfe  2024-09-20 02:00:01  backrest    plan:main__local__daily,created-by:HomeServer  /hostfs/home                             8.696 GiB
                                                                                          /hostfs/mnt/pools/fast/apps-data
                                                                                          /hostfs/mnt/pools/slow/backups/db_dumps

e60c8e56  2024-09-21 02:00:01  backrest    plan:main__local__daily,created-by:HomeServer  /hostfs/home                             8.799 GiB
                                                                                          /hostfs/mnt/pools/fast/apps-data
                                                                                          /hostfs/mnt/pools/slow/backups/db_dumps
--------------------------------------------------------------------------------------------------------------------------------------------
14 snapshots
backrest:/# restic-0.17.0 stats -r /repos/main
repository e4fc16be opened (version 2, compression level auto)
[0:00] 100.00%  1 / 1 index files loaded
scanning...
Stats in restore-size mode:
     Snapshots processed:  14
        Total File Count:  78315
              Total Size:  82.017 GiB
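
As a quick cross-check of the two outputs above: the 82.017 GiB reported in restore-size mode is, within rounding of the displayed values, the sum of the per-snapshot sizes from the `snapshots` listing, which would explain why it is so much larger than the 9.2 GiB repository on disk. A small sketch of that arithmetic (the variable and function names are mine; the numbers are copied from the Size column):

```go
package main

import "fmt"

// snapshotSizesGiB holds the Size column of the `restic snapshots` listing.
var snapshotSizesGiB = []float64{
	2.113, 1.460, 1.748, 3.943, 5.509, 5.907, 6.316,
	6.707, 7.110, 7.508, 7.903, 8.299, 8.696, 8.799,
}

// sumGiB adds up the per-snapshot sizes.
func sumGiB(sizes []float64) float64 {
	total := 0.0
	for _, s := range sizes {
		total += s
	}
	return total
}

func main() {
	fmt.Printf("%d snapshots, %.3f GiB in total\n",
		len(snapshotSizesGiB), sumGiB(snapshotSizesGiB))
}
```

The sum comes to 82.018 GiB, so the restore-size total counts every snapshot in full rather than the deduplicated data actually stored.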

Metric browser in Grafana shows these metrics:
image

I tried to create several panels:

  1. For backrest_backup_bytes_added_sum
image
  2. For backrest_backup_bytes_processed_sum
image
  3. For backrest_tasks_duration_secs_sum
image

@MrModest
Author

I don't fully understand which values these metrics are supposed to show. The first panel doesn't seem to show the accurate repo size, and for the second one I don't understand what I'm supposed to get from it 😵‍💫

@MrModest
Author

Almost forgot, the stats from Backrest's UI:

main__local repo

image

main__mailru-webdav repo

image

@MrModest
Author

MrModest commented Sep 21, 2024

I shared more details about my setup in another issue: #457

The only difference since then is that the version is now v1.5.0.

@garethgeorge
Owner

Thanks for experimenting with this -- appreciate it and love seeing the graphs.

Looks like I made a few mistakes when defining metrics -- it seems like a lot of the metrics I exported are sums, but a gauge seems like it would be more appropriate for bytes added, bytes processed, and task duration based on how it's appearing in charts.
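
A stdlib-only sketch of the difference, assuming the summaries are exported without quantiles so that only the `_sum` and `_count` series exist (which matches the `_sum` names seen in the panels). The `summary` and `gauge` types here are simplified stand-ins, not the Prometheus client types, and the byte values are hypothetical.

```go
package main

import "fmt"

// summary mimics the two series a Prometheus summary always exposes:
// a running sum and a running count of observations.
type summary struct {
	sum   float64
	count uint64
}

// Observe adds one observation to the running totals.
func (s *summary) Observe(v float64) {
	s.sum += v
	s.count++
}

// gauge mimics a Prometheus gauge: it only remembers the last value set.
type gauge struct {
	value float64
}

// Set overwrites the stored value.
func (g *gauge) Set(v float64) { g.value = v }

func main() {
	var s summary
	var g gauge
	// Hypothetical bytes added by three consecutive backups.
	for _, added := range []float64{400, 100, 50} {
		s.Observe(added)
		g.Set(added)
	}
	fmt.Printf("summary: sum=%g count=%d\n", s.sum, s.count)
	fmt.Printf("gauge:   value=%g\n", g.value)
}
```

Charting the summary's sum gives an ever-growing line (550 here), while a gauge would show the most recent backup's value (50) and keep reporting it on every scrape until the next backup overwrites it.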

I also agree that some more metrics need exporting. It would make sense for Backrest to export the stats that it collects whenever executing a stats operation (the caveat I'd add here is that I'm not sure how likely it is that they'll get scraped reliably? stats is a very infrequently run operation).

@MrModest
Author

MrModest commented Oct 25, 2024

I found this project, which also relies on restic and implements Prometheus metrics. Maybe you can learn something from its source code or get some inspiration for metrics?

https://github.com/netinvent/npbackup?tab=readme-ov-file#monitoring

@nlsrchtr

Hi @MrModest,

I'm struggling to set up an alert with Prometheus based on the exposed metrics. I thought I could use the backrest_backup_file_warnings metric, but I don't have a good idea how. Since backups can also shrink over time, the size of the backup doesn't seem like a good indicator to me.

Could you help me out here?

P.S.: Would you be able to share your Grafana dashboards?

@MrModest
Author

MrModest commented Dec 11, 2024

Hi @nlsrchtr I don't have any alerts so far. Just dashboards: https://gist.github.com/MrModest/3dd90ed388456886e09e6c18fb6a358f

But, TBH, I haven't checked them for a long time, so I wouldn't say that they make any sense :D

For example, the 1st one is definitely lying (if compared to the screenshot from Backrest itself):
image

I hope that @garethgeorge will add gauge metrics, so it will be easier to monitor. I don't see much value in counters in this scenario :D

@garethgeorge
Owner

Hey, Prometheus metrics are definitely a feature that still needs some love. The main blocker at the moment is that I just haven't had time to set up a Prometheus install on my system to create my own configuration and iterate on making the exported data more useful.

Perhaps I can find some time soon to borrow @MrModest's configuration and mess with this. I'm fairly heads down on #562 when I have time for Backrest work, so I'd also be very happy to take PRs on the Prometheus front if there's something you can pinpoint that needs changing about how Backrest exports its metrics.

Metrics are defined in

	package metric

	import (
		"net/http"
		"slices"

		"github.com/prometheus/client_golang/prometheus"
		"github.com/prometheus/client_golang/prometheus/promhttp"
	)

	var (
		globalRegistry = initRegistry()
	)

	func initRegistry() *Registry {
		commonDims := []string{"repo_id", "plan_id"}

		registry := &Registry{
			reg: prometheus.NewRegistry(),
			backupBytesProcessed: prometheus.NewSummaryVec(prometheus.SummaryOpts{
				Name: "backrest_backup_bytes_processed",
				Help: "The total number of bytes processed during a backup",
			}, commonDims),
			backupBytesAdded: prometheus.NewSummaryVec(prometheus.SummaryOpts{
				Name: "backrest_backup_bytes_added",
				Help: "The total number of bytes added during a backup",
			}, commonDims),
			backupFileWarnings: prometheus.NewSummaryVec(prometheus.SummaryOpts{
				Name: "backrest_backup_file_warnings",
				Help: "The total number of file warnings during a backup",
			}, commonDims),
			tasksDuration: prometheus.NewSummaryVec(prometheus.SummaryOpts{
				Name: "backrest_tasks_duration_secs",
				Help: "The duration of a task in seconds",
			}, append(slices.Clone(commonDims), "task_type")),
			tasksRun: prometheus.NewCounterVec(prometheus.CounterOpts{
				Name: "backrest_tasks_run_total",
				Help: "The total number of tasks run",
			}, append(slices.Clone(commonDims), "task_type", "status")),
		}

		registry.reg.MustRegister(registry.backupBytesProcessed)
		registry.reg.MustRegister(registry.backupBytesAdded)
		registry.reg.MustRegister(registry.backupFileWarnings)
		registry.reg.MustRegister(registry.tasksDuration)
		registry.reg.MustRegister(registry.tasksRun)

		return registry
	}

	func GetRegistry() *Registry {
		return globalRegistry
	}

	type Registry struct {
		reg                  *prometheus.Registry
		backupBytesProcessed *prometheus.SummaryVec
		backupBytesAdded     *prometheus.SummaryVec
		backupFileWarnings   *prometheus.SummaryVec
		tasksDuration        *prometheus.SummaryVec
		tasksRun             *prometheus.CounterVec
	}

	func (r *Registry) Handler() http.Handler {
		return promhttp.HandlerFor(r.reg, promhttp.HandlerOpts{})
	}

	func (r *Registry) RecordTaskRun(repoID, planID, taskType string, duration_secs float64, status string) {
		if repoID == "" {
			repoID = "_unassociated_"
		}
		if planID == "" {
			planID = "_unassociated_"
		}
		r.tasksRun.WithLabelValues(repoID, planID, taskType, status).Inc()
		r.tasksDuration.WithLabelValues(repoID, planID, taskType).Observe(duration_secs)
	}

	func (r *Registry) RecordBackupSummary(repoID, planID string, bytesProcessed, bytesAdded int64, fileWarnings int64) {
		r.backupBytesProcessed.WithLabelValues(repoID, planID).Observe(float64(bytesProcessed))
		r.backupBytesAdded.WithLabelValues(repoID, planID).Observe(float64(bytesAdded))
		r.backupFileWarnings.WithLabelValues(repoID, planID).Observe(float64(fileWarnings))
	}
and types can be tweaked easily -- I wouldn't consider the prometheus metrics to be stable yet so breaking changes here are fine.

This approach is pretty good for exporting info about task runs e.g. backups, forgets, etc. But it's harder for infrequent operations i.e. prune or stats commands.

@titilambert
Contributor

Hello! I'm also trying to set up some alerts and build some Grafana graphs with the metrics (that's why I made PR #625).
But I have found another issue. I went through metric.go and was wondering why you chose to use SummaryVec instead of GaugeVec. Reading this doc, https://prometheus.io/docs/tutorials/understanding_metric_types/, I would use a Gauge and update the value for each task.
Then you would have the exact state of your backup.

If you're open to this change I can make the PR.
Thanks for your response @garethgeorge

4 participants