Centralized monitoring dashboard with alerts for Veeam B&R.
Works with the community edition.
Relatively easily adjusted to any backup solution that can report basic info.
A powershell script periodically runs on machines running VBR, gathering information about backup jobs and repositories.
This info gets pushed to a prometheus pushgateway, from where it gets scraped into prometheus.
A grafana dashboard then visualizes the gathered information.
- VBR is installed on a windows machine. Can be physical or virtual.
- It needs a repository where to store backups. Can be local drives, network storage, cloud,..
- Job logs are in
C:\ProgramData\Veeam\Backup
- Various types of jobs are created that run regularly, creating backups.

For Hyper-V / VMware:
Veeam has admin credentials for the hypervisor.
It initiates the backup process on schedule, creates a snapshot of a VM, processes the VM's data, copies it into a repository, and deletes the snapshot.
A VM's data are stored in a single file: vbk for a full backup, vib for an incremental backup.
Veeam by default creates a weekly synthetic full backup, which combines previous backups into a new standalone vbk.
For network shares, also called just File Backup.
Differs from VM backup in the way files are stored: no vbk and vib files, but a bunch of vblob files.
Also, long-term retention requires an archive repository, which is not available in the community edition.
For physical machines, intended for the ones that run 24/7 and should always be accessible by Veeam.
Very similar to VM backups. The VBR server initiates the backup, the agent installed on the machine creates a VSS snapshot, and the data end up in a repository, either in a vbk file or a vib file.
Intended for use with workstations that don't have regular connectivity with the VBR server. VBR installs an agent on the machine and hands it an XML configuration, a backup policy, that tells it how and where to regularly back up. After that it's hands-off; the agent is in charge.
Veeam periodically tries to sync the current policy settings with the already deployed agents during protection group rescans.
This one was a bit tricky to monitor, as a job's history contains not just backup sessions but also the policy updates. Some extra steps are needed in the powershell script to get backup runs without the policy updates.
Here is a guide-by-example for monitoring using Prometheus, Grafana, Loki. Might be useful, as it goes into more detail.
/home/
└── ~/
└── docker/
└── veeam_monitoring/
├── 🗁 grafana_data/
├── 🗁 prometheus_data/
├── 🗋 .env
├── 🗋 docker-compose.yml
└── 🗋 prometheus.yml
grafana_data/ - a directory where grafana stores its data
prometheus_data/ - a directory where prometheus stores its database and data
.env - a file containing environment variables for docker compose
docker-compose.yml - a docker compose file, telling docker how to run the containers
prometheus.yml - a configuration file for prometheus
The 3 files must be provided.
The directories are created by docker compose on the first run.
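The layout above can be created in a few commands; the paths are assumed to match the tree shown, adjust to taste:

```shell
# Create the project directory and the three files that must be provided;
# grafana_data/ and prometheus_data/ are left for docker compose to create.
mkdir -p ~/docker/veeam_monitoring
cd ~/docker/veeam_monitoring
touch .env docker-compose.yml prometheus.yml
ls -A
```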
Three containers to spin up.
- Prometheus - prometheus server, pulling, storing, evaluating metrics.
- Pushgateway - web server ready to receive pushed information.
- Grafana - web GUI visualization of the collected metrics in nice dashboards.
Of note for the prometheus container are the data retention set to 45 days, and the admin API being enabled.
Pushgateway has the admin API enabled too, to be able to execute wipes.
docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:v2.43.1
    container_name: prometheus
    hostname: prometheus
    restart: unless-stopped
    user: root
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=45d'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus_data:/prometheus
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:9.5.2
    container_name: grafana
    hostname: grafana
    restart: unless-stopped
    env_file: .env
    user: root
    volumes:
      - ./grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"

  pushgateway:
    image: prom/pushgateway:v1.5.1
    container_name: pushgateway
    hostname: pushgateway
    restart: unless-stopped
    command:
      - '--web.enable-admin-api'
    ports:
      - "9091:9091"

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true
.env
# GENERAL
DOCKER_MY_NETWORK=caddy_net
TZ=Europe/Bratislava
# GRAFANA
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=admin
GF_USERS_ALLOW_SIGN_UP=false
GF_SERVER_ROOT_URL=https://grafana.example.com
# GRAFANA EMAIL SETTINGS
GF_SMTP_ENABLED=true
GF_SMTP_HOST=smtp-relay.sendinblue.com:587
GF_SMTP_USER=example@gmail.com
GF_SMTP_PASSWORD=xzu0dfFhn3eqa
GF_SMTP_STARTTLS_POLICY=NoStartTLS
# GRAFANA CUSTOM SETTINGS
# DATE FORMATS SWITCHED TO NAMES OF THE DAYS OF THE WEEK
#GF_DATE_FORMATS_INTERVAL_HOUR=dddd
#GF_DATE_FORMATS_INTERVAL_DAY=dddd
The containers must be on a custom named docker network, along with the caddy reverse proxy. This allows hostname resolution.
The network name is set in the .env file, in the DOCKER_MY_NETWORK variable.
If one does not exist yet: docker network create caddy_net
In the .env file there are also two date settings for grafana, commented out.
Uncomment them to show the full names of the days of the week instead of exact dates.
A config file for prometheus, bind mounted into the prometheus container.
Of note is honor_labels set to true, which means that conflicting labels, like job, set during push are kept over labels set by prometheus.yml for that scrape job.
Docs.
prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'pushgateway-scrape'
    scrape_interval: 60s
    honor_labels: true
    static_configs:
      - targets: ['pushgateway:9091']
Caddy v2 is used, details here.

Caddyfile
grafana.{$MY_DOMAIN} {
    reverse_proxy grafana:3000
}

push.{$MY_DOMAIN} {
    reverse_proxy pushgateway:9091
}

# prom.{$MY_DOMAIN} {
#     reverse_proxy prometheus:9090
# }
docker compose up -d
- First run: login with admin/admin.
- In Preferences > Datasources, set http://prometheus:9090 as the url. Save and test should be green.
- Once some metrics are pushed to prometheus, they should be searchable in the Explore section in Grafana.
A section written during first testing.
What should work at this moment:
- <docker-host-ip>:3000 - grafana
- <docker-host-ip>:9090 - prometheus
- <docker-host-ip>:9091 - pushgateway
- metrics must be floats
- naming convention is to end metric names with units
- labels in the url are used to pass string info and to mark the metrics
- roughly, what job and instance represent: in pushgateway, job is still the overall grouping, while instance identifies the final unique source
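A minimal sketch of what a valid push body looks like, using made-up disk numbers; each line is a metric name, a space, and a float value, and the body must end in a newline:

```shell
# Hypothetical values; metric names end with units per the convention above.
free_bytes=32212254720
total_bytes=256060514304
body="storage_diskC_free_space_bytes ${free_bytes}
storage_diskC_total_size_bytes ${total_bytes}
"
printf '%s' "$body"
# Pushing it to the pushgateway from this guide would be:
# curl --data-binary "$body" http://10.0.19.4:9091/metrics/job/veeam_report/instance/PC1
```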
Prometheus requires linux line endings. The "`n" in the $body is there to provide one in windows powershell.
Also, in powershell the grave (backtick) character - ` - is used for escaping. Here it also escapes the new line, which allows breaking a command into multiple easier-to-read lines.
Though it caused issues, introducing a space where it should not be, which is why -Uri is always full length in the final script.
God damn fragile powershell.
test.ps1
$body = "storage_diskC_free_space_bytes 32`n"
Invoke-RestMethod `
    -Method PUT `
    -Uri "http://10.0.19.4:9091/metrics/job/veeam_report/instance/PC1" `
    -Body $body
- in the $body we have the name of the metric - storage_diskC_free_space_bytes - and the value of that metric - 32
- in the url, after 10.0.19.4:9091/metrics/, we have two labels defined: job=veeam_report and instance=PC1
  note the pattern - a label's name and its value must always come in a pair. They can be named whatever, but job and instance are customary
Here's how the data look in prometheus when executing the storage_diskC_free_space_bytes query.
The labels help us target the data in grafana.
- create a new dashboard, panel
- switch type to Status history
- select metric - storage_diskC_free_space_bytes
- query options
  - min interval - 1h
  - relative time - now-10h/h
- to not deal with long ugly names, add a transformation - Rename by regex
  - Match - .+instance="([^"]*).* - explained
  - Replace - $1
- can also play with transparency, legend, thresholds for pretty colors
should look in the end somewhat like this
extra info
Examples.
This command deletes all metrics on prometheus, assuming the admin API is enabled:
curl -X POST -g 'http://10.0.19.4:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".*"}'
So there's the proof of concept of being able to send data to the pushgateway and visualize them in grafana.
Here's my basic understanding: how prometheus stores data, how to query, the difference between an instant vector and a range vector, some links.
The Script: veeam_prometheus_info_push.ps1
The script itself should be pretty informative with the comments in it.
Tested with VBR v12.
Might work with v11, except for agent-based backups, as there were bugs in the new cmdlets in that version.
Changelog
- v0.4
- added $ErrorActionPreference = "Stop" which will terminate script's execution on any error
- job run time window calculation changed from the end time to the start time
- detection of a job being a full backup is now separate part and done after the backup ends
- v0.3
- huge rewrite
- v0.2
- added pushing of repository disk usage info
- changed metrics name to include units
- general cleanup
- v0.1 - the initial script
Veeam now warns with every use of the Get-VBRJob cmdlet that future versions will not return agent-based backup jobs.
So to avoid tech debt, the script uses Get-VBRComputerBackupJob and Get-VBRComputerBackupJobSession, and got bigger and messier because of it, but should be more ready for that future.
- 0 = success
- 1 = warning
- 2 = failed
- -1 = running
- -11 = running full backup or full synthetic backup
- 99 = disabled or not scheduled
The double-digit ones are additions made by the script.
Also, agent-based backups needed a rewrite of their return values, as they used different ones.
This visualization of runs is not precise, it can be shifted some on the timeline, but it should be enough for a general overview.
Jobs themselves report that they are running, but this can miss short-running jobs. So in addition the script checks the last job's start time; if it was within the last hour, the result code is set to -1. So every job is shown as running for at least 1 hour.
This also means that if the script were scheduled to run periodically at intervals longer than an hour, it might miss runs. The default deployment is every 30 minutes.
Until a job is finished we don't know if a run was a full backup or a synthetic full, so there is also a check of the last end time. If it was within the last hour and the run was a full/synthetic full, the result code changes to -11.
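The logic described above, reduced to a sketch with epoch seconds; the variable names and values are illustrative, not from the actual script:

```shell
now=$(date +%s)
last_start=$(( now - 1800 ))   # hypothetical: the job started 30 minutes ago
last_end=$(( now - 7200 ))     # hypothetical: the last finished run ended 2 hours ago
was_full=1                     # hypothetical: that finished run was a full/synthetic full
result_code=0                  # 0 = success, as reported by the job itself

# started within the last hour -> show as running (-1)
if [ $(( now - last_start )) -lt 3600 ]; then
    result_code=-1
fi
# ended within the last hour AND was a full backup -> show as full backup (-11)
if [ $(( now - last_end )) -lt 3600 ] && [ "$was_full" -eq 1 ]; then
    result_code=-11
fi
echo "$result_code"   # -1: started recently, but the full-backup run ended too long ago
```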
- Data size - the size of the data being backed up.
  There is an issue of being unable to get the correct size for agent-based backups that target specific folders. If the backup target were an entire machine or a partition, the data would be correct. To get at least some approximation, the size of the last vbk file is used, multiplied by 1.3 to account for some compression.
- Backup size - the combined size of all backups of the job.
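The approximation can be sketched as follows; the byte count is hypothetical, and since POSIX shell has no floats, the 1.3 factor is scaled to 13/10:

```shell
vbk_size_bytes=10737418240                              # hypothetical last vbk file size (10 GiB)
approx_data_size_bytes=$(( vbk_size_bytes * 13 / 10 ))  # * 1.3 compression factor from this guide
echo "$approx_data_size_bytes"                          # 13958643712
```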
To ease the deployment:
- Download this repo.
- Extract.
- Edit veeam_prometheus_info_push.ps1 - set $BASE_URL and the $GROUP name.
- Run DEPLOY.cmd as an administrator.
- Done.
What happens under the hood:
- DEPLOY.cmd - checks if it runs as an administrator, ends if not.
- DEPLOY.cmd - creates the directory C:\Scripts if it does not exist.
- DEPLOY.cmd - checks if the script already exists; if it does, renames it by adding a random suffix.
- DEPLOY.cmd - copies veeam_prometheus_info_push.ps1 into C:\Scripts.
- DEPLOY.cmd - imports a task scheduler xml task named veeam_prometheus_info_push.
- TASKSCHEDULER - the task executes every 30 minutes, at xx:15 and xx:45, with a random delay of 30 seconds.
- TASKSCHEDULER - the task runs with the highest privileges as the user SYSTEM (S-1-5-18).
- DEPLOY.cmd - enables powershell script execution on that windows PC.
- DEPLOY.cmd - runs Unblock-File to allow the script's execution when not created locally.
Pushed data can be checked on the Pushgateway's url.
To delete all data from the pushgateway:
- from the web interface there's a button
curl -X PUT 10.0.19.4:9091/api/v1/admin/wipe
curl -X PUT https://push.example.com/api/v1/admin/wipe
Without any action, the pushed metrics sit on the pushgateway forever. This is intentional.
It is essential to wipe the pushgateway clean daily, to better visualize the lack of new reports coming in.
For this, the docker host can have a simple systemd service and timer.
How to setup systemd pushgateway_wipe.service
In /etc/systemd/system/
pushgateway_wipe.service
[Unit]
Description=wipe clean prometheus pushgateway
[Service]
Type=simple
ExecStart=curl -X PUT https://push.example.com/api/v1/admin/wipe
pushgateway_wipe.timer
[Unit]
Description=wipe clean prometheus pushgateway
[Timer]
OnCalendar=00:19:00
[Install]
WantedBy=timers.target
enable the timer: sudo systemctl enable pushgateway_wipe.timer
In the compose file the data retention is set to 45 days.
--storage.tsdb.retention.time=45d
Not much to really do once it runs. Checking values can be done through grafana, and for deletion one needs to use the API.
But still, one can access its web gui from the LAN side at <dockerhost>:9090, or set up web access to it from the outside like for grafana and pushgateway.
Official documentation on queries.
To query something, just write the plain metric name, like veeam_job_result_info.
The table tab shows results from a recent time window; switching to the graph tab allows a larger time range.
More targeted query, with the use of regex, signified by =~
veeam_job_result_info{instance=~"Backup Copy Job.*"}
To delete all metrics on prometheus:
curl -X POST -g 'http://10.0.19.4:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".*"}'
To delete metrics of an instance or a group:
curl -X POST -g 'https://prom.example.com/api/v1/admin/tsdb/delete_series?match[]={instance=~"^Backup.Copy.Job.*"}'
curl -X POST -g 'https://prom.example.com/api/v1/admin/tsdb/delete_series?match[]={group=~"CocaCola"}'
There's no white space in the query, so dots are used in place of spaces.
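The dot trick works because in a regex `.` matches any single character, including a space; demonstrated locally with grep:

```shell
# "Backup.Copy.Job" matches the real instance name "Backup Copy Job"
# because each dot matches the space.
echo 'Backup Copy Job 1' | grep -qE '^Backup.Copy.Job' && echo match
```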
The json file in this repo can be imported into grafana.
- VBR_dashboard_v2.json
- Dashboards > New > Import > paste json
Changelog
- v2 - changed the initial time ranges, fixed last run and last report times
- v1 - the initial dashboard
To set the dashboard to be shown right away when visiting the domain
User (right top corner) > Profile > Home Dashboard > Set > Save
The first panel is for seeing the last X days of backup history at a quick glance.
- Visualization = Status history
- Data source = Prometheus
- Query, switch from builder to code
veeam_job_result_info{job="veeam_job_report"}
- Query options > Min interval = 1h
This sets the "resolution" of status history panel,
but data are renewed by default only every 30min.
During the first setup something smaller like 10min looks good.
- two ways to have nice labels
  - Query > Options > Legend > switch from Auto to Custom
    Legend = {{name}} | {{group}}
  - Transform > Rename by regex
    Match = .+group="([^"]*).+instance="([^"]*).*
    Replace = $2 | $1
- Panel > title = Veeam Status History
- Status history > Show values = never
- Legend > Visibility = off
- Value mapping
- 0 = Successful; Green
- 1 = Warning; Yellow
- 2 = Failed; Red
- -1 = Running; Blue
- -11 = Full Backup; Purple
- 99 = Disabled | Unscheduled; Grey
This panel shows how full the repositories are.
Unfortunately grafana is not as capable as I hoped. While their example shows exactly what I wanted, they cheated by picking the same max value for all disks. So no nice GB and TB info, just percent.
Tried to float the idea of maybe addressing this in their discussion on github.
- Visualization = Bar gauge
- Data source = Prometheus
- Query, switch from builder to code
(veeam_repo_total_size_bytes{job="veeam_repo_report"} - veeam_repo_free_space_bytes{job="veeam_repo_report"}) / ((veeam_repo_total_size_bytes{job="veeam_repo_report"}) /100)
- Query > Options > Legend > switch from Auto to Custom
  Legend = {{name}} | {{server}} | {{group}}
- Panel > title = Repositories Disk Use
- Bar gauge > Display mode > Basic
- Standard options > Unit = Misc > Percent (0-100)
- Standard options > Min = 0
- Standard options > Max = 100
- Standard options > Decimals = 0
- Standard options > Display Name = ${__field.displayName}
  Needed if there is only one repository, to show the name under the bar.
- Thresholds
- 90 = red
- 75 = Yellow
- base = green
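The query behind this panel is just "used space as a percent of total"; the same arithmetic with hypothetical byte counts:

```shell
total_bytes=1000000000000     # hypothetical 1 TB repository
free_bytes=160000000000       # hypothetical 160 GB free
# (total - free) / (total / 100), same shape as the PromQL query
used_percent=$(( (total_bytes - free_bytes) / (total_bytes / 100) ))
echo "$used_percent"          # 84
```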
This panel is a table with more details about jobs.
- Visualization = Table
- Data source = Prometheus
- Query, switch from builder to code
veeam_job_result_info{job="veeam_job_report"}
- Query options > Format = Table
- This results in a table where each job's last result is shown, plus labels and their values.
  One could start cleaning it up with a Transform, but other metrics are still missing and the time fields are absolute values instead of x minutes/hours ago. So before cleaning, more mess will be added.
- Rename the original query from A to result.
  This renaming will be used in all following queries so that the fields are distinguishable in the transformation later.
- Create the following queries; the first line is the new name, the second is the query code itself. Every query has Options > Type set to Table.

data_size
veeam_job_data_size_bytes{job="veeam_job_report"}
backup_size
veeam_job_backup_size_bytes{job="veeam_job_report"}
restore_points
veeam_job_restore_points_total{job="veeam_job_report"}
job_runtime
veeam_job_end_time_timestamp_seconds{job="veeam_job_report"} - veeam_job_start_time_timestamp_seconds{job="veeam_job_report"}
last_job_run
time()-last_over_time(veeam_job_end_time_timestamp_seconds{job="veeam_job_report"}[30d])
last_report
time()-last_over_time(push_time_seconds{job="veeam_job_report"}[30d])
- Now the results are there in many tables, switchable from a dropdown menu, but they need to be combined into one table.
- Transform > Join by field > Mode = OUTER; Field = instance
- Now there's one long table with a lot of duplication, as every query brought the labels again. Now to clean it up.
- Transform > Organize fields
  - Hide unwanted fields.
    Hiding anything with the number 2, 3, 4, 5, 6, or 7 in its name gets the bulk of it gone.
  - Rename headers for fields that are kept.
  - Reorder with drag and drop.
- Panel options > Title = Job's Details
- Thresholds > delete whatever is there; set Base to be transparent
- Now the table will be modified using overrides, so that columns can be targeted separately.
- Overrides
- Fields with name matching regex = /Last Run|Runtime|Last Report/
  Standard options > Unit = seconds (s)
  Standard options > Decimals = 0
- Fields with name matching regex = /Data Size|Backup Size/
  Standard options > Unit = bytes(SI)
- Fields with name = Result > Value mappings
  - Value Mapping:
    - 0 = Successful; Green
    - 1 = Warning; Yellow
    - 2 = Failed; Red
    - -1 = Running; Blue
    - -11 = Full Backup; Purple
    - 99 = Disabled | Unscheduled; Grey
  - the colors should be muted by transparency ~0.4
  - Cell options > Cell type = Colored background, Gradient
- Fields with name = Group > Value mappings
  - Value Mapping:
    - 0 = water; Green
    - 1 = CocaCola; Yellow
    - 2 = beer; Red
  - the colors should be muted by transparency ~0.3
  - Cell options > Cell type = Colored background, Gradient
- Save and look.
- Adjusting column width will create overrides for that column.
  Just be aware, as it might be weird seeing like 12 overrides afterwards.
Grafana alerts help with reliability and the danger of a failure going unnoticed. Especially considering the dynamic nature of this setup: if reporting stops for any reason, after some time there is no indication that a job even existed, let alone failed.
Before getting to alerts, first the delivery mechanism and policy.
Grafana > Alerting > Contact points
Just needs the smtp stuff correctly set in the .env file for grafana, as can be seen in the setup section.
The contact point already exists, named grafana-default-email.
Whether it actually works can be tested when editing the contact point.
Push notifications for a phone or desktop using selfhosted ntfy.
Detailed setup of running ntfy as a docker container
here.
- New contact point
  - Name = ntfy
  - Integration = Webhook
  - URL = https://ntfy.example.com/veeam
    or, if grafana-to-ntfy is already set up on the same docker network, then URL = http://grafana-to-ntfy:8080
  - plain ntfy does not need credentials; grafana-to-ntfy needs the ones from its .env file set
  - Disable resolved message = check
  - Test
  - Save
An issue I noticed now in testing with ntfy is that with multiple failures it won't deliver. It could be solved by not letting it send the complex grafana json full of dynamic values, but just some generic static text about a failure.
Will eventually look into it, or report it to the dev.
Editing the Default policy, making sure the contact point is the correct one, is enough if just one contact point is planned to be used, like just email.
Of note are the Timing options inside a policy, which set how often a firing alert will resend its notification. The default is 4h, +5m for the group interval.
To fire notifications on multiple contact points, for alerts in the veeam_alerts folder:
- Within the Default policy, add + New nested policy.
- Matching labels: grafana_folder = veeam_alerts
  - Select Contact point = grafana-default-email
  - Enable Continue matching subsequent sibling nodes,
    which means that after matching, it will continue to look for other policies that would also match.
Which means that after matching, it will continue to look for other policies that would also match - Do the same again for a new nested policy, but use contact point to
ntfi
.
The Default policy is applied only if no other policy fits.
Currently these alerts are not long-term tested.
They should work, but should be considered as in development.
- 1 Set an alert rule name
  - Rule name = veeam_backup_failed_or_warning
- 2 Set a query and alert condition
- A - Prometheus; set Last 2d
- Options > Min step = 15m
- switch from builder to code
veeam_job_result_info{job="veeam_job_report"}
- B - Reduce
- Function = Last
- Input = A
- Mode = Strict
- C - Threshold
- Input = B
- is within range 0 to 3 (it's not inclusive)
- Make this the alert condition
- 3 Alert evaluation behavior
- Folder = "veeam_alerts"
- Evaluation group (interval) = "one_hour"
- Evaluation interval = 1h
- For = 0s
- Configure no data and error handling
- Alert state if no data or all values are null = OK
- 4 Add details for your alert rule
- Metrics labels can be used here
- 5 Notifications
- nothing
- Save and exit
- 1 Set an alert rule name
  - Rule name = veeam_repo_full
- 2 Set a query and alert condition
- A - Prometheus; set Last 2d
- Options > Min step = 15m
- switch from builder to code
(veeam_repo_total_size_bytes{job="veeam_repo_report"} - veeam_repo_free_space_bytes{job="veeam_repo_report"}) / ((veeam_repo_total_size_bytes{job="veeam_repo_report"}) /100)
- B - Reduce
- Function = Last
- Input = A
- Mode = Strict
- C - Threshold
- Input = B
- is above 84
- Make this the alert condition
- 3 Alert evaluation behavior
- Folder = "veeam_alerts"
- Evaluation group (interval) = "one_hour"
- Evaluation interval = 1h
- For = 0s
- Configure no data and error handling
- Alert state if no data or all values are null = OK
- 4 Add details for your alert rule
- Metrics labels can be used here
- 5 Notifications
- nothing
- Save and exit
- 1 Set an alert rule name
  - Rule name = veeam_noreport_five_days
- 2 Set a query and alert condition
- A - Prometheus; set Last 30 days (now-30d to now)
- switch from builder to code
time()-last_over_time(push_time_seconds{job="veeam_job_report"}[30d])
- B - Reduce
- Function = Last
- Input = A
- Mode = Strict
- C - Threshold
- Input = B
- is above 432000
- Make this the alert condition
- 3 Alert evaluation behavior
- Folder = "veeam_alerts"
- Evaluation group (interval) = "twelve_hours"
- Evaluation interval = 12h
- For = 0s
- Configure no data and error handling
- Alert state if no data or all values are null = Error
- 4 Add details for your alert rule
- nothing
- 5 Notifications
- nothing
- Save and exit
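The threshold 432000 used above is just five days expressed in seconds:

```shell
# 5 days * 24 hours * 60 minutes * 60 seconds
five_days_seconds=$(( 5 * 24 * 60 * 60 ))
echo "$five_days_seconds"   # 432000
```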
- 1 Set an alert rule name
  - Rule name = veeam_nobackup_five_days
- 2 Set a query and alert condition
- A - Prometheus; set Last 30 days (now-30d to now)
- switch from builder to code
time()-last_over_time(veeam_job_end_time_timestamp_seconds{job="veeam_job_report"}[30d])
- B - Reduce
- Function = Last
- Input = A
- Mode = Strict
- C - Threshold
- Input = B
- is above 432000
- Make this the alert condition
- 3 Alert evaluation behavior
- Folder = "veeam_alerts"
- Evaluation group (interval) = "twelve_hours"
- Evaluation interval = 12h
- For = 0s
- Configure no data and error handling
- Alert state if no data or all values are null = Error
- 4 Add details for your alert rule
- Metrics labels can be used here
- 5 Notifications
- nothing
- Save and exit