Proxmox Backup Validation

Automated test restores of Proxmox Backup Server (PBS) backups into isolated VMs. For each VM it restores the latest backup, boots it on an isolated bridge, validates boot → systemd services → TCP ports → HTTP endpoints, sends a Telegram report, and destroys the temporary VM.

Service checks (TCP/HTTP) run inside the guest via the QEMU guest agent, so they work even on a fully isolated bridge with no route and no DHCP.

Designed for clusters without shared storage (no Ceph): restores pull from PBS, so a single "test node" can validate VMs from any node in the cluster.

Features

🔁 Latest-snapshot test restore per VM, then automatic teardown
🧪 Validates boot, systemd units, listening ports and HTTP endpoints
🔍 Service auto-discovery from a signature library (+ manual overrides)
🔒 Isolated bridge, firewall disabled on the test NIC, NIC config preserved
🛡️ Single-run lock (flock) and cleanup on interruption (INT/TERM)
📨 Telegram notifications (success / failure / summary)
📸 On boot failure, a console screenshot is attached to the Telegram alert (lets you see GRUB / kernel panic / fsck / emergency shell at a glance)
🗂️ Per-run, per-VM log files

How it works

For each VMID listed in backup_validation.conf, the script runs this flow on the PVE node:

backup_validation.conf                 PBS datastore (storage "pbs")
  105 ; auto   ;        ─┐
  110 ; hybrid ; ...     │   per VMID
  210 ; manual ; ...     │
                         ▼
       ┌─────────────────────────────────┐
       │ 1. List backups in PBS          │  pvesh get .../storage/pbs/content
       │ 2. Filter: VMID, content=backup │  keep vmid == 105
       │ 3. Pick the NEWEST (by ctime)   │  → volid of the latest snapshot
       └────────────────┬────────────────┘
                        ▼
       ┌─────────────────────────────────┐
       │ 4. Restore to a TEMP VMID       │  qmrestore <volid> 9105 --unique
       │    (prefix+VMID, e.g. 105→9105) │
       └────────────────┬────────────────┘
                        ▼
       ┌─────────────────────────────────┐
       │ 5. Isolate NIC → vmbr99         │  swap bridge, firewall off
       │ 6. Boot + wait for guest agent  │
       └────────────────┬────────────────┘
                        ▼
       ┌─────────────────────────────────┐
       │ 7. Validate INSIDE the guest    │  systemd / TCP / HTTP / custom
       │    boot, services, ports        │
       └────────────────┬────────────────┘
                        ▼
       ┌─────────────────────────────────┐
       │ 8. Telegram report (OK / FAIL)  │  + console screenshot on boot fail
       │ 9. Destroy the temp VM          │  qm destroy 9105 --purge
       └─────────────────────────────────┘

Key points:

The source production VM is only read — its newest PBS backup is restored into a separate temporary VMID (TEMP_VMID_PREFIX + the original VMID, e.g. 105 → 9105). Production is never touched.
"Newest" is decided by the backup's creation time (ctime); older snapshots are not tested.
The temporary VM is always destroyed at the end (and on interruption). See Safety.
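
Steps 1 to 3 of the flow (list, filter by VMID, pick the newest by ctime) can be sketched as below. This is a minimal sketch, not the script's actual code: the function name `newest_volid` is hypothetical, and it assumes each entry of the JSON listing carries `volid`, `vmid`, `content` and `ctime` fields.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a storage content listing (JSON array) on stdin
# and print the volid of the newest backup that belongs to the given VMID.
newest_volid() {
    local vmid=$1
    python3 -c '
import json, sys

vmid = int(sys.argv[1])
entries = json.load(sys.stdin)
# Keep only backup entries that belong to this VMID
backups = [e for e in entries
           if e.get("content") == "backup" and int(e.get("vmid", -1)) == vmid]
if not backups:
    sys.exit(1)  # no backup found for this VMID
# "Newest" means the highest creation time (ctime, epoch seconds)
print(max(backups, key=lambda e: e["ctime"])["volid"])
' "$vmid"
}
```

On a node it might be fed like this (the storage name `pbs` comes from the diagram; using `$(hostname)` as the node name is an assumption): `pvesh get "/nodes/$(hostname)/storage/pbs/content" --output-format json | newest_volid 105`.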

Requirements

On the PVE node (where the script runs):

which qm qmrestore pvesh python3 flock curl
apt install -y curl    # if curl is missing (used for Telegram)

# Optional: only needed to convert console screenshots to PNG on boot failures,
# and only if your QEMU can't write PNG directly (PVE 8+ usually can).
apt install -y netpbm  # provides pnmtopng (or use imagemagick's `convert`)
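
The manual `which` check above can also be scripted. The helper below is a small sketch with a hypothetical name (`missing_tools`); it is not taken from backup_validation.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every command from the argument list that is not
# on PATH; prints nothing when all of them are available.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Example: `missing=$(missing_tools qm qmrestore pvesh python3 flock curl); [ -z "$missing" ] || echo "Missing: $missing"`.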

On each Linux guest to be validated:

apt install -y qemu-guest-agent
systemctl enable --now qemu-guest-agent
# For HTTP checks, the guest also needs curl or wget (optional)

On the PVE node, make sure the agent is enabled in the VM config:

qm set <VMID> --agent enabled=1

1. Create the isolated bridge

Edit /etc/network/interfaces:

auto vmbr99
iface vmbr99 inet manual
    bridge-ports none
    bridge-stp off
    bridge-fd 0
    # No gateway, no route — fully isolated

Apply and verify:

ifreload -a
ip link show vmbr99

DHCP is NOT required. Since checks run inside the guest, the test VM does not need an IP on the isolated bridge. The guest IP, when available, is only shown in the report ("Test IP"). If you want an IP shown, you can optionally run dnsmasq on vmbr99 — but it is purely cosmetic.

2. Install the script and config

The script reads everything from /root by default:

cp backup_validation.sh /root/
chmod +x /root/backup_validation.sh

cp backup_validation.conf.example /root/backup_validation.conf         # edit it
cp backup_validation.env.example  /root/backup_validation.env          # edit it
chmod 600 /root/backup_validation.env                                  # contains secrets

Save all files with LF (Unix) line endings. CRLF is tolerated for the .conf/.env, but the .sh must be LF or the shebang breaks.

3. Configure secrets (Telegram)

Secrets and per-node overrides live in backup_validation.env (sourced at runtime), not in the script:

TELEGRAM_TOKEN="your_token_here"
TELEGRAM_CHAT_ID="your_chat_id_here"
# Optional — send to a specific topic of a forum supergroup:
#TELEGRAM_THREAD_ID="123"

TELEGRAM_THREAD_ID is the message_thread_id of a topic inside a Telegram group with topics (forum supergroup). Leave it empty to post to the main chat.

Any default from the script can be overridden here (storage names, timeouts, paths, bridge, etc.). Leave the Telegram variables empty to disable notifications.

4. Configure the VM list

Edit /root/backup_validation.conf. One VM per line, ;-separated:

# vmid ; mode   ; overrides
105    ; auto   ;
110    ; hybrid ; cloudflared:0
210    ; manual ; myapp:8080
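
To make the line format concrete, here is one way such a line could be split into its three fields. It is only a sketch: `parse_conf_line` is a hypothetical name, and treating everything after `#` as a comment is an assumption based on the header line of the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split one "vmid ; mode ; overrides" line into three
# whitespace-free fields, printed one per line.
# Returns 1 for blank lines, comment lines and lines without a numeric VMID.
parse_conf_line() {
    local line=${1%$'\r'}          # tolerate CRLF line endings
    line=${line%%#*}               # assumption: "#" starts a comment
    [[ $line =~ [^[:space:]] ]] || return 1
    local vmid mode overrides
    IFS=';' read -r vmid mode overrides <<<"$line"
    # Remove the padding spaces around each field
    vmid=${vmid//[[:space:]]/}
    mode=${mode//[[:space:]]/}
    overrides=${overrides//[[:space:]]/}
    [[ $vmid =~ ^[0-9]+$ ]] || return 1
    printf '%s\n%s\n%s\n' "$vmid" "$mode" "$overrides"
}
```

For the line `110    ; hybrid ; cloudflared:0` this prints `110`, `hybrid` and `cloudflared:0` on separate lines.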

5. Run

# Full cycle (all VMs in the config file)
/root/backup_validation.sh

# One-off test of a single VM
/root/backup_validation.sh --vmid 105
/root/backup_validation.sh --vmid 105 --mode hybrid --overrides "cloudflared:0"
/root/backup_validation.sh --vmid 105 --mode manual --overrides "myapp:8080"

Logs are written to /var/log/backup_validation/<MM-DD-YY_HHhMM>/:

/var/log/backup_validation/06-09-26_02h00/
  ├── _cycle.log        # cycle-level: pre-checks, summary
  ├── 105/105.log       # per-VM test log
  └── 110/110.log

6. Schedule via cron

# Weekly, Sundays at 02:00
cat > /etc/cron.d/backup-validation <<'EOF'
0 2 * * 0 root /root/backup_validation.sh
EOF

Operating modes

| Mode | Use | Example line |
|------|-----|--------------|
| `auto` | Default services from the signature library | `100;auto;` |
| `hybrid` | Auto-discovery + custom services | `101;hybrid;api:8080` |
| `manual` | Custom services only | `102;manual;myapp:9000` |

Override format: service:port,service:port. Use port 0 to skip the port/HTTP check and validate only the systemd unit (e.g. cloudflared:0).
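
The override format can be illustrated with a small parser. This is a sketch under assumptions: `describe_overrides` is a hypothetical name and the output wording is invented for the example; only the `service:port` syntax and the meaning of port 0 come from this README.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expand "svc:port,svc:port" into one line per service,
# stating which checks the port value implies.
describe_overrides() {
    local spec=$1 item svc port
    local -a items
    IFS=',' read -r -a items <<<"$spec"
    for item in "${items[@]}"; do
        svc=${item%%:*}     # text before the first colon
        port=${item#*:}     # text after the first colon
        if [[ -z $svc || ! $port =~ ^[0-9]+$ ]]; then
            echo "invalid override: $item" >&2
            return 1
        fi
        if (( 10#$port == 0 )); then
            echo "$svc systemd-only"           # port 0: skip the port/HTTP check
        else
            echo "$svc systemd+port:$port"
        fi
    done
}
```

Calling `describe_overrides "cloudflared:0,myapp:8080"` prints `cloudflared systemd-only` and `myapp systemd+port:8080`.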

Built-in signature library

| Service | Check |
|---------|-------|
| Apache (`apache2`/`httpd`) | HTTP/80 |
| Nginx | HTTP/80 |
| PostgreSQL | TCP/5432 |
| MariaDB/MySQL | TCP/3306 |
| Redis | TCP/6379 |
| MongoDB | TCP/27017 |
| ClickHouse | TCP/9000 |
| Docker | systemd only |
| Elasticsearch | HTTP/9200 |
| RabbitMQ | TCP/5672 |
| cloudflared | systemd + log scan |

How checks work

Each signature has the form PORT:PROTOCOL:ENDPOINT. The PROTOCOL field decides which validation runs after the always-on systemd check:

| Protocol | Validation | What it checks |
|----------|------------|----------------|
| (any) | `check_systemd` | The unit reports `active` (always runs first) |
| `tcp` | `check_tcp_guest` | A socket is in `LISTEN` on the port (`ss -ltn`, inside the guest) |
| `http` / `https` | `check_http_guest` | A request to `127.0.0.1:PORT` returns a valid HTTP code |
| `none` | (none) | systemd only (no port to probe, e.g. Docker) |
| `custom_*` | custom function | Service-specific logic (see below) |

All network checks run inside the guest via the QEMU guest agent, so they work regardless of guest networking, DHCP or routing.
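
The mapping from a signature to its validation can be sketched as a simple dispatch on the PROTOCOL field. The function name `validator_for` and the signature strings used to call it are illustrative assumptions; the validation names are the ones listed in this section.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a "PORT:PROTOCOL:ENDPOINT" signature to the
# validation that runs after the systemd check.
validator_for() {
    local sig=$1 port proto endpoint
    IFS=':' read -r port proto endpoint <<<"$sig"
    case "$proto" in
        tcp)        echo "check_tcp_guest $port" ;;
        http|https) echo "check_http_guest $proto://127.0.0.1:$port$endpoint" ;;
        none)       echo "systemd-only" ;;
        custom_*)   echo "check_${proto#custom_}" ;;   # custom_foo -> check_foo
        *)          echo "unknown protocol: $proto" >&2; return 1 ;;
    esac
}
```

For example, `validator_for "0:custom_myservice:"` prints `check_myservice`, and `validator_for "80:http:/"` prints `check_http_guest http://127.0.0.1:80/`.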

Failure diagnostics

When a check fails, the report tells you what went wrong, not just that it did:

HTTP: the actual status code is shown in the alert (e.g. ✗ HTTP 80 (nginx) [500]) and appended to the reason (HTTP_FAIL_nginx_80_500).
systemd: the unit state and reason are shown (e.g. ✗ systemd: nginx [failed — ActiveState=failed SubState=failed Result=exit-code]), and the full systemctl status + last 15 journal lines are written to the per-VM log.
Boot: a console screenshot is attached to the Telegram alert (see Features).

Why cloudflared is a special case

cloudflared (Cloudflare Tunnel) opens no local listening port — it makes outbound connections to Cloudflare's edge. So TCP/HTTP probing makes no sense. Instead, check_cloudflared inspects journalctl -u cloudflared and counts only critical errors. Because the test VM runs on an isolated bridge with no internet, connectivity errors are expected and are filtered out (network is unreachable, no such host, dial tcp, context deadline) — only real problems (bad config, invalid credentials, parse errors) cause a failure.
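
The filtering idea can be sketched as a two-stage grep over the journal text. This is not the script's `check_cloudflared`: the helper name `count_critical_errors` is hypothetical, and recognising error lines by the words `ERR` or `error` is an assumption. The four ignored patterns are the ones listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read journal lines on stdin and print how many look like
# real errors once the expected connectivity noise has been removed.
count_critical_errors() {
    # Stage 1 (assumption): an error line contains "ERR" or "error" as a word.
    # Stage 2: drop the connectivity errors that an isolated bridge always causes.
    # Stage 3: count what is left; "|| true" keeps the exit status 0 when the count is 0.
    grep -iE '(^|[^[:alnum:]])(ERR|error)([^[:alnum:]]|$)' \
        | grep -viE 'network is unreachable|no such host|dial tcp|context deadline' \
        | grep -c . || true
}
```

A caller would treat any count above zero as a failure, for instance by piping the output of `journalctl -u cloudflared` (captured inside the guest) into the function.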

Adding your own custom check

The custom_* protocol is an extension hook. To validate a service that doesn't fit the port/HTTP model (a queue worker, a backup daemon, etc.):

Add a signature with a custom_<name> protocol (port 0):
```
["myservice"]="0:custom_myservice:"
```
Write a check_myservice() function (use guest_exec to run commands inside the guest, mirroring check_cloudflared).

Route it in step 7.2 of process_vm():

case "$proto" in
    custom_cloudflared) ... ;;
    custom_myservice)
        if check_myservice "$temp_vmid"; then
            checks="${checks}  ✓ My check (${svc})\n"
        else
            checks="${checks}  ✗ My check (${svc})\n"
            final_result="FAIL"; failure_reason="MYSERVICE_FAIL_${svc}"
        fi
        ;;
esac

Safety: how it avoids touching production

The whole safety model rests on VMID separation. The script never deletes a production VM because:

Production and test IDs never cross. The source VMID (e.g. 105) is only read (its PBS backup is restored). The restore writes to a separate temporary VMID built as TEMP_VMID_PREFIX + the original VMID (e.g. 105 → 9105). qm stop and qm destroy are only ever called with the temporary VMID.
The source VM is never started, stopped or destroyed — only its backup is read.
It refuses to reuse an existing VMID. Before restoring, it checks whether the temporary VMID already exists; if so it aborts/skips (TEMP_VMID_BUSY) instead of overwriting or deleting. So it only ever destroys a VM it created itself at a free VMID.
Hard guard in cleanup. cleanup_vm refuses to destroy any VMID that does not match the temp scheme (TEMP_VMID_PREFIX + a real VMID) — even a bug or misconfiguration cannot make it touch a production ID.
Visible marker. The test VM is tagged backup-validation-temp with a "safe to delete" description, so it is obvious in qm list / the web UI.

⚠️ Note: the restored test VM inherits the production VM's name (the full config is restored), so the name is not a safe discriminator; only the VMID is. Make sure the prefixed IDs cannot collide with real VMs: with the default prefix 9, for example, avoid having a production VM 9105 while testing 105. Keeping production VMIDs in the 100–999 range avoids this, because the temporary IDs then fall in 9100–9999.
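
The VMID scheme and the cleanup guard can be sketched as follows. This is a simplified illustration, not the script's `cleanup_vm`: the function names are hypothetical, and "a real VMID" is approximated here as any number of three or more digits without a leading zero.

```shell
#!/usr/bin/env bash
TEMP_VMID_PREFIX=${TEMP_VMID_PREFIX:-9}   # default prefix mentioned in this README

# Build the temporary VMID by prepending the prefix to the source VMID
temp_vmid_for() {
    printf '%s%s\n' "$TEMP_VMID_PREFIX" "$1"
}

# Hypothetical guard: succeed only if the VMID looks like prefix + source VMID
is_temp_vmid() {
    local vmid=$1 rest
    [[ $vmid =~ ^[0-9]+$ && $vmid == "$TEMP_VMID_PREFIX"* ]] || return 1
    rest=${vmid#"$TEMP_VMID_PREFIX"}      # what remains must be a plausible VMID
    [[ $rest =~ ^[1-9][0-9]{2,}$ ]]
}
```

With the default prefix, `temp_vmid_for 105` prints `9105`; `is_temp_vmid 9105` succeeds, while `is_temp_vmid 105` and `is_temp_vmid 950` fail. A pattern check like this cannot tell a temporary 9105 from a production 9105, which is why the collision warning above matters.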

Troubleshooting

Guest agent not responding

Check systemctl status qemu-guest-agent inside the VM
Confirm qm set <VMID> --agent enabled=1 on the original VM

HTTP check fails but the service is up

The check runs against 127.0.0.1 inside the guest. If the service binds to a specific interface (not 0.0.0.0/localhost), use a tcp override instead — the TCP check uses ss -ltn and matches any listen address.
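
Matching "any listen address" amounts to comparing only the port part of the local address column in the `ss -ltn` output. A minimal sketch, with a hypothetical helper name (`port_is_listening`) and the assumption that the output has already been captured from the guest:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ss -ltn` output on stdin and succeed if any
# LISTEN socket uses the given port, whatever address it is bound to.
port_is_listening() {
    local port=$1
    awk -v port="$port" '
        $1 == "LISTEN" {
            n = split($4, parts, ":")        # column 4 is Local Address:Port
            if (parts[n] == port) found = 1  # last piece is the port
        }
        END { exit (found ? 0 : 1) }
    '
}
```

A service bound to `10.0.0.5:8080` or `[::]:8080` passes `port_is_listening 8080`, even though a request to 127.0.0.1 would be refused in the first case.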

Temporary VMID busy

Look for leftovers: qm list | grep -E '^ *9[0-9]{3}' (matches the default prefix 9; adjust the pattern to your TEMP_VMID_PREFIX)
Clean up manually: qm destroy <VMID> --purge 1

Restore fails

Check destination storage: pvesm status
Check free space: df -h

Another run in progress

A flock prevents overlapping runs. If a previous run is still going (long restore), the new one aborts. Lock file: /var/lock/backup_validation.lock.
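
The usual way to build such a single-run lock in bash is to hold an open file descriptor on the lock file and request a non-blocking exclusive lock on it. The sketch below shows that pattern; the function name `acquire_run_lock` and the choice of descriptor 9 are assumptions, not details taken from the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: take an exclusive, non-blocking lock on the given file.
# File descriptor 9 stays open for the life of the shell, so the kernel
# releases the lock automatically when the script exits, however it exits.
acquire_run_lock() {
    local lock_file=$1
    exec 9>"$lock_file" || return 1
    flock -n 9          # fails immediately if another process holds the lock
}
```

A script would call it once at startup, for example `acquire_run_lock /var/lock/backup_validation.lock || { echo "Another run is in progress" >&2; exit 1; }`.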

Author

Tobias Pandolfo (@Tobidp) — LinkedIn
