Automated test restores of Proxmox Backup Server (PBS) backups into isolated VMs. For each VM it restores the latest backup, boots it on an isolated bridge, validates boot → systemd services → TCP ports → HTTP endpoints, sends a Telegram report, and destroys the temporary VM.
Service checks (TCP/HTTP) run inside the guest via the QEMU guest agent, so they work even on a fully isolated bridge with no route and no DHCP.
Designed for clusters without shared storage (no Ceph): restores pull from PBS, so a single "test node" can validate VMs from any node in the cluster.
- 🔁 Latest-snapshot test restore per VM, then automatic teardown
- 🧪 Validates boot, systemd units, listening ports and HTTP endpoints
- 🔍 Service auto-discovery from a signature library (+ manual overrides)
- 🔒 Isolated bridge, firewall disabled on the test NIC, NIC config preserved
- 🛡️ Single-run lock (`flock`) and cleanup on interruption (INT/TERM)
- 📨 Telegram notifications (success / failure / summary)
- 📸 On boot failure, a console screenshot is attached to the Telegram alert (lets you see GRUB / kernel panic / fsck / emergency shell at a glance)
- 🗂️ Per-run, per-VM log files
For each VMID listed in backup_validation.conf, the script runs this flow on
the PVE node:
backup_validation.conf PBS datastore (storage "pbs")
105 ; auto ; ─┐
110 ; hybrid ; ... │ per VMID
210 ; manual ; ... │
▼
┌─────────────────────────────────┐
│ 1. List backups in PBS │ pvesh get .../storage/pbs/content
│ 2. Filter by VMID, content=backup│ keep vmid == 105
│ 3. Pick the NEWEST (by ctime) │ → volid of the latest snapshot
└────────────────┬────────────────┘
▼
┌─────────────────────────────────┐
│ 4. Restore to a TEMP VMID │ qmrestore <volid> 9105 --unique
│ (prefix + VMID, e.g. 105→9105)│
└────────────────┬────────────────┘
▼
┌─────────────────────────────────┐
│ 5. Isolate NIC → vmbr99 │ swap bridge, firewall off
│ 6. Boot + wait for guest agent │
└────────────────┬────────────────┘
▼
┌─────────────────────────────────┐
│ 7. Validate INSIDE the guest │ systemd / TCP / HTTP / custom
│ boot, services, ports │
└────────────────┬────────────────┘
▼
┌─────────────────────────────────┐
│ 8. Telegram report (OK / FAIL) │ + console screenshot on boot fail
│ 9. Destroy the temp VM │ qm destroy 9105 --purge
└─────────────────────────────────┘
Key points:
- The source production VM is only read: its newest PBS backup is restored into a separate temporary VMID (`TEMP_VMID_PREFIX` + the original VMID, e.g. 105→9105). Production is never touched.
- "Newest" is decided by the backup's creation time (`ctime`); older snapshots are not tested.
- The temporary VM is always destroyed at the end (and on interruption). See Safety.
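Steps 2 to 4 of the flow can be sketched as follows. This is a minimal illustration, not the script's actual code: `temp_vmid` and `newest_volid` are hypothetical helper names, the prefix is assumed to be concatenated as a string, and the JSON field names (`volid`, `ctime`, `vmid`, `content`) are assumed to match what the PVE storage content listing returns.

```shell
# Hypothetical helper: build the temporary VMID as PREFIX + source VMID (string concat).
temp_vmid() {
  printf '%s%s\n' "${TEMP_VMID_PREFIX:-9}" "$1"
}

# Hypothetical helper: read a JSON array of storage content on stdin and
# print the volid of the newest backup (highest ctime) for the given VMID.
# Exits non-zero if the VMID has no backup in the listing.
newest_volid() {
  python3 -c '
import json, sys
vmid = int(sys.argv[1])
items = [i for i in json.load(sys.stdin)
         if i.get("content") == "backup" and int(i.get("vmid", -1)) == vmid]
if not items:
    sys.exit(1)
print(max(items, key=lambda i: i["ctime"])["volid"])
' "$1"
}
```

On a PVE node, something like `pvesh get /nodes/$(hostname)/storage/pbs/content --content backup --output-format json | newest_volid 105` would then print the volid to hand to `qmrestore`, with `temp_vmid 105` giving the target ID.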
On the PVE node (where the script runs):
which qm qmrestore pvesh python3 flock curl
apt install -y curl # if curl is missing (used for Telegram)
# Optional: only needed to convert console screenshots to PNG on boot failures,
# and only if your QEMU can't write PNG directly (PVE 8+ usually can).
apt install -y netpbm # provides pnmtopng (or use imagemagick's `convert`)

On each Linux guest to be validated:
apt install -y qemu-guest-agent
systemctl enable --now qemu-guest-agent
# For HTTP checks, the guest also needs curl or wget (optional)

On the PVE node, make sure the agent is enabled in the VM config:
qm set <VMID> --agent enabled=1

Edit /etc/network/interfaces:
auto vmbr99
iface vmbr99 inet manual
bridge-ports none
bridge-stp off
bridge-fd 0
# No gateway, no route — fully isolated
Apply and verify:
ifreload -a
ip link show vmbr99

DHCP is NOT required. Since checks run inside the guest, the test VM does not need an IP on the isolated bridge. The guest IP, when available, is only shown in the report ("Test IP"). If you want an IP shown, you can optionally run dnsmasq on vmbr99, but it is purely cosmetic.
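To illustrate what "swap bridge, firewall off" means for a NIC definition, the sketch below rewrites a `netN` value so that only the bridge and firewall fields change and everything else (model, MAC, VLAN tag) is kept. `isolate_net_config` is a hypothetical helper, not the script's own function, and it assumes the model/MAC field comes first, as it does in `qm config` output.

```shell
# Hypothetical helper: rewrite a netN value for the isolated bridge.
#   $1 = current value, e.g. "virtio=BC:24:11:AA:BB:CC,bridge=vmbr0,firewall=1"
#   $2 = isolated bridge name (defaults to vmbr99)
isolate_net_config() {
  printf '%s\n' "$1" | sed -E \
    -e "s/(^|,)bridge=[^,]*/\1bridge=${2:-vmbr99}/" \
    -e 's/(^|,)firewall=[^,]*//' \
    -e 's/$/,firewall=0/'
}

# Possible use on the PVE node (9105 is the temporary VMID from the example):
#   current=$(qm config 9105 | sed -n 's/^net0: //p')
#   qm set 9105 --net0 "$(isolate_net_config "$current")"
```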
The script reads everything from /root by default:
cp backup_validation.sh /root/
chmod +x /root/backup_validation.sh
cp backup_validation.conf.example /root/backup_validation.conf # edit it
cp backup_validation.env.example /root/backup_validation.env # edit it
chmod 600 /root/backup_validation.env # contains secrets

Save all files with LF (Unix) line endings. CRLF is tolerated for the .conf/.env, but the .sh must be LF or the shebang breaks.
Secrets and per-node overrides live in backup_validation.env (sourced at
runtime), not in the script:
TELEGRAM_TOKEN="your_token_here"
TELEGRAM_CHAT_ID="your_chat_id_here"
# Optional — send to a specific topic of a forum supergroup:
#TELEGRAM_THREAD_ID="123"

TELEGRAM_THREAD_ID is the message_thread_id of a topic inside a Telegram group with topics (forum supergroup). Leave it empty to post to the main chat.
Any default from the script can be overridden here (storage names, timeouts, paths, bridge, etc.). Leave Telegram empty to disable notifications.
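As a rough sketch of how these three variables could map onto a Telegram Bot API `sendMessage` call: `telegram_send` is a hypothetical function name and the `CURL` variable is only an illustrative hook for dry runs; neither is taken from the script.

```shell
# Hypothetical helper: send a text message, honouring the optional topic ID.
telegram_send() {
  tg_text=$1
  # Empty token or chat ID means notifications are disabled.
  if [ -z "${TELEGRAM_TOKEN:-}" ] || [ -z "${TELEGRAM_CHAT_ID:-}" ]; then
    return 0
  fi
  set -- -sS --max-time 20 \
    --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
    --data-urlencode "text=${tg_text}"
  # message_thread_id is only added when a forum topic is configured.
  if [ -n "${TELEGRAM_THREAD_ID:-}" ]; then
    set -- "$@" --data-urlencode "message_thread_id=${TELEGRAM_THREAD_ID}"
  fi
  # CURL is an assumed override for dry runs; it defaults to the real curl.
  ${CURL:-curl} "$@" "https://api.telegram.org/bot${TELEGRAM_TOKEN}/sendMessage"
}
```

After sourcing `/root/backup_validation.env`, calling `telegram_send "backup validation test"` is a quick way to confirm the token, chat ID and topic ID before the first scheduled run.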
Edit /root/backup_validation.conf. One VM per line, ;-separated:
# vmid ; mode ; overrides
105 ; auto ;
110 ; hybrid ; cloudflared:0
210 ; manual ; myapp:8080
# Full cycle (all VMs in the config file)
/root/backup_validation.sh
# One-off test of a single VM
/root/backup_validation.sh --vmid 105
/root/backup_validation.sh --vmid 105 --mode hybrid --overrides "cloudflared:0"
/root/backup_validation.sh --vmid 105 --mode manual --overrides "myapp:8080"

Logs are written to /var/log/backup_validation/<MM-DD-YY_HHhMM>/:
/var/log/backup_validation/06-09-26_02h00/
├── _cycle.log # cycle-level: pre-checks, summary
├── 105/105.log # per-VM test log
└── 110/110.log
# Weekly, Sundays at 02:00
cat > /etc/cron.d/backup-validation <<'EOF'
0 2 * * 0 root /root/backup_validation.sh
EOF

| Mode | Use | Example line |
|---|---|---|
| `auto` | Default services from the signature library | `100;auto;` |
| `hybrid` | Auto-discovery + custom services | `101;hybrid;api:8080` |
| `manual` | Custom services only | `102;manual;myapp:9000` |
Override format: service:port,service:port. Use port 0 to skip the port/HTTP
check and validate only the systemd unit (e.g. cloudflared:0).
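The config line and override formats can be parsed with plain shell string handling. The sketch below is one way to do it; `trim`, `parse_conf_line` and `list_overrides` are hypothetical names, and the script's real parser may differ.

```shell
# Hypothetical helper: strip leading/trailing whitespace (also a CR from CRLF files).
trim() {
  printf '%s' "$1" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# Parse one "vmid ; mode ; overrides" line into VMID, MODE and OVERRIDES.
# Returns 1 for blank lines, comments and lines without a numeric VMID.
parse_conf_line() {
  conf_line=$(trim "$1")
  case "$conf_line" in
    ''|'#'*) return 1 ;;   # blank line or comment
    *';'*) ;;              # needs at least "vmid;mode"
    *) return 1 ;;
  esac
  VMID=$(trim "${conf_line%%;*}")
  conf_rest=${conf_line#*;}
  MODE=$(trim "${conf_rest%%;*}")
  case "$conf_rest" in
    *';'*) OVERRIDES=$(trim "${conf_rest#*;}") ;;
    *)     OVERRIDES= ;;
  esac
  case "$VMID" in ''|*[!0-9]*) return 1 ;; esac
  return 0
}

# Print each "service:port" override as "service port", one per line.
list_overrides() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r pair; do
    pair=$(trim "$pair")
    [ -n "$pair" ] || continue
    printf '%s %s\n' "${pair%%:*}" "${pair##*:}"
  done
}
```

For the line `110 ; hybrid ; cloudflared:0`, this yields `VMID=110`, `MODE=hybrid` and a single override with port `0`, which by the rule above means "systemd unit only".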
| Service | Check |
|---|---|
| Apache (apache2/httpd) | HTTP/80 |
| Nginx | HTTP/80 |
| PostgreSQL | TCP/5432 |
| MariaDB/MySQL | TCP/3306 |
| Redis | TCP/6379 |
| MongoDB | TCP/27017 |
| ClickHouse | TCP/9000 |
| Docker | systemd only |
| Elasticsearch | HTTP/9200 |
| RabbitMQ | TCP/5672 |
| cloudflared | systemd + log scan |
Each signature has the form PORT:PROTOCOL:ENDPOINT. The PROTOCOL field
decides which validation runs after the always-on systemd check:
| Protocol | Validation | What it checks |
|---|---|---|
| (any) | `check_systemd` | The unit reports active (always runs first) |
| `tcp` | `check_tcp_guest` | A socket is in LISTEN on the port (`ss -ltn`, inside the guest) |
| `http` / `https` | `check_http_guest` | A request to 127.0.0.1:PORT returns a valid HTTP code |
| `none` | - | systemd only (no port to probe, e.g. Docker) |
| `custom_*` | custom function | Service-specific logic (see below) |
All network checks run inside the guest via the QEMU guest agent, so they work regardless of guest networking, DHCP or routing.
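The routing in the table can be pictured with a small dispatcher. This is only a sketch: `route_signature` is a hypothetical name, it prints the validation that would run instead of running it, and the sample signatures in the comments (other than the `custom_` form) are assumed values, not entries copied from the script's library.

```shell
# Split "PORT:PROTOCOL:ENDPOINT" and print which validation would follow
# the always-on systemd check.
route_signature() {
  sig_port=${1%%:*}
  sig_rest=${1#*:}
  sig_proto=${sig_rest%%:*}
  sig_endpoint=${sig_rest#*:}
  case "$sig_proto" in
    tcp)        echo "check_tcp_guest $sig_port" ;;
    http|https) echo "check_http_guest $sig_proto://127.0.0.1:$sig_port$sig_endpoint" ;;
    none)       echo "systemd-only" ;;
    custom_*)   echo "check_${sig_proto#custom_}" ;;
    *)          echo "unknown protocol: $sig_proto" >&2; return 1 ;;
  esac
}

# route_signature "5432:tcp:"            prints: check_tcp_guest 5432
# route_signature "80:http:/"            prints: check_http_guest http://127.0.0.1:80/
# route_signature "0:custom_myservice:"  prints: check_myservice
```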
When a check fails, the report tells you what went wrong, not just that it did:
- HTTP: the actual status code is shown in the alert (e.g. `✗ HTTP 80 (nginx) [500]`) and appended to the reason (`HTTP_FAIL_nginx_80_500`).
- systemd: the unit state and reason are shown (e.g. `✗ systemd: nginx [failed — ActiveState=failed SubState=failed Result=exit-code]`), and the full `systemctl status` + last 15 journal lines are written to the per-VM log.
- Boot: a console screenshot is attached to the Telegram alert (see Features).
cloudflared (Cloudflare Tunnel) opens no local listening port — it makes
outbound connections to Cloudflare's edge. So TCP/HTTP probing makes no sense.
Instead, check_cloudflared inspects journalctl -u cloudflared and counts only
critical errors. Because the test VM runs on an isolated bridge with no
internet, connectivity errors are expected and are filtered out
(network is unreachable, no such host, dial tcp, context deadline) — only
real problems (bad config, invalid credentials, parse errors) cause a failure.
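The filtering idea can be sketched with two `grep` passes over the journal text. `count_critical_errors` is a hypothetical name, and the first pattern (what counts as an error line at all) is an assumption; only the four ignored connectivity phrases come from the description above.

```shell
# Read journal lines on stdin and print how many error lines remain after
# dropping the connectivity errors expected on an isolated bridge.
count_critical_errors() {
  grep -iwE 'err|error|fatal' \
    | grep -viE 'network is unreachable|no such host|dial tcp|context deadline' \
    | wc -l | tr -d ' '
}

# Possible use inside the guest (run there via the guest agent):
#   journalctl -u cloudflared --no-pager | count_critical_errors
```

A result of `0` would then mean the unit only logged expected connectivity noise, while anything higher points at a configuration or credentials problem worth reading in the per-VM log.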
The custom_* protocol is an extension hook. To validate a service that doesn't
fit the port/HTTP model (a queue worker, a backup daemon, etc.):
- Add a signature with a `custom_<name>` protocol (port 0):
      ["myservice"]="0:custom_myservice:"
- Write a `check_myservice()` function (use `guest_exec` to run commands inside the guest, mirroring `check_cloudflared`).
- Route it in step 7.2 of `process_vm()`:
      case "$proto" in
        custom_cloudflared) ... ;;
        custom_myservice)
          if check_myservice "$temp_vmid"; then
            checks="${checks} ✓ My check (${svc})\n"
          else
            checks="${checks} ✗ My check (${svc})\n"
            final_result="FAIL"; failure_reason="MYSERVICE_FAIL_${svc}"
          fi ;;
      esac
The whole safety model rests on VMID separation. The script never deletes a production VM because:
- Production and test IDs never cross. The source VMID (e.g. `105`) is only read (its PBS backup is restored). The restore writes to a separate temporary VMID built as `TEMP_VMID_PREFIX` + the original VMID (e.g. 105→9105). `qm stop` and `qm destroy` are only ever called with the temporary VMID.
- The source VM is never started, stopped or destroyed; only its backup is read.
- It refuses to reuse an existing VMID. Before restoring, it checks whether the temporary VMID already exists; if so it aborts/skips (`TEMP_VMID_BUSY`) instead of overwriting or deleting. So it only ever destroys a VM it created itself at a free VMID.
- Hard guard in cleanup. `cleanup_vm` refuses to destroy any VMID that does not match the temp scheme (`TEMP_VMID_PREFIX` + a real VMID), so even a bug or misconfiguration cannot make it touch a production ID.
- Visible marker. The test VM is tagged `backup-validation-temp` with a "safe to delete" description, so it is obvious in `qm list` / the web UI.
⚠️ Note: the restored test VM inherits the production VM's name (the full config is restored). The name is therefore not a safe discriminator; only the VMID is. Make sure the prefixed IDs can't collide with real VMs: with the default prefix `9`, for example, avoid having a production VM `9105` while testing `105`. Keeping production VMIDs in the 100–999 range (Proxmox's convention) avoids this.
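A guard of the kind described for `cleanup_vm` might look like the sketch below. `is_temp_vmid` and `safe_destroy` are hypothetical names, and "a real VMID" is approximated here as "a number of at least 100", the lowest ID Proxmox assigns; the script's own check may be stricter.

```shell
# Succeeds only if $1 is TEMP_VMID_PREFIX followed by a plausible source VMID.
is_temp_vmid() {
  guard_prefix=${TEMP_VMID_PREFIX:-9}
  case "$1" in
    ''|*[!0-9]*) return 1 ;;     # empty or not purely numeric
    "$guard_prefix"*) ;;         # starts with the prefix: keep checking
    *) return 1 ;;
  esac
  guard_src=${1#"$guard_prefix"}
  case "$guard_src" in
    ''|0*) return 1 ;;           # nothing after the prefix, or a leading zero
  esac
  [ "$guard_src" -ge 100 ]
}

# Dry-run destroy: prints the command instead of running it.
safe_destroy() {
  if ! is_temp_vmid "$1"; then
    echo "refusing to destroy VMID '$1': not a temporary ID" >&2
    return 1
  fi
  echo qm destroy "$1" --purge   # drop the leading "echo" to execute for real
}
```

With the default prefix, `safe_destroy 9105` prints the destroy command while `safe_destroy 105` refuses. A pattern check like this cannot tell a temporary `9105` from a production VM that happens to use the same ID, which is why the ID ranges must not overlap.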
Guest agent not responding
- Check `systemctl status qemu-guest-agent` inside the VM
- Confirm `qm set <VMID> --agent enabled=1` on the original VM

HTTP check fails but the service is up
- The check runs against `127.0.0.1` inside the guest. If the service binds to a specific interface (not `0.0.0.0`/`localhost`), use a `tcp` override instead: the TCP check uses `ss -ltn` and matches any listen address.

Temporary VMID busy
- Look for leftovers (the pattern assumes the default prefix `9`, so temporary IDs look like `9105`): `qm list | grep '^ *9'`
- Clean up manually: `qm destroy <VMID> --purge 1`

Restore fails
- Check destination storage: `pvesm status`
- Check free space: `df -h`

Another run in progress
- A `flock` prevents overlapping runs. If a previous run is still going (long restore), the new one aborts. Lock file: `/var/lock/backup_validation.lock`.
Tobias Pandolfo (@Tobidp) — LinkedIn
MIT © 2026 Tobias Pandolfo. Free to use and modify, as long as the original copyright notice is kept.