- Overview
- Usage
- Installation
- Configuration
- VM State Handling
- Backup Types & Strategies
- Checkpoint System
- Rotation Policies
- Archive Chain Management
- Backup Lifecycle & Retention Management
- Failure Detection & Self-Remediation
- Security & Permissions
- TPM Module Integration
- SQLite Logging System
- Session Summary & Tracking
- Email Reporting System
- Interrupt Recovery
- File Inventory
- Replication Architecture
- Directory Structure Evolution: Three-Month Extrapolation
- Known Issues & Mitigations
- Exit Codes
100% vibe coded. Could be 100% wrong.
Appropriate testing in any and all environments is required. Build your own confidence that the backups work.
Backups are only as good as your restores. All backups are worthless if you cannot recover from them. vmrestore might be your answer.
vmbackup and vmrestore are two halves of one system, and as of 0.6.0 they ship together as a single package containing two binaries. vmbackup backs up — vmrestore restores. Both carry the same version, draw on one shared lib/ of cross-tool helpers, and read from a single SQLite catalogue, so the two halves can no longer drift apart by accident. vmrestore still operates standalone for disaster recovery — it restores directly from the backup files even when the catalogue is unavailable.
vmbackup is a production-grade but production untested backup manager for libvirt/KVM virtual machines. It wraps virtnbdbackup (v2.28+) to provide:
- Incremental/full backup automation with checkpoint tracking
- Multi-state VM handling (running, shut off, paused)
- QEMU guest agent integration for application-consistent snapshots
- TPM state preservation for Windows BitLocker / Linux Secure Boot
- Automatic recovery from checkpoint corruption
- SQLite-based activity logging with structured event tables
- Email notifications after backup completion
- Archive chain preservation for point-in-time recovery
- Per-instance configuration for different environments
- Rotation policies (daily, weekly, monthly, accumulate, never)
- Accurate session tracking with separate counts for backed-up, excluded, skipped, and failed VMs
- Unified packaging with vmrestore — one install, one version, a shared SQLite catalogue
- Restore-session tracking in that catalogue, queryable with vmbackup --status --restores
vmbackup executes in a fixed sequence: load the configuration instance, verify dependencies, acquire the session lock and open a SQLite session. It then discovers all VMs via virsh list and applies exclude filters.
Each VM is processed in turn — policy and period boundary checks, five-phase state validation (VM state, checkpoint-chain continuity, directory analysis, state classification, and a validation summary), backup type decision (full, auto, or copy), optional FSTRIM via the QEMU agent, then virtnbdbackup execution with TPM backup if applicable. Every outcome is logged to SQLite.
After all VMs are processed, retention cleanup removes expired archives, replication syncs to local and cloud destinations, and a session summary is written. An email report is sent if configured.
vmbackup <MODE> [OPTIONS]
Requires root. vmbackup runs as a systemd timer by default but can be invoked manually.
Exactly one mode is required per invocation. Modes are mutually exclusive.
| Mode | Purpose |
|---|---|
| --run | Start a backup session. Backs up all VMs (or those specified by --vm), then runs replication and retention. |
| --prune <target> | Remove backup data without running a backup. See On-Demand Cleanup for targets. |
| --replicate-only [scope] | Run replication without backing up. Scope: local, cloud, or both (default). Cannot combine with --vm. |
| --status [report] | Query backup history, restore history, failures, chain health, storage and policies. Read-only: no locks, no session. See Status Reports. |
| --cancel-replication | Signal a running session to stop its replication phase. Backups in progress are not affected. Cannot combine with any other flag. |
| --config-prune-removed | Comment out configuration variables that have been removed from the codebase in the running release. Idempotent and reversible (lines are commented, not deleted). Operates on default/ and all custom instances; skips template/. Use with --dry-run to preview. |
| --cleanup-stale-manifests | Remove leftover per-VM chain-manifest.json files. The chain-manifest subsystem was retired in 0.6.0; the SQLite catalogue is now the single source of truth for chain state. Run automatically by the package on upgrade; idempotent and safe to re-run manually. |
| Option | Applies to | Description |
|---|---|---|
| --vm NAME | --run, --prune, --status | Target specific VM(s), comma-separated. With --run, replication is skipped. With --prune, only a single VM name is accepted. With --status, selects a VM for --status (history) or --status --chains (detail). |
| --dry-run | --run, --prune, --replicate-only | Preview without writing anything. |
| --config-instance NAME | all modes | Load config from config/NAME/ instead of config/default/. |
| --yes, -y | --prune | Skip confirmation prompt (for scripted use). |
| --days N | --status | Time window in days (default: 1 = today). |
| --csv | --status | CSV output with both raw and human-readable columns. |
| --all-instances | --status (sessions) | Show sessions from all config instances sharing the active database (default: scoped to --config-instance). |
| --help, -h | n/a | Show help and exit. |
| --version | n/a | Show version and exit. |
The following combinations are rejected at startup:
| Rejected combination | Reason |
|---|---|
| --run + --prune | Backup and cleanup are separate operations |
| --run + --replicate-only | Replication already runs as part of --run |
| --replicate-only + --vm | Replication operates on the entire backup path |
| --prune + --replicate-only | Separate operations |
| --cancel-replication + anything | Standalone signal only |
| --prune + multiple VMs | Prune requires a single VM name |
| --status + --run | Status is read-only; backup is a separate operation |
| --status + --prune | Separate operations |
| --status + --replicate-only | Separate operations |
| --vm without --run, --prune or --status | --vm modifies a mode; it is not a mode itself |
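The rules in the table above can be approximated with a small argument check. This is a minimal sketch, not vmbackup's actual parser: the function name validate_modes is hypothetical, and it only covers the "exactly one mode" and "--vm needs a compatible mode" rules.

```shell
# Hypothetical helper: returns 0 if the flag combination would be accepted.
validate_modes() {
    local mode="" count=0 has_vm=0 arg
    for arg in "$@"; do
        case "$arg" in
            --run|--prune|--replicate-only|--status|--cancel-replication)
                mode="$arg"
                count=$((count + 1)) ;;
            --vm)
                has_vm=1 ;;
        esac
    done
    # Modes are mutually exclusive: zero or several modes is an error.
    if [ "$count" -ne 1 ]; then
        echo "error: exactly one mode is required (got $count)" >&2
        return 1
    fi
    # --vm only modifies --run, --prune or --status.
    if [ "$has_vm" -eq 1 ]; then
        case "$mode" in
            --run|--prune|--status) ;;
            *)
                echo "error: --vm cannot be combined with $mode" >&2
                return 1 ;;
        esac
    fi
    return 0
}
```

For example, `validate_modes --run --vm web` succeeds, while `validate_modes --replicate-only --vm web` is rejected.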
A global PID lock ($STATE_DIR/vmbackup.pid) prevents concurrent vmbackup invocations. If a scheduled backup is already running, a manual invocation will fail with a clear error.
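A PID-file lock of this kind typically works as sketched below. The function name acquire_pid_lock is hypothetical and the real implementation may differ; note that this simple check-then-write version is not atomic and is shown only to illustrate the idea.

```shell
# Hypothetical helper: refuse to start if the PID file names a live process.
acquire_pid_lock() {
    local pidfile="$1" owner=""
    if [ -f "$pidfile" ]; then
        owner=$(cat "$pidfile" 2>/dev/null)
        # kill -0 sends no signal; it only tests whether the PID exists.
        if [ -n "$owner" ] && kill -0 "$owner" 2>/dev/null; then
            echo "vmbackup is already running (PID $owner)" >&2
            return 1
        fi
    fi
    # No lock file, or a stale one left by a dead process: take the lock.
    echo "$$" > "$pidfile"
}
```

A caller would run something like `acquire_pid_lock "$STATE_DIR/vmbackup.pid" || exit 1` and remove the file on exit.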
# Full backup (all VMs, default config)
sudo vmbackup --run
# Single VM
sudo vmbackup --run --vm web
# Multiple VMs
sudo vmbackup --run --vm web,db,mail
# Preview without writing anything
sudo vmbackup --run --dry-run
# Named config instance
sudo vmbackup --run --config-instance prod
# Preview with named config
sudo vmbackup --run --config-instance test --dry-run
# Show backup inventory
sudo vmbackup --prune list
# Single VM inventory
sudo vmbackup --prune list --vm myvm
# Remove archived chains for a VM
sudo vmbackup --prune archives --vm myvm
# Preview archive removal
sudo vmbackup --prune archives --dry-run
# Remove everything for a VM (destructive)
sudo vmbackup --prune all --vm myvm --yes
# Replication only (both local + cloud)
sudo vmbackup --replicate-only
# Preview local replication only
sudo vmbackup --replicate-only local --dry-run
# Cancel replication on a running session
sudo vmbackup --cancel-replication
# Status reports (read-only — no locks, no session)
sudo vmbackup --status # Today's sessions
sudo vmbackup --status --days 7 # Last 7 days
sudo vmbackup --status --vm web # VM backup history
sudo vmbackup --status --failures --days 30 # Failures last 30 days
sudo vmbackup --status --replication --days 7 # Replication status
sudo vmbackup --status --chains # Chain health overview
sudo vmbackup --status --chains --vm web # Chain detail for one VM
sudo vmbackup --status --storage # Storage per VM
sudo vmbackup --status --policies # Rotation policy summary
sudo vmbackup --status --restores --days 30 # Restore history (from vmrestore)
sudo vmbackup --status --failures --csv # CSV export
sudo vmbackup --status --all-instances --days 7 # Sessions across all instances
vmbackup is a wrapper around virtnbdbackup and will not function without it. Install virtnbdbackup before vmbackup. See Dependencies for the full list of required and optional packages.
Download the latest .deb from Releases:
# Download the latest vmbackup_<version>_all.deb from the Releases page, then:
sudo dpkg -i vmbackup_*_all.deb
The single .deb installs both vmbackup and vmrestore. When upgrading from a standalone vmrestore install, the old package is removed automatically because the package declares Provides/Replaces/Conflicts: vmrestore.
git clone https://github.com/doutsis/vmbackup.git
cd vmbackup
Option 1: Build .deb package (Debian / Ubuntu):
make package
sudo dpkg -i build/vmbackup_*.deb
Option 2: Direct install (any distro):
sudo make install
Both methods install to /opt/vmbackup/ and set up:
- vmbackup and vmrestore commands in PATH (symlinks in /usr/local/bin/ to vmbackup.sh and vmrestore.sh)
- AppArmor snippet for virtnbdbackup socket access
- systemd service and timer units (enabled but not started; see below)
- root:backup ownership with 750/640 permissions
Systemd timer: The installer enables vmbackup.timer so it activates on next boot, but does not start it immediately. To begin scheduled backups now:
sudo systemctl start vmbackup.timer # daily at 1:00 AM
The timer status shows the next scheduled run:
systemctl status vmbackup.timer
The packaged install (.deb) and make install both create the backup group, apply root:backup ownership, and bootstrap the runtime directories. The manual steps below reproduce that model explicitly; omitting the group/ownership/runtime-directory steps will leave the tree inconsistent with the Security & Permissions model (group-based read access will not work, and /var/log/vmbackup/ and /run/vmbackup will be missing).
# Clone the repository
# (writing under /opt normally requires root)
sudo git clone https://github.com/doutsis/vmbackup.git /opt/vmbackup
# Create symlinks for both commands
sudo ln -sf /opt/vmbackup/vmbackup.sh /usr/local/bin/vmbackup
sudo ln -sf /opt/vmbackup/vmrestore.sh /usr/local/bin/vmrestore
# Create the backup group and apply the root:backup ownership model
sudo groupadd --system backup 2>/dev/null || true
sudo chown -R root:backup /opt/vmbackup
sudo chmod 750 /opt/vmbackup
# Bootstrap the runtime log/run directories (root:backup, 750)
sudo mkdir -p /var/log/vmbackup /var/log/vmrestore /run/vmbackup
sudo chown root:backup /var/log/vmbackup /var/log/vmrestore /run/vmbackup
sudo chmod 750 /var/log/vmbackup /var/log/vmrestore /run/vmbackup
# Install AppArmor snippet (Debian/Ubuntu only)
sudo cp /opt/vmbackup/apparmor/libvirt-qemu.local \
/etc/apparmor.d/local/abstractions/libvirt-qemu
for f in /etc/apparmor.d/libvirt/libvirt-*; do
[[ "$f" == *.files || "$(basename "$f")" == "TEMPLATE.qemu" ]] && continue
sudo apparmor_parser -r "$f"
done
# Install systemd timer
sudo cp /opt/vmbackup/systemd/vmbackup.service /etc/systemd/system/
sudo cp /opt/vmbackup/systemd/vmbackup.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now vmbackup.timer
# Create initial configuration
# (the tree is root:backup 750 by this point, so root is needed to write to it)
sudo cp -r /opt/vmbackup/config/template /opt/vmbackup/config/default
sudo nano /opt/vmbackup/config/default/vmbackup.conf
Dependencies are not installed by these steps. The manual path (like make install) installs vmbackup files only; provision the prerequisites from the Dependencies table separately.
| Tool | Package | Required | Notes |
|---|---|---|---|
| virtnbdbackup | virtnbdbackup ≥2.28 | Yes | The backup engine that vmbackup wraps |
| bash | bash ≥5.0 | Yes | |
| virsh | libvirt-daemon-system | Yes | Libvirt VM management |
| qemu-img | qemu-utils | Yes | Disk image operations |
| sqlite3 | sqlite3 | Yes | Activity logging database |
| jq | jq | Yes | JSON parsing |
| ionice | util-linux | Yes | I/O priority control |
| swtpm | swtpm | For TPM VMs | Software TPM emulation |
| QEMU Guest Agent | qemu-guest-agent (in guest) | For online VMs | Filesystem quiescing and FSTRIM |
| msmtp | msmtp | For email reports | Email notifications |
| rsync | rsync | For replication | Local replication sync |
| rclone | rclone | For cloud replication | Cloud replication |
When installed via the .deb package, all required packages except virtnbdbackup are pulled in automatically as Debian dependencies. make install installs the vmbackup files only — it does not run a package manager, so on the source-install path you must provision the prerequisites yourself (see the table above). virtnbdbackup is always installed separately, from your distro's package manager or directly from the virtnbdbackup project.
Development and testing have been done on Debian 13 with virtnbdbackup 2.28 from Debian's repository.
vmbackup runs as root via systemd and enforces an EUID check at startup — it will refuse to run unprivileged because every step (libvirt checkpoint manipulation, virtnbdbackup NBD sockets, TPM state read, BACKUP_PATH writes under restrictive umask) needs root. This is asymmetric with vmrestore, which deliberately allows non-root invocation for read-only listings, metadata dumps, and disk extraction into user-writable paths — a deliberate concession to disaster-recovery workflows that may run on a recovery laptop without sudo configured.
No special user account is needed for backup operations. However, non-root users who need to interact with backups or VMs should be added to the appropriate groups:
| Group | Purpose | Who needs it | Command |
|---|---|---|---|
| backup | Read access to backup data, logs, configs, and scripts under BACKUP_PATH and /opt/vmbackup/ | Any user who browses backups or checks logs | sudo usermod -aG backup <username> |
| libvirt | Read-only access to libvirt/virsh (VM listing, status, domain info) | Any user who needs virsh list, virsh dominfo | sudo usermod -aG libvirt <username> |
# Example: grant both groups to a user
sudo usermod -aG backup,libvirt myuser
# Log out and back in (or: newgrp backup) for group membership to take effect
The backup group (GID 34) is a standard Debian system group. The .deb package creates it if it doesn't exist. The libvirt group is created by the libvirt-daemon-system package.
Note: Group membership changes require a new login session. Running newgrp backup in an existing shell applies the group for that shell only.
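To see whether a membership change has reached the current session, the group list reported by `id -nG` can be checked. A minimal sketch follows; the helper name has_group is hypothetical.

```shell
# Hypothetical helper: has_group GROUP "space separated group list"
has_group() {
    local wanted="$1" g
    for g in $2; do
        if [ "$g" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}

# Typical use:
#   has_group backup "$(id -nG)"          # groups of the current session
#   has_group backup "$(id -nG myuser)"   # groups recorded for the account
```

If the second form succeeds but the first does not, the account has the group but the current session predates the change.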
On Debian/Ubuntu systems with AppArmor enabled, QEMU is restricted from creating virtnbdbackup's NBD sockets in /var/tmp/ by default. This causes backups to fail with:
Failed to bind socket to /var/tmp/virtnbdbackup.XXXX: Permission denied
Detection: vmbackup.sh detects a missing AppArmor override during check_dependencies() and logs an error with remediation commands. The backup will not proceed until the override is installed.
Fix via .deb package (recommended):
The .deb package installs the AppArmor snippet automatically to /etc/apparmor.d/local/abstractions/libvirt-qemu and reloads all libvirt VM profiles during postinst. No manual action is required when installing via the package.
Manual fix:
sudo mkdir -p /etc/apparmor.d/local/abstractions
echo '# Allow virtnbdbackup NBD sockets (installed by vmbackup)
/var/tmp/virtnbdbackup.* rwk,' | sudo tee /etc/apparmor.d/local/abstractions/libvirt-qemu
# Reload all libvirt VM profiles
for f in /etc/apparmor.d/libvirt/libvirt-*; do
[[ "$f" == *.files || "$(basename "$f")" == "TEMPLATE.qemu" ]] && continue
sudo apparmor_parser -r "$f"
done
Note: If VIRTNBD_SCRATCH_DIR is changed from the default /var/tmp, the AppArmor rule path must match.
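One way to confirm that the override matches the scratch directory is to search the snippet for the expected rule. This is a sketch under the assumption that the rule is written exactly as in the manual fix above; apparmor_rule_present is a hypothetical name, not the check that check_dependencies() performs.

```shell
# Hypothetical helper: apparmor_rule_present FILE [SCRATCH_DIR]
apparmor_rule_present() {
    local file="$1" scratch="${2:-/var/tmp}"
    [ -r "$file" ] || return 1
    # -F treats the rule as a fixed string, so ".*" is matched literally.
    grep -qF "${scratch}/virtnbdbackup.* rwk," "$file"
}
```

For example: `apparmor_rule_present /etc/apparmor.d/local/abstractions/libvirt-qemu "$VIRTNBD_SCRATCH_DIR"`.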
The backup script is scheduled via a systemd timer which is installed and enabled automatically by the .deb package or make install. The default timer runs daily at 1:00 AM.
The service has a default timeout of 12 hours (TimeoutStartSec=43200). If a backup run exceeds this, systemd will terminate it. Monitor your backup durations and increase this value with sudo systemctl edit vmbackup.service if needed.
/lib/systemd/system/vmbackup.timer:
[Unit]
Description=VM Backup Timer
Documentation=https://github.com/doutsis/vmbackup
[Timer]
# Run daily at 1:00 AM, with up to 5 minutes random delay to avoid thundering herd
OnCalendar=*-*-* 01:00:00
Persistent=true
RandomizedDelaySec=300
[Install]
WantedBy=timers.target
/lib/systemd/system/vmbackup.service:
[Unit]
Description=VM Backup Service
Documentation=https://github.com/doutsis/vmbackup
After=libvirtd.service
Requires=libvirtd.service
[Service]
Type=oneshot
ExecStart=/opt/vmbackup/vmbackup.sh --run
# Uses the 'default' config instance at /opt/vmbackup/config/default/
# To use a different config instance:
# sudo systemctl edit vmbackup.service
# Then add:
# [Service]
# ExecStart=
# ExecStart=/opt/vmbackup/vmbackup.sh --run --config-instance myinstance
TimeoutStartSec=43200
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=5
# Security: restrict file creation permissions
# Dirs → 750 (rwxr-x---), Files → 640 (rw-r-----)
# Belt-and-suspenders with umask 027 in vmbackup.sh itself.
UMask=0027
# Logging goes to journald (accessible via: journalctl -u vmbackup.service)
StandardOutput=journal
StandardError=journal
SyslogIdentifier=vmbackup
[Install]
WantedBy=multi-user.target
# Check timer status
systemctl list-timers vmbackup.timer
# Manual run
sudo systemctl start vmbackup.service
# View logs
journalctl -u vmbackup.service -n 100
After installing the .deb package, the config/default/ directory contains template configuration with safe defaults (features disabled, placeholder values). You must configure at minimum vmbackup.conf before running backups.
The steps below configure the default instance. vmbackup supports multiple config instances — isolated sets of configuration for different environments (e.g., production, test, dr). To create additional instances, copy config/template/ to a new directory and configure it independently. See Configuration Instances for full details.
sudo nano /opt/vmbackup/config/default/vmbackup.conf
Minimum settings to review for a real deployment:
| Setting | What to set | Example |
|---|---|---|
| BACKUP_PATH | Directory where backups are stored (must exist; see below) | /mnt/backup/vms/ |
| LOG_LEVEL | Verbosity: ERROR, WARN, INFO, DEBUG | INFO |
Both settings have runtime defaults in vmbackup.sh (BACKUP_PATH falls back to /mnt/backup/vms/, LOG_LEVEL to INFO), and the shipped config/default/vmbackup.conf already carries explicit values, so vmbackup will not fail merely because you didn't touch them. They are listed here because a production install should set BACKUP_PATH deliberately rather than rely on the fallback, not because the runtime contract strictly requires them.
vmbackup does not create BACKUP_PATH — it must exist before the first run. If the directory is missing, vmbackup exits with an error (Backup path does not exist).
sudo mkdir -p /mnt/backup/vms
On first run, vmbackup automatically applies the correct ownership and permissions via ensure_backup_path_sgid():
| Property | Value | Meaning |
|---|---|---|
| Owner | root:backup | Root writes backups; backup group members can read |
| Mode | 2750 | rwxr-x--- with SGID bit set |
| SGID effect | Group inheritance | All subdirectories and files created under BACKUP_PATH automatically inherit the backup group |
This means:
- You only need to mkdir; vmbackup handles ownership and permissions automatically
- All VM backup directories, state files, logs, and chain data inherit backup group via SGID
- Users in the backup group can browse and read all backup data without sudo
- The _state/ directory (SQLite DB, logs, temp files) is created under BACKUP_PATH with the same inherited permissions
- Exception: TPM private keys (tpm-state/) have SGID stripped and are locked to root:root 600
NFS mounts: If BACKUP_PATH is on NFS with root_squash (the default), root is mapped to nobody and cannot set ownership. Either export with no_root_squash or pre-set the directory ownership on the NFS server.
Optional but recommended settings to review:
- DEFAULT_ROTATION_POLICY: default retention policy for all VMs (monthly is the default)
- RETENTION_MONTHS / RETENTION_WEEKS / RETENTION_DAYS: how many periods to keep
- PROCESS_PRIORITY, IO_PRIORITY_CLASS, IO_PRIORITY_LEVEL: resource usage tuning
- ENABLE_FSTRIM: trim VM disks before backup (reduces size, requires QEMU guest agent). Enabled by default.
- FSTRIM_MINIMUM: minimum TRIM extent (bytes). Default 1048576 (1 MiB). Linux only; Windows ignores this.
- FSTRIM_EXCLUDE_FILE: file containing VM name patterns to exclude from FSTRIM
- ENABLE_AUTO_RECOVERY_ON_CHECKPOINT_CORRUPTION: set to "yes" for unattended self-healing
Most settings have sensible defaults baked into vmbackup.sh — the config file only needs to contain values you want to override.
sudo nano /opt/vmbackup/config/default/vm_overrides.conf
Set per-VM rotation policies if you want different VMs to have different retention:
declare -gA VM_POLICY
VM_POLICY["critical-db"]="accumulate" # Never auto-delete
VM_POLICY["dev-sandbox"]="daily" # Short retention
VM_POLICY["template-vm"]="never" # Skip entirely
If you have no per-VM overrides, leave this file as-is.
sudo nano /opt/vmbackup/config/default/exclude_patterns.conf
Add glob patterns to exclude VMs by name:
EXCLUDE_PATTERNS+=("test-*") # Exclude all test VMs
EXCLUDE_PATTERNS+=("*-template") # Exclude template VMs
Leave empty (EXCLUDE_PATTERNS=()) to back up all VMs.
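Glob matching of a VM name against such patterns can be sketched as follows. The function name is_excluded is hypothetical and this is not necessarily how vmbackup evaluates the list; it only illustrates what the patterns above would match.

```shell
# Hypothetical helper: is_excluded VM_NAME PATTERN...
# Returns 0 if VM_NAME matches any of the glob patterns.
is_excluded() {
    local name="$1" pattern
    shift
    for pattern in "$@"; do
        # An unquoted variable in a case pattern is interpreted as a glob.
        case "$name" in
            $pattern) return 0 ;;
        esac
    done
    return 1
}
```

With the patterns above, `is_excluded test-web "test-*" "*-template"` succeeds and `is_excluded web "test-*" "*-template"` does not. In bash the configured array would be passed as `"${EXCLUDE_PATTERNS[@]}"`.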
sudo nano /opt/vmbackup/config/default/email.conf
Prerequisites: msmtp must be installed and configured at /etc/msmtprc.
| Setting | What to set |
|---|---|
| EMAIL_ENABLED | "yes" to enable email reports |
| EMAIL_RECIPIENT | Your alert email address |
| EMAIL_SENDER | Must match your SMTP config |
Other settings control conditional sending (success-only, failure-only, etc.) and what data to include in reports. See the file comments for details.
sudo nano /opt/vmbackup/config/default/replication_local.conf
Configures rsync-based replication to local/NAS/remote storage after backups complete.
To enable local replication:
- Set REPLICATION_ENABLED="yes"
- Configure at least one destination:
  - Set DEST_1_ENABLED="yes"
  - Set DEST_1_PATH to your replication target (local mount, NFS, etc.)
  - Choose DEST_1_TRANSPORT: local (rsync to any mounted path: local disk, NFS, virtiofs, pre-mounted CIFS). Additional transports can be added by implementing the transport contract.
- For SSH transport, also set DEST_N_HOST, DEST_N_USER, DEST_N_PORT, DEST_N_SSH_KEY
If you don't need local replication, set REPLICATION_ENABLED="no" (this is the template default). Individual destinations can be toggled with DEST_N_ENABLED when the master switch is on.
sudo nano /opt/vmbackup/config/default/replication_cloud.conf
Configures rclone-based replication to cloud storage (SharePoint, Backblaze B2, etc.).
To enable cloud replication:
- Install rclone: sudo apt install rclone
- Configure an rclone remote: rclone config (or use sharepoint_auth.sh for SharePoint)
- Set CLOUD_REPLICATION_ENABLED="yes"
- Configure a cloud destination:
  - Set CLOUD_DEST_1_ENABLED="yes"
  - Set CLOUD_DEST_1_REMOTE to your rclone remote name (e.g., sharepoint:)
  - Set CLOUD_DEST_1_PATH to the destination folder
See the Cloud Authentication (SharePoint) section for SharePoint-specific setup.
If you don't need cloud replication, set CLOUD_REPLICATION_ENABLED="no" (this is the template default). Individual destinations can be toggled with CLOUD_DEST_N_ENABLED when the master switch is on.
# Dry run to verify configuration (no backups performed)
sudo vmbackup --run --dry-run
# Check the systemd timer is active
systemctl list-timers vmbackup.timer
# View upcoming schedule
systemctl status vmbackup.timer
The systemd timer runs daily at 01:00 by default. To change the schedule or use a different config instance:
# Create a systemd override
sudo systemctl edit vmbackup.timer # For schedule changes
sudo systemctl edit vmbackup.service # For config instance changes
To change the config instance, add to the service override:
[Service]
ExecStart=
ExecStart=/opt/vmbackup/vmbackup.sh --run --config-instance production
To change the schedule, add to the timer override:
[Timer]
OnCalendar=
OnCalendar=*-*-* 02:30:00
To run multiple backup configurations (e.g., different schedules for different VM groups):
# Copy the template to a new instance
sudo cp -r /opt/vmbackup/config/template /opt/vmbackup/config/production
# Edit the new instance
sudo nano /opt/vmbackup/config/production/vmbackup.conf
# Test it
sudo vmbackup --run --config-instance production --dry-run
Configuration instances provide complete isolation between different backup environments. Each instance has its own set of configuration files, allowing different settings for production, test, disaster recovery, or other purposes.
| Command | Config Used |
|---|---|
| ./vmbackup.sh --run | config/default/ |
| ./vmbackup.sh --run --config-instance test | config/test/ |
| ./vmbackup.sh --run --config-instance myinstance | config/myinstance/ |
config/
├── default/ # Default configuration (customize after install)
│ ├── vmbackup.conf # Main settings + replication order
│ ├── vm_overrides.conf # Per-VM rotation policies
│ ├── exclude_patterns.conf # Glob patterns to exclude VMs
│ ├── fstrim_exclude.conf # VM patterns to exclude from FSTRIM
│ ├── email.conf # Email notification settings
│ ├── replication_local.conf # Local/NAS replication destinations
│ └── replication_cloud.conf # Cloud replication destinations
│
├── test/ # Test environment (isolated from production)
│ ├── vmbackup.conf
│ ├── vm_overrides.conf
│ ├── exclude_patterns.conf
│ ├── fstrim_exclude.conf
│ ├── email.conf
│ ├── replication_local.conf
│ └── replication_cloud.conf
│
└── template/ # Template for creating new instances
├── vmbackup.conf # Full documentation of all options
├── vm_overrides.conf
├── exclude_patterns.conf
├── fstrim_exclude.conf
├── email.conf
├── replication_local.conf
└── replication_cloud.conf
# Copy template to new instance
cp -r config/template config/production
# Edit the new instance
nano config/production/vmbackup.conf
# Use the new instance
./vmbackup.sh --run --config-instance production
All config files are self-documenting: config/template/ contains every setting with inline comments. Only settings you want to override need to be present; omitted settings use code defaults. This section highlights groupings, gotchas, and non-obvious interactions.
| Setting | Default | Description |
|---|---|---|
| BACKUP_PATH | (required) | Backup root directory. Must exist before first run (mkdir -p). SGID applied automatically on first run. |
| DISK_ABORT_PCT | 20 | Pre-flight aborts if free space on BACKUP_PATH drops below this percent. 0 disables the percent abort. |
| DISK_WARN_PCT | 30 | Pre-flight logs a warning (does not abort) if free space drops below this percent. 0 disables. |
| DISK_ABORT_GB | 10 | Pre-flight aborts if absolute free space drops below this many GB. 0 disables the GB abort. Either threshold can fire independently. |
| DISK_WARN_GB | 50 | Pre-flight warns if absolute free space drops below this many GB. 0 disables. |
| PROCESS_PRIORITY | 10 | CPU nice: −20 (highest) to 19 (lowest) |
| IO_PRIORITY_CLASS | 2 | ionice class: 1=realtime, 2=best-effort, 3=idle |
| IO_PRIORITY_LEVEL | 5 | 0–7 within class (0=highest) |
| VIRTNBD_COMPRESS_LEVEL | 4 | LZ4 level. 1–2=fast (~1000 MiB/s), 3–16=HC (~50 MiB/s). See gotchas. |
| DEFAULT_ROTATION_POLICY | monthly | daily / weekly / monthly / accumulate / never. never=excluded from backup. |
| RETENTION_DAYS | 7 | Periods kept under daily policy |
| RETENTION_WEEKS | 4 | Periods kept under weekly policy |
| RETENTION_MONTHS | 3 | Periods kept under monthly policy |
| RETENTION_ORPHAN_ENABLED | true | Age-based cleanup of orphaned period dirs. See gotchas. |
| RETENTION_ORPHAN_MAX_AGE_DAYS | 90 | Delete orphans older than this |
| RETENTION_ORPHAN_MIN_AGE_DAYS | 7 | Safety buffer before eligible |
| RETENTION_ORPHAN_DRY_RUN | false | Log actions but don't delete |
| ACCUMULATE_WARN_DEPTH | 100 | Warn when chain depth exceeds this (template recommends 30) |
| ACCUMULATE_HARD_LIMIT | 365 | Force archive at this depth (template recommends 60) |
| CHECKPOINT_FORCE_FULL_ON_DAY | 1 | Day of month to force a full backup (resets chain) |
| CHECKPOINT_HEALTH_CHECK | yes | Validate checkpoint chain integrity before backup |
| LOG_LEVEL | INFO | ERROR / WARN / INFO / DEBUG. Controls stderr only; log file always gets everything. |
| MAX_RETRIES | 3 | Retry failed VM backups (0=disable). Retries convert incremental→full. |
| VIRTNBD_SCRATCH_DIR | /var/tmp | Temp directory for NBD operations |
| ENABLE_FSTRIM | true | Pre-backup TRIM via QEMU guest agent |
| FSTRIM_MINIMUM | 1048576 | Minimum TRIM extent bytes. Linux only; Windows ignores this. |
| FSTRIM_TIMEOUT | 300 | Linux guest FSTRIM timeout (seconds) |
| FSTRIM_WINDOWS_TIMEOUT | 600 | Windows guest FSTRIM timeout (seconds) |
| FSTRIM_EXCLUDE_FILE | fstrim_exclude.conf | File of VM name patterns to skip |
| SKIP_OFFLINE_UNCHANGED_BACKUPS | true | When true, skip offline VMs whose disk files have not been modified since the last backup. When false, offline VMs are backed up unconditionally on every run. |
| ENABLE_AUTO_RECOVERY_ON_CHECKPOINT_CORRUPTION | yes | yes=self-heal, warn=log remediation steps but fail, no=fail immediately |
| REPLICATION_ORDER | simultaneous | simultaneous / local_first / cloud_first |
| STATE_BACKUP_KEEP_DAYS | 90 | Days to keep daily _state/ snapshots |
| LOG_KEEP_DAYS | 30 | Days to keep log files (0=disable) |
| LOG_MAX_BYTES | 52428800 (50 MiB) | Size cap for the central vmbackup.log / vmprune.log before rotation. Oversized file is rolled to <name>.<epoch> and aged out via LOG_KEEP_DAYS. See Log Retention & Cleanup. |
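The four DISK_* thresholds in the table interact as sketched below: either abort threshold fires on its own, and 0 disables a threshold. The function disk_preflight is a hypothetical illustration that takes free space as integer arguments; it is not the pre-flight code in vmbackup.sh.

```shell
# Hypothetical helper: disk_preflight FREE_PERCENT FREE_GB
# Prints "abort", "warn" or "ok"; returns 1 only for "abort".
disk_preflight() {
    local free_pct="$1" free_gb="$2"
    local abort_pct="${DISK_ABORT_PCT:-20}" warn_pct="${DISK_WARN_PCT:-30}"
    local abort_gb="${DISK_ABORT_GB:-10}" warn_gb="${DISK_WARN_GB:-50}"

    # Either abort threshold can fire independently; 0 disables it.
    if { [ "$abort_pct" -gt 0 ] && [ "$free_pct" -lt "$abort_pct" ]; } ||
       { [ "$abort_gb" -gt 0 ] && [ "$free_gb" -lt "$abort_gb" ]; }; then
        echo "abort"
        return 1
    fi
    if { [ "$warn_pct" -gt 0 ] && [ "$free_pct" -lt "$warn_pct" ]; } ||
       { [ "$warn_gb" -gt 0 ] && [ "$free_gb" -lt "$warn_gb" ]; }; then
        echo "warn"
        return 0
    fi
    echo "ok"
}
```

With the defaults, 45% free but only 8 GB free still aborts, because the GB threshold fires even though the percent threshold does not.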
Settings documented in specialist sections. A few real, code-backed knobs live with their feature rather than in this table: BITLOCKER_KEY_EXTRACTION and BITLOCKER_EXEC_TIMEOUT (see TPM Module Integration), and the full log-rotation behaviour around LOG_MAX_BYTES (see Log Retention & Cleanup). They are valid in vmbackup.conf even though they are detailed elsewhere.
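The size-cap rotation described for LOG_MAX_BYTES (roll an oversized file to <name>.<epoch>) might look roughly like this. The function name rotate_if_oversized is hypothetical, and the strict "greater than" comparison is an assumption; see Log Retention & Cleanup for the actual behaviour.

```shell
# Hypothetical helper: rotate_if_oversized LOGFILE
rotate_if_oversized() {
    local file="$1" max="${LOG_MAX_BYTES:-52428800}" size
    [ -f "$file" ] || return 0
    # Arithmetic expansion strips any padding that wc may emit.
    size=$(( $(wc -c < "$file") ))
    if [ "$size" -gt "$max" ]; then
        # Roll to <name>.<epoch>; ageing out is left to LOG_KEEP_DAYS.
        mv "$file" "$file.$(date +%s)"
    fi
}
```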
I/O priority profiles (tested estimates for typical VM workloads):
| Profile | CPU | I/O Class | I/O Level | Speed | Use Case |
|---|---|---|---|---|---|
| Minimal Impact | 19 | 3 (idle) | — | ~1–3 MB/s | Production hours |
| Balanced | 10 | 2 | 5 | ~15–30 MB/s | Default |
| Normal | 0 | 2 | 4 | ~50–80 MB/s | Dedicated window |
| Maximum | −10 | 1 | 4 | ~200+ MB/s | Emergency/DR |
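The profiles above map onto the standard nice and ionice commands. The sketch below builds the command prefix from the three settings; build_priority_prefix is a hypothetical name, and how vmbackup applies the values internally is not shown in this section.

```shell
# Hypothetical helper: prints a nice/ionice prefix for the configured priorities.
build_priority_prefix() {
    local nice_level="${PROCESS_PRIORITY:-10}"
    local io_class="${IO_PRIORITY_CLASS:-2}"
    local io_level="${IO_PRIORITY_LEVEL:-5}"
    if [ "$io_class" -eq 3 ]; then
        # The idle class has no priority level.
        echo "nice -n $nice_level ionice -c 3"
    else
        echo "nice -n $nice_level ionice -c $io_class -n $io_level"
    fi
}
```

The printed prefix would be placed in front of the backup command, for example `nice -n 10 ionice -c 2 -n 5 <command>` for the Balanced profile.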
Gotchas:
- LZ4 level 0 is broken in virtnbdbackup ≤2.28; vmbackup detects this and bumps to 1. Fast mode (1–2) and HC mode (3–16) produce virtually identical compression ratios for VM disk data (~1.30× vs ~1.31×), but HC is ~15× slower.
- Orphan retention: when a VM's rotation policy changes (e.g. weekly→monthly), old period directories become "orphaned" and invisible to count-based retention. The four RETENTION_ORPHAN_* settings provide age-based cleanup. Start with RETENTION_ORPHAN_DRY_RUN=true to preview what would be removed.
- FSTRIM on Windows is significantly slower without the VirtIO discard_granularity XML fix (see Known Issues). VMs without a QEMU guest agent are skipped automatically; the exclude file is for VMs that have an agent but should still be excluded (e.g., database servers where TRIM causes latency spikes, legacy guests with buggy drivers).
- Each config instance should use a unique BACKUP_PATH. The directory structure is $BACKUP_PATH/<vm_name>/<period>/, where <period> depends on the rotation policy (YYYYMMDD daily, YYYY-Www weekly, YYYYMM monthly; accumulate has no period subdirectory). See Period ID Generation.
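The period formats listed above can be reproduced with GNU date. This is a sketch only: period_id is a hypothetical name, and using ISO week numbering (%G-W%V) for the weekly format is an assumption; Period ID Generation describes the real rules.

```shell
# Hypothetical helper: period_id POLICY [DATE]   (requires GNU date for -d)
period_id() {
    local policy="$1" day="${2:-today}"
    case "$policy" in
        daily)      date -d "$day" +%Y%m%d ;;
        weekly)     date -d "$day" +%G-W%V ;;   # ISO week: an assumption
        monthly)    date -d "$day" +%Y%m ;;
        accumulate) echo "" ;;                  # no period subdirectory
        *)
            echo "no period for policy: $policy" >&2
            return 1 ;;
    esac
}
```

For 15 March 2025 this yields 20250315 (daily), 2025-W11 (weekly) and 202503 (monthly).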
Override the default rotation policy for specific VMs. Format is a bash associative array:
declare -gA VM_POLICY
VM_POLICY["critical-db"]="accumulate" # Never delete
VM_POLICY["template-win11"]="never" # Exclude from backups
See the rotation policy options above for available values.
Exclude VMs by name using glob patterns. Format: EXCLUDE_PATTERNS+=("pattern").
EXCLUDE_PATTERNS+=("test-*") # Exclude test VMs
EXCLUDE_PATTERNS+=("*-template") # Exclude templates
One glob pattern per line. # comments and blank lines ignored. VMs matching any pattern are silently skipped (logged as "excluded"). Useful for database servers (TRIM latency spikes), legacy guests with buggy VirtIO drivers, or VMs with known guest agent issues. VMs without an agent are skipped automatically; this file is for VMs that have an agent but should be excluded.
| Setting | Default | Description |
|---|---|---|
| EMAIL_ENABLED | no | Master switch (disabled by default; set to yes to enable) |
| EMAIL_RECIPIENT | (required) | Destination address |
| EMAIL_SENDER | (required) | From address |
| EMAIL_HOSTNAME | system hostname | Override hostname in subject |
| EMAIL_SUBJECT_PREFIX | [VM Backup] | Subject line prefix |
| EMAIL_ON_SUCCESS | yes | Send on successful backup |
| EMAIL_ON_FAILURE | yes | Send on failure |

Replication summaries and the (future) disk-usage section are rendered automatically when their underlying data is present — there is no per-section toggle.
Global settings apply to all destinations. Per-destination settings use a numbered DEST_N_ prefix.
| Setting | Default | Description |
|---|---|---|
| REPLICATION_ENABLED | no | Master switch |
| REPLICATION_ON_FAILURE | continue | continue / abort |
| REPLICATION_SPACE_CHECK | skip | skip / warn / disabled |
| REPLICATION_MIN_FREE_PERCENT | 10 | Minimum free space % for warn mode |

Per-destination (replace N with 1, 2, etc.):
| Setting | Example | Description |
|---|---|---|
| DEST_N_ENABLED | no | Enable this destination |
| DEST_N_NAME | local-backup | Human-readable label |
| DEST_N_TRANSPORT | local | Transport driver: local, ssh, nfs, smb |
| DEST_N_PATH | /mnt/backups | Destination path (or HOST:PATH for ssh) |
| DEST_N_SYNC_MODE | mirror | mirror (--delete) / accumulate |
| DEST_N_BWLIMIT | 0 | KB/s bandwidth limit (0=unlimited) |
| DEST_N_VERIFY | size | none / size / checksum |

SSH destinations additionally require DEST_N_HOST, DEST_N_USER, DEST_N_PORT, and DEST_N_SSH_KEY.
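
The numbered DEST_N_ prefix lends itself to bash indirect expansion. The following is only a sketch of how enabled destinations might be enumerated; list_enabled_destinations is a hypothetical helper, and stopping at the first N with no DEST_N_ENABLED set is an assumption rather than documented behaviour.

```bash
#!/usr/bin/env bash
# Print "N:NAME" for each enabled destination, scanning DEST_1_, DEST_2_, ...
# Assumption: numbering is contiguous, so the scan stops at the first gap.
list_enabled_destinations() {
    local n=1 enabled_var name_var
    while :; do
        enabled_var="DEST_${n}_ENABLED"
        name_var="DEST_${n}_NAME"
        # Stop when DEST_N_ENABLED is not defined at all.
        [[ -v $enabled_var ]] || break
        if [[ "${!enabled_var}" == "yes" ]]; then
            printf '%s:%s\n' "$n" "${!name_var:-unnamed}"
        fi
        n=$(( n + 1 ))
    done
}
```

The same pattern would apply to the CLOUD_DEST_N_ settings.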
Global settings apply to all cloud destinations. Per-destination settings use a numbered CLOUD_DEST_N_ prefix.
| Setting | Default | Description |
|---|---|---|
| CLOUD_REPLICATION_ENABLED | no | Master switch |
| CLOUD_REPLICATION_SCOPE | everything | everything / archives-only / monthly |
| CLOUD_REPLICATION_SYNC_MODE | mirror | mirror / accumulate-all / accumulate-valid |
| CLOUD_REPLICATION_POST_VERIFY | checksum | none / size / checksum |
| CLOUD_REPLICATION_ON_FAILURE | continue | continue / abort |
| CLOUD_REPLICATION_DEFAULT_BWLIMIT | 0 | KB/s (0=unlimited) |
| CLOUD_REPLICATION_EXPIRY_WARN_DAYS | 30 | Token expiry warning threshold |
| CLOUD_REPLICATION_EXPIRY_CRITICAL_DAYS | 7 | Token expiry critical threshold |
| CLOUD_REPLICATION_LOG_LEVEL | (inherits) | Override main LOG_LEVEL for cloud verbosity |
| CLOUD_REPLICATION_USE_LOCKFILE | yes | Concurrent replication lock |
| CLOUD_REPLICATION_LOCK_TIMEOUT | 3600 | Seconds before stale lock is broken |
| CLOUD_REPLICATION_VM_EXCLUDE | (empty) | Comma-separated VM names to skip |
| CLOUD_REPLICATION_DRY_RUN | no | Preview mode |

Per-destination (replace N with 1, 2, etc.):
| Setting | Example | Description |
|---|---|---|
| CLOUD_DEST_N_ENABLED | no | Enable this destination |
| CLOUD_DEST_N_NAME | sharepoint-backup | Human-readable label |
| CLOUD_DEST_N_PROVIDER | sharepoint | Provider type |
| CLOUD_DEST_N_REMOTE | sharepoint: | rclone remote name |
| CLOUD_DEST_N_PATH | VMBackups | Folder within the remote |
| CLOUD_DEST_N_SCOPE | (global) | Per-dest scope override |
| CLOUD_DEST_N_SYNC_MODE | (global) | Per-dest sync mode override |
| CLOUD_DEST_N_BWLIMIT | 0 | Per-dest bandwidth limit |
| CLOUD_DEST_N_VERIFY | (global) | Per-dest verify override |
| CLOUD_DEST_N_MAX_SIZE | 250G | Provider file size limit (e.g. SharePoint) |
| CLOUD_DEST_N_SECRET_EXPIRY | (empty) | Credential expiry date (YYYY-MM-DD) |

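
As a rough illustration of how CLOUD_DEST_N_SECRET_EXPIRY could be checked against the warn and critical thresholds, here is a small sketch. The function name expiry_status, the inclusive comparisons, and the output wording are all assumptions; it relies on GNU date.

```bash
#!/usr/bin/env bash
# expiry_status EXPIRY_DATE TODAY [WARN_DAYS] [CRITICAL_DAYS]
# Dates are YYYY-MM-DD. Defaults mirror the documented 30 / 7 day thresholds.
expiry_status() {
    local expiry="$1" today="$2" warn_days="${3:-30}" crit_days="${4:-7}"
    local expiry_s today_s days
    expiry_s=$(date -u -d "$expiry" +%s) || return 2
    today_s=$(date -u -d "$today" +%s) || return 2
    days=$(( (expiry_s - today_s) / 86400 ))
    if (( days <= crit_days )); then
        echo "critical ($days days)"
    elif (( days <= warn_days )); then
        echo "warning ($days days)"
    else
        echo "ok ($days days)"
    fi
}

expiry_status "2026-03-01" "2026-02-01"   # prints: warning (28 days)
```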
SharePoint requires delegated authentication (device code flow) for uploads longer than 1 hour.
Why Delegated Auth?
- Client credentials have a 1-hour hard token limit with no refresh
- Uploads >1 hour fail with 401 Unauthorized
- Delegated auth provides automatic token refresh via 90-day refresh token

Follow these steps in order when deploying vmbackup to a new host or enabling cloud replication for the first time.
sudo apt install rclone

Verify it's installed:

rclone version

Open the cloud replication config for your config instance (e.g., default):

sudo nano /opt/vmbackup/config/default/replication_cloud.conf

Set these values:

# Enable cloud replication
CLOUD_REPLICATION_ENABLED="yes"
# Configure destination
CLOUD_DEST_1_ENABLED="yes"
CLOUD_DEST_1_NAME="sharepoint-backup"
CLOUD_DEST_1_PROVIDER="sharepoint"
CLOUD_DEST_1_REMOTE="sharepoint:" # rclone remote name (must match step 3)
CLOUD_DEST_1_PATH="VMBackups" # folder name in SharePoint doc library

Save and close. The folder name in CLOUD_DEST_1_PATH will be created automatically on the first upload if it doesn't exist.

This interactive script configures the rclone remote and authenticates with SharePoint:
sudo /opt/vmbackup/cloud_transports/sharepoint_auth.sh --instance default

The script will:

- Read CLOUD_DEST_1_REMOTE and CLOUD_DEST_1_PATH from your config (step 2)
- Launch rclone config interactively. Follow the prompts:

  | Prompt | Answer |
  |---|---|
  | Storage type | onedrive |
  | Client ID | Leave blank (press Enter) |
  | Client secret | Leave blank (press Enter) |
  | Region | global |
  | Use web browser to authenticate | n (device code flow) |

- Device code flow: The script displays a URL and a code. Open the URL in any browser (even on your phone), sign in with your Microsoft 365 account, and enter the code.
- After authentication succeeds:

  | Prompt | Answer |
  |---|---|
  | Drive type | sharepoint |
  | Site URL | Your SharePoint site URL (e.g., https://contoso.sharepoint.com/sites/backups) |
  | Document library | Select from the listed libraries |

- The script verifies the connection and checks for the destination folder.

sudo /opt/vmbackup/cloud_transports/sharepoint_auth.sh --test-only

You should see the remote listing and available space. If this works, cloud replication is ready.

sudo vmbackup --run --config-instance default --dry-run

Check the output for cloud replication steps. Remove --dry-run to run for real.

| Symptom | Cause | Fix |
|---|---|---|
| 401 Unauthorized during upload | Token expired | Run sharepoint_auth.sh --instance default |
| Failed to get token | Server idle 90+ days, refresh token expired | Run sharepoint_auth.sh --instance default |
| Wrong folder | CLOUD_DEST_1_PATH doesn't match | Edit replication_cloud.conf, no re-auth needed |
| Wrong SharePoint site | Site URL baked into rclone config | Run sharepoint_auth.sh to reconfigure |

SharePoint cloud replication has three independently configurable layers:
| Layer | What it controls | Where configured | When set |
|---|---|---|---|
| SharePoint Site URL | Which SharePoint site to connect to | rclone config (interactive) | During initial setup or re-auth |
| Document Library | Which document library within the site | rclone config (stored as drive_id) | During initial setup or re-auth |
| Folder | Folder within the document library | CLOUD_DEST_N_PATH in replication_cloud.conf | Any time via config edit |

Example mapping:
rclone remote "sharepoint:" → https://contoso.sharepoint.com/sites/backups
└── Document Library: "Shared Documents" (drive_id in rclone.conf)
└── Folder: "VMBackups" (CLOUD_DEST_1_PATH in vmbackup config)
Interactive helper script for configuring or re-authenticating rclone with any SharePoint site. Supports multiple vmbackup config instances, auto-discovers settings, and works on headless servers via device code flow.
Usage:
# Auto-discover instances and choose interactively
sudo ./cloud_transports/sharepoint_auth.sh
# Use settings from a specific vmbackup config instance
sudo ./cloud_transports/sharepoint_auth.sh --instance dev
# Specify remote name and folder directly
sudo ./cloud_transports/sharepoint_auth.sh --remote sharepoint --folder VMBackups
# Test existing connection without re-authenticating
sudo ./cloud_transports/sharepoint_auth.sh --test-only
# Use a specific rclone config file
sudo ./cloud_transports/sharepoint_auth.sh --config /path/to/rclone.conf

Options:

| Option | Default | Description |
|---|---|---|
| --remote NAME | sharepoint | rclone remote name |
| --folder PATH | from config instance | Folder in document library to verify/create |
| --instance NAME | auto-detect | vmbackup config instance (e.g., dev, default) |
| --config FILE | /root/.config/rclone/rclone.conf | rclone config file path |
| --test-only | — | Test existing connection, don't re-authenticate |

Instance auto-discovery: When run without --folder or --instance, the script scans config/*/replication_cloud.conf and presents a menu showing each instance's CLOUD_DEST_1_REMOTE and CLOUD_DEST_1_PATH.
Interactive rclone config flow:
During the rclone config step, you will choose:
- Storage type: onedrive (Microsoft OneDrive)
- Client ID/secret: Leave blank (uses rclone's default Microsoft app)
- Region: global (unless in a special region)
- Web browser auth: n (triggers device code flow; works on headless servers)
- Drive type: sharepoint (SharePoint site documentLibrary)
- Site URL: Your SharePoint site URL (e.g., https://contoso.sharepoint.com/sites/backups)
- Document Library: Select from the listed libraries (stored as drive_id)

Token Lifecycle:
flowchart TD
A["Initial Setup<br/>(one-time)"] --> B["Normal Operation<br/>(automatic)"]
B --> C{"Access token<br/>expires (1hr)"}
C --> D["rclone uses<br/>refresh token"]
D --> E["New access token<br/>issued"]
E --> C
B --> F{"Idle 90+ days?"}
F -->|No| C
F -->|Yes| G["Re-auth needed<br/>Run sharepoint_auth.sh"]
G --> A
When to re-run:
- Initial setup (first time on a new host)
- After 90+ days of inactivity (refresh token expired)
- 401 Unauthorized or token expired errors
- Changing to a different SharePoint site or document library

rclone Config Location: /root/.config/rclone/rclone.conf

Note: The rclone config is global (per-host, under root), not per-instance. Multiple vmbackup instances on the same host share the same rclone remote but differentiate by CLOUD_DEST_N_PATH (folder) in each instance's config.

All modules share a unified logging system with configurable verbosity levels.
| Level | Numeric | Description |
|---|---|---|
| ERROR | 3 | Errors only (quietest) |
| WARN | 2 | Errors + warnings |
| INFO | 1 | Normal operation (default) |
| DEBUG | 0 | Verbose debugging |

Important: All messages are always written to the log file regardless of level. The LOG_LEVEL setting controls which messages appear on screen (stderr).
Set in vmbackup.conf per instance:
# config/default/vmbackup.conf - Production (quiet)
LOG_LEVEL="INFO"
# config/test/vmbackup.conf - Testing (verbose)
LOG_LEVEL="DEBUG"

All modules inherit LOG_LEVEL from vmbackup.conf:
- vmbackup.sh: Uses LOG_LEVEL directly
- replication_cloud_module.sh: Inherits LOG_LEVEL unless CLOUD_REPLICATION_LOG_LEVEL is explicitly set
- Transport drivers: Delegate to main log_*() functions, automatically respect LOG_LEVEL

For cloud-specific verbosity (e.g., debugging SharePoint issues while keeping main logs quiet):
# config/*/replication_cloud.conf
# Uncomment only if you need cloud-specific verbosity different from main LOG_LEVEL
CLOUD_REPLICATION_LOG_LEVEL="debug" # Override: debug|info|warn|error

All log messages follow a consistent format:
[timestamp] [LEVEL] [module.sh] [function] message
Example output:
[2026-02-03 14:30:15 AEDT] [INFO] [vmbackup.sh] [perform_backup] Starting backup for win11-pro
[2026-02-03 14:30:16 AEDT] [DEBUG] [replication_cloud_module.sh] [replicate_to_cloud] Using transport: sharepoint
[2026-02-03 14:30:17 AEDT] [INFO] [cloud_transport_sharepoint.sh] [upload_files] Uploading 3 files to SharePoint
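
To make the level semantics concrete, here is a minimal sketch of a logger that always writes to the log file and only echoes to stderr when the message level is at or above LOG_LEVEL. The names log_sketch, LEVEL_NUM, and LOG_FILE are hypothetical; vmbackup's own log_*() functions are not reproduced here.

```bash
#!/usr/bin/env bash
# Numeric levels as documented: DEBUG=0, INFO=1, WARN=2, ERROR=3.
declare -A LEVEL_NUM=([DEBUG]=0 [INFO]=1 [WARN]=2 [ERROR]=3)
LOG_LEVEL="${LOG_LEVEL:-INFO}"
LOG_FILE="${LOG_FILE:-/tmp/vmbackup-sketch.log}"   # hypothetical path

# log_sketch LEVEL MODULE FUNCTION MESSAGE...
log_sketch() {
    local level="$1" module="$2" func="$3"
    shift 3
    local line
    line="[$(date '+%Y-%m-%d %H:%M:%S %Z')] [$level] [$module] [$func] $*"
    # Every message goes to the log file regardless of level.
    printf '%s\n' "$line" >> "$LOG_FILE"
    # Only messages at or above LOG_LEVEL reach the screen (stderr).
    if (( ${LEVEL_NUM[$level]} >= ${LEVEL_NUM[$LOG_LEVEL]} )); then
        printf '%s\n' "$line" >&2
    fi
}
```

With LOG_LEVEL="WARN", a call such as log_sketch INFO vmbackup.sh perform_backup "Starting backup" is recorded in the file but not shown on screen.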
| Instance | LOG_LEVEL | Use Case |
|---|---|---|
| default | INFO | Production - clean output |
| test | DEBUG | Development - full verbosity |
| template | INFO | Documentation reference |

virsh domstate determines the VM's power state. The backup method and type follow from the state and agent availability:
| VM State | Condition | Method | Backup Type |
|---|---|---|---|
| Running | QEMU agent available | FSFREEZE + backup | auto or full |
| Running | No QEMU agent | Pause VM + backup | auto or full |
| Shut off | Disk changed since last backup | — | copy |
| Shut off | Disk unchanged | — | Skip |
| Paused | — | Treated as running | auto or full |
With QEMU Guest Agent:
- Uses virsh qemu-agent-command to verify agent responsiveness
- Enables FSFREEZE for application-consistent snapshots
- Backup type: auto (incremental) or full (boundary)

Without QEMU Guest Agent:
- VM is paused during backup to ensure crash-consistent snapshot
- Pause state monitored to ensure resume after backup completes
- Backup type: auto or full

Disk Change Detection:
Compares disk mtime against last backup timestamp
├── mtime > last_backup → Disk changed → ARCHIVE + COPY backup
└── mtime ≤ last_backup → Unchanged → SKIP backup
Key Behaviors:
- Clean shutdown always modifies disk (mtime updates)
- First offline day after running = archive chain + copy backup
- Subsequent offline days with no changes = skip backup
- Copy backup preserved, not chain
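
The mtime comparison described above can be sketched in a few lines. disk_changed_since is a hypothetical helper, and passing the last backup time as epoch seconds is an assumption about how the timestamp is stored; it uses GNU stat.

```bash
#!/usr/bin/env bash
# disk_changed_since DISK_IMAGE LAST_BACKUP_EPOCH
# Returns 0 if the disk image was modified after the last backup,
# 1 if it is unchanged, 2 if the file cannot be read.
disk_changed_since() {
    local disk="$1" last_backup_epoch="$2" mtime
    mtime=$(stat -c %Y -- "$disk") || return 2
    (( mtime > last_backup_epoch ))
}
```

A caller would archive the chain and take a copy backup when this returns 0, and skip the VM when it returns 1.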
Treated identically to running VMs. The script's pause/resume logic handles backup coordination.
| Condition | Backup Type | Command Flag |
|---|---|---|
| Month boundary (new month) | full | virtnbdbackup -l full |
| Day 01 AND no existing valid data | full | virtnbdbackup -l full |
| Day 01 AND existing valid chain | auto | virtnbdbackup -l auto |
| Recovery flag present | full | virtnbdbackup -l full |
| Offline VM with changes | copy | virtnbdbackup -l copy |
| First backup ever | full | virtnbdbackup -l full |
| Normal daily backup | auto | virtnbdbackup -l auto |

flowchart TD
A["Backup Attempt"] --> B{"Result?"}
B -->|Success| C["Complete"]
B -->|Failure| D{"Backup type?"}
D -->|AUTO| E["Convert to FULL<br/>Retry"]
E --> A
D -->|FULL| F{"Retries left?"}
F -->|Yes| G["Archive chain<br/>Re-validate<br/>Wait"] --> A
F -->|No| H["Failed"]
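
The retry flow in the diagram can be approximated as a loop. This is a sketch only: backup_with_retry, the caller-supplied backup command, and RETRY_WAIT_SECONDS are hypothetical, and the archive and re-validate steps are left as a placeholder comment.

```bash
#!/usr/bin/env bash
# backup_with_retry MAX_RETRIES TYPE CMD [ARGS...]
# Runs "CMD ARGS... TYPE". CMD stands in for the real backup invocation.
backup_with_retry() {
    local max_retries="$1" type="$2"
    shift 2
    local retries=0
    while :; do
        "$@" "$type" && return 0
        if [[ "$type" == "auto" ]]; then
            # AUTO failed: convert to FULL and try again immediately.
            type="full"
            continue
        fi
        # FULL failed: give up once the retry budget is spent.
        (( retries < max_retries )) || return 1
        retries=$(( retries + 1 ))
        # Placeholder for: archive chain, re-validate state, then wait.
        sleep "${RETRY_WAIT_SECONDS:-0}"
    done
}
```

With a retry budget of 2 and a command that always fails, the sequence of attempts is auto, full, full, full before the function returns non-zero.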
| Location | Purpose | Created By |
|---|---|---|
| /var/lib/libvirt/qemu/checkpoint/<vm>/ | Primary libvirt checkpoint metadata | virsh checkpoint-create |
| <backup_dir>/checkpoints/ | Backup-local checkpoint copy | virtnbdbackup |
| <backup_dir>/<vm>.cpt | JSON array of checkpoint names | virtnbdbackup |
| QEMU qcow2 bitmaps | Dirty block tracking inside disk | QEMU/libvirt |

The first backup creates a full image (vda.full.data), the initial checkpoint (checkpoint.0), and the checkpoint list (vm.cpt). Subsequent backups append incrementals (vda.inc.*.data) and new checkpoints, growing the chain. At the period boundary the entire chain is archived, all checkpoints are cleared, and a fresh full backup starts the next chain.
The validate_backup_state() function performs 5-phase validation:
| State | Meaning | Action |
|---|---|---|
| clean | No issues detected | Proceed with backup |
| copy_backup | Valid offline copy backup exists | Archive copy → FULL backup |
| stale_metadata | Old metadata without data | Clean metadata → FULL |
| broken_chain | Checkpoint/bitmap mismatch | Archive chain → FULL |
| missing_backup_data | Checkpoints but no .data files | FULL backup |
| incomplete_backup | Partial .partial files | Clean partial → FULL |

Each rotation policy uses a period format that determines the backup directory name and when chains are archived:
| Policy | Period Format | Period Boundary | Example ID |
|---|---|---|---|
| daily | YYYYMMDD | New day | 20260206 |
| weekly | YYYY-Www (ISO) | Monday | 2026-W06 |
| monthly | YYYYMM | 1st of month | 202602 |
| accumulate | None | None (chain grows) | N/A |

The get_period_id() function generates these identifiers:
get_period_id "daily" "2026-02-06" → "20260206"
get_period_id "weekly" "2026-02-06" → "2026-W06"
get_period_id "monthly" "2026-02-06" → "202602"
get_period_id "monthly" "2026-03-01" → "202603" # ← PERIOD BOUNDARY

When the current date crosses a period boundary, the active chain is archived and a new full backup starts. Retention settings (RETENTION_DAYS, RETENTION_WEEKS, RETENTION_MONTHS) control how many period folders are kept. See Rotation Policies and Tier 2: Orphaned Policy Retention for details.
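
For readers who want to reproduce these identifiers by hand, the formats map directly onto GNU date format strings. The function below is an illustrative stand-in named period_id_sketch, not the actual get_period_id() implementation.

```bash
#!/usr/bin/env bash
# period_id_sketch POLICY DATE   (DATE is YYYY-MM-DD; requires GNU date)
period_id_sketch() {
    local policy="$1" day="$2"
    case "$policy" in
        daily)      date -d "$day" +%Y%m%d ;;
        weekly)     date -d "$day" +%G-W%V ;;   # ISO year and ISO week number
        monthly)    date -d "$day" +%Y%m ;;
        accumulate) echo "" ;;                  # no period subdirectory
        *)          return 1 ;;
    esac
}

period_id_sketch weekly "2026-02-06"    # prints: 2026-W06
```

Using %G with %V (rather than %Y) keeps the year correct for ISO weeks that straddle a year boundary.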
This section provides exhaustive documentation of how backup chains are created, archived, and managed over time, including the effects of policy changes.
<backup_dir>/.archives/
├── chain-2026-01-14/ # Archived chain from Jan 14
│ ├── checkpoints/
│ │ ├── virtnbdbackup.0.xml
│ │ ├── virtnbdbackup.1.xml
│ │ └── virtnbdbackup.2.xml
│ ├── vda.full.data
│ ├── vda.inc.virtnbdbackup.1.data
│ ├── vda.inc.virtnbdbackup.2.data
│ └── vm.cpt
├── chain-2026-01-14.1/ # Collision handling (same day)
└── chain-2026-01-17/ # Another archived chain
.archives/chain-YYYY-MM-DD[.N]/
- Date is when the archive was created (not when chain started)
- .N suffix disambiguates multiple archives created on the same day across separate invocations (collision handling)

Within a single vmbackup invocation, the same VM can never be archived twice. Before 0.6.0 a code path that re-entered the archival routine for an already-archived VM in the same run produced a spurious collision-suffixed directory (.archives/chain-<date>.1) holding a duplicate of the chain just moved. A per-VM in-session guard now short-circuits the second attempt, so a .N suffix only ever results from genuinely separate invocations landing on the same calendar day — never from a single run archiving one VM twice.
| File Pattern | Description | Archived? |
|---|---|---|
| *.full.data | Full backup data | Yes |
| *.full.data.chksum | Full backup checksum | Yes |
| *.inc.virtnbdbackup.*.data | Incremental data | Yes |
| *.inc.*.data.chksum | Incremental checksums | Yes |
| *.cpt | Checkpoint name list | Yes |
| checkpoints/ | Checkpoint XML directory | Yes |
| *.qcow.json | QCOW metadata | No (recreated) |
| vmconfig.*.xml | VM config snapshots | No (kept in config/) |

| Trigger | Condition | Action |
|---|---|---|
| Period boundary | New period (day/week/month based on policy) | Archive current chain, start FULL |
| Policy change | Rotation policy differs from last backup | Archive current chain, start FULL |
| Offline VM changes | Disk modified since last backup | Archive chain → Copy backup |
| Running VM full reset | Orphaned data detected | Archive before overwrite |
| Online transition | VM started, copy backup exists | Archive copy → Fresh FULL |
Chain with N incrementals:
Archive Size ≈ Full_Size + (N × Avg_Incremental_Size)
Example (27 restore points):
11.4 GiB (full) + 26 × 250 MB = 11.4 + 6.5 = ~18 GiB
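
The same estimate can be done with shell integer arithmetic if everything is expressed in one unit. This sketch uses MiB throughout; estimate_archive_mib is a hypothetical helper and the input figures simply restate the example above (11.4 GiB is roughly 11674 MiB).

```bash
#!/usr/bin/env bash
# estimate_archive_mib FULL_MIB INCREMENTAL_COUNT AVG_INCREMENTAL_MIB
estimate_archive_mib() {
    local full_mib="$1" n_inc="$2" avg_inc_mib="$3"
    echo $(( full_mib + n_inc * avg_inc_mib ))
}

estimate_archive_mib 11674 26 250   # prints: 18174
```

18174 MiB is about 17.7 GiB, in line with the rounded ~18 GiB figure.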
Backups move through a defined lifecycle from creation to eventual removal. The retention system is responsible for deciding when data is eligible for deletion, and the file management layer is responsible for how that deletion is executed, audited, and surfaced to operators.
flowchart LR
create["Create<br/>Full or incremental<br/>backup"]
active["Active<br/>Active chain in<br/>period directory"]
archive["Archive<br/>Period rolls,<br/>chain archived<br/>to .archives/"]
retain{"Retain / Delete<br/>Tier 1: Count<br/>Tier 2: Orphan<br/>Manual: CLI"}
create --> active --> archive --> retain
retain -->|soft-delete| mark["Soft-mark<br/>in DB (review)"]
retain -->|automated| rm["rm -rf<br/>period directory"]
mark -->|confirm| rm
rm --> tombstone["Tombstone in<br/>chain_health<br/>+ audit logs"]
Retention runs automatically for every VM in every session, not only after successful backups. Three entry points cover the full lifecycle:
post_backup_hook() (vmbackup_integration.sh)
├── run_retention_for_vm() Tier 1: Active policy retention
│ ├── get_vm_periods() List period dirs matching current policy
│ ├── count > retention_limit? Count-based threshold
│ └── _remove_period() Delete oldest excess periods
│
└── run_orphan_retention_for_vm() Tier 2: Orphaned policy retention
├── get_orphaned_periods() Find dirs from previous rotation policies
├── calculate_orphan_age() Age-based threshold (DB-backed)
└── _remove_orphan_period() Delete orphans exceeding max age
run_retention_for_unbacked_vm() (modules/retention_module.sh)
Invoked from the skip and exclude paths in vmbackup.sh main() so that
VMs which never reach post_backup_hook() still get retention enforced.
├── _remove_empty_period_dirs() Stub cleanup (zero-data period dirs)
├── run_retention_for_vm() Tier 1 (skipped only — excluded VMs
│ short-circuit at policy=never)
└── run_orphan_retention_for_vm() Tier 2
The trigger argument propagates from each entry point through every helper into the retention_events.triggered_by audit column. Defined values: post_backup, skipped, excluded, prune, orphan_retention, orphan_dir. The action column adds remove_stub for empty-period cleanup.
Removes the oldest period directories when the count exceeds the retention limit for the current rotation policy.
| Policy | Retention Setting | Default | Effect |
|---|---|---|---|
| daily | RETENTION_DAYS | 7 | Keep 7 daily period folders |
| weekly | RETENTION_WEEKS | 4 | Keep 4 weekly period folders |
| monthly | RETENTION_MONTHS | 3 | Keep 3 monthly period folders |
| accumulate | N/A | N/A | Chain depth limit only (see §8) |
| never | N/A | N/A | Excluded; no backups created |

Decision flow: count_vm_periods() counts filesystem directories matching the current policy format → if count > limit → head -n $excess selects the oldest → _remove_period() on each.
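
The count-based selection can be sketched as follows. select_excess_periods is a hypothetical name; the sketch relies on the fact that the period formats sort chronologically as plain strings, and it takes the period IDs as arguments instead of reading the filesystem.

```bash
#!/usr/bin/env bash
# select_excess_periods LIMIT PERIOD_ID...
# Prints the oldest period IDs that exceed the retention limit, one per line.
select_excess_periods() {
    local limit="$1"
    shift
    local count=$#
    (( count > limit )) || return 0
    local excess=$(( count - limit ))
    # Lexical sort puts the oldest period first for YYYYMMDD, YYYY-Www, YYYYMM.
    printf '%s\n' "$@" | sort | head -n "$excess"
}

select_excess_periods 3 202601 202510 202512 202511 202602
# prints:
# 202510
# 202511
```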
When a VM's rotation policy changes (e.g., weekly → monthly), old-format period directories become orphaned. Tier 2 uses age-based cleanup:
| Setting | Default | Purpose |
|---|---|---|
| RETENTION_ORPHAN_ENABLED | true | Master switch |
| RETENTION_ORPHAN_MIN_AGE_DAYS | 7 | Minimum age before deletion eligible |
| RETENTION_ORPHAN_MAX_AGE_DAYS | 90 | Delete orphans older than this |
| RETENTION_ORPHAN_DRY_RUN | false | Preview mode |

Decision flow: get_orphaned_periods() finds dirs not matching current policy format → calculate_orphan_age() queries DB for last successful backup age → age ≥ max_age → delete; min_age ≤ age < max_age → aging (kept); age < min_age → protected.
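
The three age bands translate into a short classifier. classify_orphan is a hypothetical helper; the defaults mirror RETENTION_ORPHAN_MIN_AGE_DAYS=7 and RETENTION_ORPHAN_MAX_AGE_DAYS=90.

```bash
#!/usr/bin/env bash
# classify_orphan AGE_DAYS [MIN_AGE_DAYS] [MAX_AGE_DAYS]
# Prints delete, aging, or protected according to the age thresholds.
classify_orphan() {
    local age="$1" min_age="${2:-7}" max_age="${3:-90}"
    if (( age >= max_age )); then
        echo "delete"
    elif (( age >= min_age )); then
        echo "aging"       # kept for now
    else
        echo "protected"   # too young to be considered
    fi
}
```

For example, classify_orphan 3 yields protected, classify_orphan 30 yields aging, and classify_orphan 90 yields delete.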
Policy change note: When a VM's rotation policy changes (e.g., daily → weekly), old-format period directories become orphans automatically. Tier 2 protects them for MIN_AGE days, then deletes them after MAX_AGE days. The accumulate policy uses chain depth limits instead of period-based retention, so it does not generate orphans.
Both tiers ultimately call _remove_period() or _remove_orphan_period(), which follow the same pipeline:
_remove_period(vm_name, period_id, dry_run)
│
├── [[ ! -d "$period_dir" ]] → return 0 # Already gone
│
├── dry_run == "true" → log + return 0 # Preview only
│
├── _is_safe_to_remove() # Safety validation:
│ ├── Path under BACKUP_PATH? # Prevent rm -rf /
│ ├── Not BACKUP_PATH itself? # Prevent rm -rf $BACKUP_PATH
│ ├── Is a directory? # Sanity check
│ └── Depth ≥ 2 (VM/period)? # Prevent rm -rf vm_dir/
│
├── sqlite_mark_chain_deleted() # DB: chain_status → 'deleted'
│
├── rm -rf "$period_dir" # Filesystem removal
│
├── log_retention_action() # Audit: retention_events table
│
└── log_file_operation() # Audit: file_operations table
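
The four safety checks are worth seeing in code, because they are what stands between a bad variable and an rm -rf. The function below is a sketch of the idea under the name is_safe_to_remove_sketch, not the real _is_safe_to_remove(); it takes BACKUP_PATH as an explicit argument and relies on GNU realpath.

```bash
#!/usr/bin/env bash
# is_safe_to_remove_sketch BACKUP_PATH TARGET
# Returns 0 only if TARGET is a directory at least two levels below BACKUP_PATH.
is_safe_to_remove_sketch() {
    local backup_path target rel
    # Normalise both paths so ".." segments cannot escape the backup root.
    backup_path=$(realpath -m -- "$1") || return 1
    target=$(realpath -m -- "$2") || return 1
    [[ "$target" != "$backup_path" ]] || return 1      # not BACKUP_PATH itself
    [[ "$target" == "$backup_path"/* ]] || return 1    # must live under BACKUP_PATH
    [[ -d "$target" ]] || return 1                     # must be a directory
    rel=${target#"$backup_path"/}
    [[ "$rel" == */* ]] || return 1                    # depth >= 2 (vm/period)
    return 0
}
```

Given a backup root of /mnt/backup/vms, the path /mnt/backup/vms/web-server/202507 passes (if it exists), while /mnt/backup/vms/web-server and /mnt/backup/vms itself are rejected.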
Every retention action produces records in multiple tables, providing a complete audit history:

The retention_events table records every retention decision, including dry runs and failures:

-- Example: Tier 1 deletes an old monthly period
INSERT INTO retention_events (
session_id, vm_name, action, target_type, target_path,
target_period, rotation_policy, retention_limit, current_count,
age_days, freed_bytes, triggered_by, success
) VALUES (
42, 'web-server', 'delete', 'period',
'/mnt/backup/vms/web-server/202507',
'202507', 'monthly', 6, 7,
210, 5368709120, 'post_backup', 1
);

The file_operations table records every file-level operation (create, move, delete):

-- Logged alongside the retention_events record
INSERT INTO file_operations (
session_id, operation, vm_name, source_path,
file_type, file_size_bytes, reason, triggered_by, success
) VALUES (
42, 'delete', 'web-server',
'/mnt/backup/vms/web-server/202507',
'directory', 5368709120, 'Retention cleanup', '_remove_period', 1
);

In chain_health, chain rows persist as tombstones after deletion, recording that the data existed and when it was removed:

| State | Meaning | Reversible |
|---|---|---|
| active | Chain is live, backups appending | N/A |
| archived | Chain archived (period rolled, policy change, error recovery) | No (archival is a move) |
| marked | Soft-marked in DB for pending removal (review/grace window before deletion) | Yes, until the period is removed |
| broken | Chain integrity compromised (signal, checkpoint corruption) | No (needs new full) |
| deleted | Period removed by retention or manual action | No (files gone from disk) |
| purged | Removed via explicit --prune operation | No (files gone from disk) |

-- After _remove_period(), chain_health row becomes:
UPDATE chain_health SET
chain_status = 'deleted',
restorable_count = 0,
break_reason = 'retention',
deleted_at = '2026-02-17 03:00:00',
updated_at = '2026-02-17 03:00:00'
WHERE vm_name = 'web-server' AND period_id = '202507';
-- Row is NEVER deleted; it serves as a permanent tombstone

archive_size_bytes population (0.6.0): the chain's archived byte count is now written to chain_health.archive_size_bytes at the moment of the archive transition, by both the active-path and the retention-path archive callers. Previously the retention-path caller left the column at 0 until a manual reconciliation ran, so storage reports under-counted archived data for chains rolled by retention. Both callers are now symmetric.
Standalone on-demand cleanup of backup data — archives, periods, or entire VMs — without running a backup session. Useful for reclaiming space on demand, removing decommissioned VMs, or surgically pruning specific archived chains. See Usage for CLI syntax and options.
| Target | What it deletes | Requires --vm | Example |
|---|---|---|---|
| list | Nothing (read-only discovery view) | No (all VMs) or Yes (single VM detail) | --prune list |
| archives | All .archives/ dirs across all periods | No (all VMs) or Yes (one VM) | --prune archives --vm my-vm |
| archives:<period> | .archives/ in a specific period only | Yes | --prune archives:202603 --vm my-vm |
| chain:<name> | One specific archived chain directory | Yes | --prune chain:chain-2026-01-14 --vm my-vm |
| period:<period_id> | Entire period directory (active chain + archives + config) | Yes | --prune period:202603 --vm my-vm |
| all | Everything for a VM (all periods, entire VM directory) | Yes | --prune all --vm my-vm |

Space-focused discovery view. Shows every purgeable target with its size so the operator can construct a --prune command. All-VMs view shows per-VM totals (period count, archive count, total size, archive size). Single-VM view adds per-period breakdown with individual archive chains and their sizes, plus copy-paste --prune commands as # comments.
Data sourced from filesystem only (find, du) — works even if the SQLite DB is missing.
list → choose target → --dry-run → execute
- Discover: --prune list to see sizes and copy-paste commands
- Preview: --prune <target> --vm <name> --dry-run to see what would be removed
- Execute: --prune <target> --vm <name> (confirmation prompt, or --yes to skip)

| Guard | Behaviour |
|---|---|
| Keep-last protection | period: target refuses to delete the last remaining period for a VM. Use all to explicitly remove everything. |
| Confirmation prompt | Interactive Y/N prompt before any destructive operation. Bypass with --yes. |
| _is_safe_to_remove() | Same path safety validation used by automated retention; prevents removal of paths outside BACKUP_PATH. |
| Dry-run logging | Dry runs are logged (tagged DRY RUN) for audit purposes. |

All prune operations record audit rows in the same tables used by automated retention:
| Table | What's recorded |
|---|---|
| chain_events | chain_purged event for each archived chain removed |
| period_events | period_deleted event when a period is removed |
| retention_events | Action, target path, freed bytes, triggered_by=prune |
| file_operations | Filesystem-level record of each deletion |
| chain_health | chain_status='purged', purged_at timestamp (tombstone row persists) |

All prune operations log to <BACKUP_PATH>/_state/logs/vmprune.log — separate from the main backup session log. Each entry includes timestamp, VM name, target, and bytes freed.
Back up one or more specific VMs without running a full scheduled session. See Usage for CLI syntax, combination rules, and examples.
- VM validation: each VM in the list is validated against libvirt (virsh dominfo). If any VM does not exist, vmbackup exits immediately with an error. No partial backups are started.
- Policy enforcement: per-VM rotation policies are respected. A VM with rotation_policy=never will be excluded even when explicitly targeted. The operator must change the policy first.
- Replication skipped: targeted backups do not trigger local or cloud replication. If replication is needed after a targeted backup, run --replicate-only as a separate invocation.
- Session tracking: targeted sessions are logged in SQLite with session_type='targeted' so they can be distinguished from scheduled full runs in queries and reports.
- Email report: the email report fires as normal, showing only the targeted VM(s).
- Dry-run: --vm web --dry-run works as expected, previewing the targeted backup without writing anything.

Re-run replication without performing a backup. No VMs are processed, no retention runs, no FSTRIM — just the replication phase against existing backup data. See Usage for CLI syntax, scope options, and examples.
The following normal-session operations are bypassed entirely:
- check_dependencies(): virtnbdbackup/virsh not needed
- VM discovery (virsh list)
- FSTRIM optimisation
- Stale qemu-nbd cleanup
- All backup and retention logic

Replicate-only respects your existing replication configuration:
| Config state | Behaviour |
|---|---|
| Local enabled, cloud enabled | Both run (or whichever scope was requested) |
| Local enabled, cloud disabled | Cloud skipped silently when the scope is both; clean exit when --replicate-only cloud is requested |
| Neither enabled | Exit 0 with log message (not an error; respects config) |
| REPLICATION_ORDER | Honoured (simultaneous, local_first, cloud_first) |
| Cancel flag set | Checked before each replication phase (graceful abort) |

A replicate-only session creates a sessions row with session_type='replicate_only'. Replication results log to replication_runs and replication_vms as usual. No vm_backups rows are created.
If email is configured, a report is sent with subject Replication Only — hostname — OK|FAILED, replication results only (no VM details), and a footer noting no backups were performed.
Read-only operational reporting against the SQLite database. No locks, no session lifecycle, no log files — --status queries the existing database and exits. Runs the lightest possible code path: config load, sqlite_init_readonly() (no WAL, no migrations), query, format, exit.
See Usage for CLI syntax and combination rules.
| Report | Command | What it shows |
|---|---|---|
| Sessions | --status | Per-session blocks rendered by job type: backup (VM table), prune (retention summary), replicate-only (per-endpoint table), mixed ([standard +repl]) and incomplete. Instance-scoped by default. |
| VM history | --status --vm web | Backup history for a specific VM (type, status, size, duration) |
| Failures | --status --failures | VMs with failures and failure counts |
| Replication | --status --replication | Replication runs (endpoint, transport, status, bytes, duration) |
| Chains | --status --chains | Chain health per VM (active, archived, purged, checkpoints) |
| Chain detail | --status --chains --vm web | Per-chain breakdown for one VM (period, status, policy) |
| Storage | --status --storage | Storage per VM (policy, backup count, avg full/incr, total, current chain) |
| Policies | --status --policies | Rotation policy summary (per-VM policy, retention limits, orphans) |
| Restores | --status --restores | Restore sessions logged by vmrestore (VM, mode, status, exit code, dry-run, duration, target path) |

--days N sets the time window in days (default: 1 = today). Applies to sessions, VM history, failures, replication and restore reports. Storage, chains and policies report across all history regardless of --days.
--csv switches output to CSV with both raw and human-readable columns. Byte values get an adjacent _hr column (e.g. bytes_written,bytes_written_hr). Duration values get a _hr column (e.g. duration_sec,duration_sec_hr). The policies CSV includes instance-level context columns (instance_default_policy, instance_retention_limit, orphan_max_age_days, orphan_min_age_days). The sessions CSV appends three row-count columns (vm_rows, repl_rows, retention_rows) used for client-side classification.
--all-instances (sessions report only) opts out of the default CONFIG_INSTANCE filter and lists sessions from every instance that has written to the active database. Useful when multiple config instances share a BACKUP_PATH.
Terminal output is pipe-delimited and passed through column -t for aligned columns. Byte values are formatted to human-readable units (GiB, MiB, KiB). Duration values are formatted as Xm Ys or Xs.
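The two conversions described above could be sketched as follows. This is an illustration only, not the code in status_module.sh: the function names human_bytes and human_duration are hypothetical, and rounding to one decimal place is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the real status_module.sh may round or label differently.

# Format a byte count as GiB, MiB, KiB or B (one decimal place assumed).
human_bytes() {
    awk -v b="$1" 'BEGIN {
        if      (b >= 1073741824) printf "%.1f GiB\n", b / 1073741824
        else if (b >= 1048576)    printf "%.1f MiB\n", b / 1048576
        else if (b >= 1024)       printf "%.1f KiB\n", b / 1024
        else                      printf "%d B\n", b
    }'
}

# Format seconds as "Xm Ys", or "Xs" when under a minute.
human_duration() {
    local s=$1
    if (( s >= 60 )); then
        printf '%dm %ds\n' $(( s / 60 )) $(( s % 60 ))
    else
        printf '%ds\n' "$s"
    fi
}
```

For example, human_duration 312 prints 5m 12s and human_bytes 1536 prints 1.5 KiB.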
The policies report prints a preamble before the table showing the instance-level defaults:
Instance default policy: monthly (RETENTION_MONTHLYS=3)
Orphan retention: enabled, max_age=90 days, min_age=7 days
If a query returns no data, --status prints No matching records. to stderr and exits 0.
# What happened today?
sudo vmbackup --status
# Last 7 days of sessions
sudo vmbackup --status --days 7
# Backup history for a specific VM
sudo vmbackup --status --vm web-server
# Any failures in the last 30 days?
sudo vmbackup --status --failures --days 30
# Replication status this week
sudo vmbackup --status --replication --days 7
# Chain health overview (all VMs, all time)
sudo vmbackup --status --chains
# Chain detail for one VM
sudo vmbackup --status --chains --vm web-server
# How much storage is each VM using?
sudo vmbackup --status --storage
# Rotation policy summary with retention status
sudo vmbackup --status --policies
# Restore history — what was restored, when, and whether it worked
sudo vmbackup --status --restores --days 30
# Export failures as CSV for a spreadsheet
sudo vmbackup --status --failures --days 30 --csv
# Export storage as CSV (includes raw bytes + human-readable)
sudo vmbackup --status --storage --csv
Three files, strict separation:
| File | Layer | Responsibility |
|---|---|---|
| vmbackup.sh | CLI | Flag parsing, conflict validation, dispatch |
| lib/sqlite_module.sh | Data | Query functions returning pipe-delimited or CSV rows |
| modules/status_module.sh | Presentation | Formatting, column alignment, human-readable conversion |
Adding a new report requires one query function in sqlite_module.sh, one _status_* function in status_module.sh, and one case in the dispatch. No changes to vmbackup.sh flag parsing unless a new sub-mode flag is needed.
Failure detection runs at four stages:
- Session startup: check_dependencies() aborts on missing tools, check_backup_destination() aborts if the backup path is unwritable, check_disk_space() aborts when free space drops below the configured thresholds (DISK_ABORT_PCT/DISK_ABORT_GB, defaults 20% / 10 GB) and warns at the warn thresholds (DISK_WARN_PCT/DISK_WARN_GB, defaults 30% / 50 GB), and cleanup_stale_locks() removes orphaned lock files. Pre-flight failures (unwritable destination, missing scratch dir, low disk) trigger an email report when notifications are enabled.
- Per-VM validation: pre_backup_hook() excludes VMs with policy=never, validate_backup_state() runs 5-phase state analysis, and prepare_backup_directory() cleans stale state.
- Backup execution: perform_backup() retries with AUTO→FULL conversion on failure, archives broken chains, re-validates, and retries.
- Signal handling: SIGTERM/SIGINT release locks and log recovery guidance. The next run auto-recovers any stale state left behind.
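As a rough illustration of the disk-space stage, the threshold logic might look like the sketch below. The variable names and defaults come from the text; the helper name classify_free_space is hypothetical, and the rule that either threshold alone is enough to trip is an assumption, since the real check_disk_space() may combine them differently.

```shell
#!/usr/bin/env bash
# Defaults taken from the text; how the two thresholds combine is an assumption.
DISK_ABORT_PCT=${DISK_ABORT_PCT:-20}
DISK_ABORT_GB=${DISK_ABORT_GB:-10}
DISK_WARN_PCT=${DISK_WARN_PCT:-30}
DISK_WARN_GB=${DISK_WARN_GB:-50}

# classify_free_space FREE_PCT FREE_GB   (hypothetical helper)
# Prints "abort", "warn" or "ok". Assumes either threshold alone trips the level.
classify_free_space() {
    local free_pct=$1 free_gb=$2
    if (( free_pct < DISK_ABORT_PCT || free_gb < DISK_ABORT_GB )); then
        echo abort
    elif (( free_pct < DISK_WARN_PCT || free_gb < DISK_WARN_GB )); then
        echo warn
    else
        echo ok
    fi
}
```

With the defaults, classify_free_space 25 500 prints warn and classify_free_space 15 500 prints abort.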
| Problem | Detection | Auto-Recovery |
|---|---|---|
| Stale lock file | PID check | Delete if process dead |
| Orphaned QEMU checkpoint | No matching backup data | Delete checkpoint metadata |
| Stale backup metadata | Metadata without data | Delete, force FULL |
| Broken checkpoint chain | Bitmap mismatch | Archive chain, force FULL |
| Incomplete backup | .partial files | Clean, force FULL |
| AUTO backup fails | Exit code | Convert to FULL, retry |
| FULL backup fails | Exit code | Retry with delay |
| Script interrupted | Signal handler | Next run recovers |
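The stale-lock row (PID check, delete if the process is dead) can be sketched as below. This is not the project's cleanup_stale_locks(); the function name remove_if_stale is hypothetical, and the sketch assumes each lock file stores the owning PID on its first line.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: assumes the lock file holds the owner's PID on line 1.
remove_if_stale() {
    local lock=$1 pid=""
    [[ -f $lock ]] || return 0
    read -r pid < "$lock" || true
    # kill -0 delivers no signal; it only tests whether the PID can be signalled.
    # Run as root (as vmbackup does), that is equivalent to "the process exists".
    if [[ $pid =~ ^[0-9]+$ ]] && kill -0 "$pid" 2>/dev/null; then
        return 1              # owner still alive: leave the lock alone
    fi
    rm -f -- "$lock"          # owner gone, or PID unreadable: reclaim the lock
}
```

Treating an empty or non-numeric PID as stale is a design choice of this sketch; a stricter variant could refuse to touch such files.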
vmbackup uses SGID (setgid) on backup directories so that every new file and subdirectory automatically inherits the backup group. No post-hoc chown or chgrp is needed on individual files.
| Layer | Mechanism | Effect |
|---|---|---|
| vmbackup.sh | umask 027 | Files: 640 (rw-r-----), Dirs: 750 (rwxr-x---) |
| vmbackup.sh | ensure_backup_path_sgid() | BACKUP_PATH gets root:backup 2750 on first run |
| vmbackup.sh | set_backup_permissions() | Applies root:backup + SGID to directories (recursive sweep) |
| systemd | UMask=0027 | Belt-and-suspenders with script umask |
| Makefile/dpkg | install -m 750/640 | Installed files are not world-accessible |
| postinst | chown -R root:backup /opt/vmbackup | Group ownership set on install/upgrade |
When a directory has the SGID bit set (mode 2750, shown as drwxr-s---):
- New files created inside inherit the directory's group (not the creating process's group)
- New subdirectories inherit both the group and the SGID bit — propagation is automatic
- Combined with umask 027, files are created as root:backup 640 and dirs as root:backup 2750
This means vmbackup only needs to set SGID once on BACKUP_PATH — all subsequent mkdir, touch, file redirects, and virtnbdbackup output automatically inherit backup group ownership.
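The inheritance can be observed in a throwaway directory. The sketch below assumes a Linux filesystem and GNU stat; the paths are illustrative and nothing here touches a real BACKUP_PATH.

```shell
#!/usr/bin/env bash
# Demonstration in a scratch directory; paths are illustrative only.
umask 027
demo=$(mktemp -d)
mkdir "$demo/backups"
chmod 2750 "$demo/backups"                  # set SGID once on the root

mkdir "$demo/backups/my-vm"                 # created afterwards, no chmod
touch "$demo/backups/my-vm/vda.full.data"

# GNU stat: %a = octal mode, %G = group name, %n = path
stat -c '%a %G %n' "$demo/backups" "$demo/backups/my-vm" \
    "$demo/backups/my-vm/vda.full.data"
```

On a typical Linux system both directories should report mode 2750 and the file 640, all with the same group as the top-level directory.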
See BACKUP_PATH Setup in the Installation section. In short: mkdir -p the directory and vmbackup handles the rest.
On first run, ensure_backup_path_sgid() detects the directory lacks SGID or backup group ownership and automatically applies root:backup 2750. This runs before init_logging() so all subdirectories (_state/, _state/logs/, _state/temp/) are created under the correct group.
See User & Group Setup in the Installation section for setup instructions.
| Group | Purpose | Grant command |
|---|---|---|
| backup | Read access to all backups, logs, configs, and scripts | sudo usermod -aG backup <username> |
| libvirt | Read-only virsh access (VM status, virsh list, etc.) | sudo usermod -aG libvirt <username> |
vmbackup.sh runs as root via systemd (required for libvirt, virtnbdbackup, and TPM access). All child processes — including virtnbdbackup — inherit the umask 027. Because BACKUP_PATH and its subdirectories have SGID set, all new files are created as root:backup 640 automatically.
Two functions handle permissions:
ensure_backup_path_sgid() — runs once at the start of main(), before init_logging():
- Checks if BACKUP_PATH has backup group ownership and SGID bit
- If not, applies chown root:backup and chmod 2750
- Ensures all subsequent directory creation inherits the correct group
- Logs to stderr since the logging system isn't initialised yet
set_backup_permissions() — the recursive sweep and safety net:
| Mode | Invocation | Behaviour |
|---|---|---|
| Single-target | set_backup_permissions "/path" | chown root:backup + chmod g+s on the named path |
| Recursive | set_backup_permissions "/path" --recursive | Uses find to apply chown root:backup and chmod g+s to the entire tree |
Recursive mode excludes tpm-state/ directories (TPM private keys must stay root:root 600):
find "$target_path" \
-not -path '*/tpm-state/*' -not -path '*/tpm-state' \
-exec chown root:backup {} +
find "$target_path" -type d \
-not -path '*/tpm-state/*' -not -path '*/tpm-state' \
-exec chmod g+s {} +
The function is a no-op if the backup group does not exist on the system (checked via getent group backup). All operations suppress errors (2>/dev/null || true) so NFS or filesystem permission failures do not abort the backup.
File modes are never changed by set_backup_permissions() — modes are controlled exclusively by:
- umask 027 (set at script top, inherited by all child processes including virtnbdbackup)
- Explicit chmod calls for security-sensitive files (e.g., chmod 600 for BitLocker keys, chmod 640 for TPM metadata)
- chmod g-s on tpm-state/ directories to strip SGID (TPM keys must not inherit backup group)
SGID handles group inheritance automatically. set_backup_permissions() is only called in 5 places as bootstrap and safety nets:
| Source File | Function | Target | Mode | Purpose |
|---|---|---|---|---|
| vmbackup.sh | main() → ensure_backup_path_sgid() | BACKUP_PATH | Single | First-run bootstrap |
| vmbackup.sh | init_logging() | STATE_DIR, log dir, TEMP_DIR | Single | Bootstrap before recursive sweep |
| vmbackup.sh | check_backup_destination() | BACKUP_PATH (full tree) | Recursive | Session-start sweep: catches anything created outside vmbackup |
| vmbackup.sh | perform_backup() | Per-VM backup dir | Recursive | Safety net after virtnbdbackup (external tool) |
| modules/tpm_backup_module.sh | TPM backup functions | tpm-state/ dirs | chmod g-s | Strip SGID so TPM keys stay root:root |
No modules or library files call set_backup_permissions(). All file/directory ownership in modules is handled by SGID inheritance from the parent directory.
| Path | Owner:Group | Mode | Contents |
|---|---|---|---|
| /opt/vmbackup/ | root:backup | 750 | Install tree |
| /opt/vmbackup/vmbackup.sh | root:backup | 750 | Main script |
| /opt/vmbackup/modules/*.sh | root:backup | 640 | Business logic modules |
| /opt/vmbackup/lib/*.sh | root:backup | 640 | Shared libraries |
| /opt/vmbackup/transports/*.sh | root:backup | 750 | Transport drivers |
| /opt/vmbackup/config/default/*.conf | root:backup | 640 | Instance config (conffiles) |
| /var/log/vmbackup/ | root:backup | 750 | Log directory |
| /run/vmbackup/ | root:backup | 750 | Lock files |
| $BACKUP_PATH/ | root:backup | 2750 | Backup data root (SGID) |
| $BACKUP_PATH/<vm>/ | root:backup | 2750 | Per-VM backup dirs (SGID inherited) |
| $BACKUP_PATH/<vm>/config/ | root:backup | 2750 | VM libvirt XML snapshots |
| $BACKUP_PATH/<vm>/.chain-*/ | root:backup | 2750 | Archived chain dirs |
| $BACKUP_PATH/<vm>/tpm-state/ | root:root | 750 | TPM state root (SGID stripped) |
| $BACKUP_PATH/<vm>/tpm-state/tpm2/ | root:root | 600 | TPM private keys |
| $BACKUP_PATH/<vm>/tpm-state/BACKUP_METADATA.txt | root:root | 640 | TPM backup metadata |
| $BACKUP_PATH/<vm>/tpm-state/bitlocker-recovery-keys.txt | root:root | 600 | BitLocker recovery keys |
| $BACKUP_PATH/_state/ | root:backup | 2750 | State directory root |
| $BACKUP_PATH/_state/backups/ | root:backup | 2750 | Backup state files |
| $BACKUP_PATH/_state/locks/ | root:backup | 2750 | Per-VM lock files |
| $BACKUP_PATH/_state/email/ | root:backup | 2750 | Email debug output |
TPM state files and BitLocker recovery keys contain sensitive cryptographic material and receive special handling that overrides the SGID permission model:
- TPM state directories (tpm-state/): After creation, SGID is explicitly stripped with chmod g-s so the directory does not propagate backup group ownership. Contents remain root:root.
- TPM private keys (tpm-state/tpm2/): Backed up via sudo cp from swtpm directories (tss:tss owned). After copy, files remain root:root 600; the set_backup_permissions --recursive call on the parent VM dir explicitly excludes tpm-state/ via find -not -path filters.
- TPM metadata (tpm-state/BACKUP_METADATA.txt): Written with explicit chmod 640. It is group-readable because it contains only informational text (swtpm version, backup timestamp, recovery instructions), not key material.
- BitLocker recovery keys: Written with explicit chmod 600 and chown root:root, so they remain root-only even within the backup group-readable tree.
The net result is that a user in the backup group can browse the backup tree, read VM configs and logs, but cannot read TPM private keys or BitLocker recovery keys.
Each config instance defines its own BACKUP_PATH. The .deb package does not create BACKUP_PATH — the user creates it with mkdir -p and vmbackup applies SGID automatically on first run via ensure_backup_path_sgid().
The following files are declared as conffiles in the .deb package. dpkg will not overwrite them on upgrade if they have been modified:
- /opt/vmbackup/config/default/vmbackup.conf
- /opt/vmbackup/config/default/vm_overrides.conf
- /opt/vmbackup/config/default/email.conf
- /opt/vmbackup/config/default/exclude_patterns.conf
- /opt/vmbackup/config/default/fstrim_exclude.conf
- /opt/vmbackup/config/default/replication_cloud.conf
- /opt/vmbackup/config/default/replication_local.conf
- /etc/apparmor.d/local/abstractions/libvirt-qemu
User-created config instances (e.g., config/prod/) are not managed by dpkg and are never touched during upgrade.
When BACKUP_PATH is on an NFS mount, be aware of root_squash (the NFS default):
| NFS Export Option | Effect on vmbackup |
|---|---|
| root_squash (default) | All chgrp backup calls silently fail; files are owned by nobody:nogroup |
| no_root_squash | chgrp backup works normally, giving correct root:backup ownership |
If your backup destination is NFS, you must export it with no_root_squash for the security model to function. Example NFS server export:
/mnt/backups 10.0.0.0/24(rw,sync,no_subtree_check,no_root_squash)
Without no_root_squash, vmbackup will still run successfully but all files will be owned by nobody:nogroup and the backup group access model provides no benefit.
For local replication destinations on NFS, the same applies. See replication_local.conf for per-destination notes.
The following paths within BACKUP_PATH are intentionally excluded from the backup group ownership model and remain root:root:
| Path | Mode | Contents | Reason |
|---|---|---|---|
| */tpm-state/tpm2/* | 600 | TPM private key material (tpm2-00.permall) | Encryption keys |
| */tpm-state/BACKUP_METADATA.txt | 640 | TPM backup metadata and recovery instructions | Non-sensitive (group-readable) |
flowchart TD
A["backup_vm() calls<br/>backup_vm_tpm()"] --> B{"Has TPM?<br/>(virsh dumpxml)"}
B -->|No| C["Return<br/>(non-fatal)"]
B -->|Yes| D["Copy from<br/>/var/lib/libvirt/swtpm/uuid/"]
D --> E["Create tpm-state/<br/>+ BACKUP_METADATA.txt"]
E --> F{"Deduplication<br/>check"}
F -->|Changed| G["Keep full copy"]
F -->|Unchanged| H["Create symlink<br/>to previous"]
| File | Location | Purpose |
|---|---|---|
| tpm2-* | <backup>/tpm-state/ | Raw TPM state files |
| BACKUP_METADATA.txt | <backup>/tpm-state/ | Recovery instructions |
| .tpm-backup-marker | <backup>/ | Restore identification |
| bitlocker-recovery-keys.txt | <backup>/tpm-state/ | BitLocker recovery keys (when present) |
When backing up a Windows VM with a virtual TPM, vmbackup automatically extracts BitLocker recovery keys from the running guest via the QEMU guest agent. This ensures recovery keys are available even if the TPM state becomes unusable after restore.
All of the following must be true for extraction to occur:
- BITLOCKER_KEY_EXTRACTION=yes (default)
- VM is running
- QEMU guest agent is installed and responsive inside the guest
- Guest OS is Windows (detected via guest-get-osinfo)
- At least one volume has Protection Status: Protection On
If any condition is not met, extraction is silently skipped — it never blocks the backup.
The extraction uses the QEMU guest agent's guest-exec command to run manage-bde.exe inside the Windows guest. First, manage-bde -status identifies volumes with BitLocker Protection On. For each protected volume, manage-bde -protectors -get retrieves the recovery key. All output is written to tpm-state/bitlocker-recovery-keys.txt as raw manage-bde output (not parsed), preserving protector IDs, backup types, and PCR validation profiles.
Each guest-exec call is a three-step async protocol: launch (returns PID), poll until "exited": true, and base64-decode the response. The _guest_exec_capture() helper encapsulates this with JSON escaping, timeout polling, and error handling.
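The launch, poll and decode cycle can be sketched as below. This is not the project's _guest_exec_capture(): the names guest_exec_sketch, json_escape, virsh_agent and the AGENT_CMD indirection are hypothetical, the JSON is picked apart with sed rather than a real parser, and only a single argument is passed. The guest-exec and guest-exec-status commands and the virsh qemu-agent-command subcommand are the standard QEMU guest agent and libvirt interfaces.

```shell
#!/usr/bin/env bash
# Sketch only. AGENT_CMD is a hypothetical hook so the transport can be stubbed.
AGENT_CMD=${AGENT_CMD:-virsh_agent}

virsh_agent() {                      # $1 = domain, $2 = JSON request
    virsh qemu-agent-command "$1" "$2"
}

json_escape() {                      # escapes backslashes and double quotes only
    local s=$1
    s=${s//\\/\\\\}
    s=${s//\"/\\\"}
    printf '%s' "$s"
}

# guest_exec_sketch DOMAIN PROGRAM ARG [TIMEOUT_SECONDS]
guest_exec_sketch() {
    local domain=$1 path arg timeout=${4:-30} reply pid status waited=0
    path=$(json_escape "$2")
    arg=$(json_escape "$3")

    # Step 1: launch. The agent replies with {"return":{"pid":N}}.
    reply=$("$AGENT_CMD" "$domain" "{\"execute\":\"guest-exec\",\"arguments\":{\"path\":\"$path\",\"arg\":[\"$arg\"],\"capture-output\":true}}") || return 1
    pid=$(sed -n 's/.*"pid"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' <<<"$reply")
    [[ -n $pid ]] || return 1

    # Step 2: poll guest-exec-status until "exited": true or the timeout expires.
    while (( waited < timeout )); do
        status=$("$AGENT_CMD" "$domain" "{\"execute\":\"guest-exec-status\",\"arguments\":{\"pid\":$pid}}") || return 1
        if grep -Eq '"exited"[[:space:]]*:[[:space:]]*true' <<<"$status"; then
            # Step 3: captured stdout arrives base64-encoded in "out-data".
            sed -n 's/.*"out-data"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' <<<"$status" | base64 -d
            return 0
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
    return 1                         # timed out
}
```

A call such as guest_exec_sketch win11 'C:\Windows\System32\manage-bde.exe' '-status' 30 would then print the decoded output, with the timeout playing the role of BITLOCKER_EXEC_TIMEOUT.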
- Permissions: root:root 600 (only root can read or write)
- Overwritten on every successful extraction, so it always reflects the latest backup run
- Lifecycle inherits from tpm-state/ (archived, pruned, and replicated with it)
After restoring a Windows VM, BitLocker may enter recovery mode if the VM has a new UUID, the TPM state is missing or corrupted, the virtual hardware changed, or BitLocker detects a Secure Boot policy change (PCR 7/11 mismatch). Windows will prompt for the 48-digit Numerical Password from the recovery key file.
| Variable | Default | Purpose |
|---|---|---|
| BITLOCKER_KEY_EXTRACTION | yes | Enable/disable BitLocker key extraction |
| BITLOCKER_EXEC_TIMEOUT | 30 | Timeout (seconds) for each guest-exec command |
| Scenario | Behavior |
|---|---|
| VM shut off | Skipped — guest agent unreachable |
| No QEMU agent installed | Skipped — guest-get-osinfo fails |
| Linux guest | Skipped — os_id != mswindows |
| BitLocker not configured | Skipped — no volumes with Protection On |
| BitLocker suspended (Protection Off) | Skipped — no file written |
| Multiple encrypted volumes | All protected volumes extracted |
| manage-bde fails | Logged as WARN, returns 0 |
| Agent timeout | Respects BITLOCKER_EXEC_TIMEOUT, fails gracefully |
vmbackup.sh uses SQLite as the sole logging backend. This provides:
- Structured relational data for complex queries
- Session-level tracking with unique IDs
- VM-to-replication association for audit trails
- Faster querying for large backup histories
- Chain health tracking for restoration validation
- Complete exit path coverage - all backup outcomes logged
- Event tables for chain, period, file, retention, and config events
- Restore-session tracking: vmrestore records every restore to the same catalogue (schema v2.2, restore_sessions table); query via --status --restores
${BACKUP_PATH}/_state/vmbackup.db
Each backup instance maintains its own SQLite database.
Misplaced-database guard (0.6.0): vmbackup refuses to create the catalogue anywhere other than _state/ — specifically it will not initialise a database inside an .archives/ directory or under a period directory. A misconfigured or relocated BACKUP_PATH that pointed at a sub-tree could previously spawn a second catalogue that diverged silently from the canonical one, so chain state and status reports disagreed depending on where the tool was run. The guard fails loudly rather than creating a divergent database, keeping the single-source-of-truth invariant intact.
All timestamps stored in the database use UTC (YYYY-MM-DD HH:MM:SS). Log output uses local time with a timezone suffix.
| Context | Convention | Example |
|---|---|---|
| DB writes (sqlite_module.sh) | date -u '+%Y-%m-%d %H:%M:%S' | 2026-02-16 08:30:00 |
| DB reads via date -d | Append " UTC" suffix | date -d "2026-02-16 08:30:00 UTC" +%s |
| Session ID | date +%s (epoch, always UTC) | 1739692203 |
| Log timestamps (log_msg) | date '+%Y-%m-%d %H:%M:%S %Z' (local + TZ) | 2026-02-16 19:30:00 AEDT |
| Rotation period_id | date '+%Y%m%d' etc. (local, intentional) | 20260216 |
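The write and read conventions in the table can be exercised with two small helpers. The function names db_now and db_ts_to_epoch are hypothetical, and the sketch assumes GNU date (for -d).

```shell
#!/usr/bin/env bash
# Requires GNU date. Function names are hypothetical.

# Timestamp in the form stored in the database (always UTC).
db_now() {
    date -u '+%Y-%m-%d %H:%M:%S'
}

# Convert a stored timestamp back to epoch seconds. Without the " UTC"
# suffix, date -d would interpret the string in the local timezone.
db_ts_to_epoch() {
    date -d "$1 UTC" +%s
}
```

db_ts_to_epoch '2026-02-16 08:30:00' prints 1771230600 whatever TZ is set to, which is the point of appending the suffix on reads.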
Databases are migrated automatically to the current schema (v2.2) on first write; very old (pre-1.6) databases written in local time are converted to UTC as part of that migration. Downgrade is not supported.
The SQLite module is loaded automatically by vmbackup.sh if:
- The lib/sqlite_module.sh file exists
- The sqlite3 command is available
If sqlite3 is not installed, backup operations continue normally without SQLite logging.
# Manual module loading (if needed)
source "$SCRIPT_DIR/lib/sqlite_module.sh"
sqlite_init_database
The schema and all table definitions live in lib/sqlite_module.sh (currently schema v2.2, which added the restore_sessions table for vmrestore logging). Use sqlite3 vmbackup.db ".schema" to inspect the current schema, or sqlite_get_schema_version to check the version.
restore_point_id shape change (0.6.0): with the per-VM chain-manifest subsystem retired, the restore_point_id for newly written rows changes shape from <vm>:<period>:chain-YYYY-MM-DD-HHMMSS:N to <vm>:<period>:chain-YYYY-MM-DD:N, aligning the identifier with the on-disk archive directory naming. This is operator-visible only in --status output; no code parses the identifier, and existing rows are left unchanged.
Most of these queries are available directly from the command line via --status — see Status Reports. The raw SQL below is useful for custom queries, ad-hoc investigation, or integration with external tools.
All queries target the SQLite database at $BACKUP_PATH/_state/vmbackup.db. Set the path once:
DB="$BACKUP_PATH/_state/vmbackup.db"
All queries return tabular data suitable for sqlite3 -header -column (human-readable) or sqlite3 -separator '|' (pipe-delimited, for shell parsing).
Current session status.
sqlite3 -header -column "$DB" "
SELECT
date(s.start_time) as date,
s.instance,
s.status as session_status,
s.vms_total,
s.vms_success,
s.vms_failed,
s.vms_skipped,
s.vms_excluded,
COALESCE(s.bytes_total,0) as bytes_total
FROM sessions s
WHERE date(s.start_time) = date('now')
ORDER BY s.start_time DESC
LIMIT 5;"
Wrapper function: sqlite_query_today_sessions
Per-VM drill-down — last N backup runs with status, size, duration.
sqlite3 -header -column "$DB" "
SELECT s.start_time, vb.backup_type, vb.status,
vb.bytes_written, vb.duration_sec, vb.restore_points
FROM vm_backups vb
JOIN sessions s ON vb.session_id = s.id
WHERE vb.vm_name = 'web-server'
ORDER BY s.start_time DESC
LIMIT 20;"
Wrapper function: sqlite_query_vm_history "web-server" 20
Recent restores logged by vmrestore — the same data surfaced by vmbackup --status --restores.
sqlite3 -header -column "$DB" "
SELECT start_time, vm_name, restore_mode, status,
exit_code, dry_run, duration_sec, target_path
FROM restore_sessions
WHERE start_time >= datetime('now', '-30 days')
ORDER BY start_time DESC
LIMIT 20;"
Wrapper function: sqlite_query_recent_restores 30
Aggregated monthly stats.
sqlite3 -header -column "$DB" "
SELECT
vm_name,
COUNT(*) as total_runs,
SUM(CASE WHEN status='success' THEN 1 ELSE 0 END) as success,
SUM(CASE WHEN status='failed' THEN 1 ELSE 0 END) as failed,
SUM(CASE WHEN status='skipped' THEN 1 ELSE 0 END) as skipped,
SUM(CASE WHEN status='excluded' THEN 1 ELSE 0 END) as excluded,
COALESCE(SUM(bytes_written),0) as total_bytes,
MAX(restore_points) as peak_restore_points,
ROUND(AVG(duration_sec),1) as avg_duration_sec,
CASE
WHEN SUM(CASE WHEN status='failed' THEN 1 ELSE 0 END) > 0 THEN 'DEGRADED'
ELSE 'HEALTHY'
END as health
FROM vm_backups
WHERE created_at >= '2026-02-01' AND created_at < '2026-03-01'
GROUP BY vm_name
ORDER BY vm_name;"
VMs with failures in the last N days, ideal for alert panels.
sqlite3 -header -column "$DB" "
SELECT vm_name, COUNT(*) as failures,
MAX(s.start_time) as last_failure,
GROUP_CONCAT(DISTINCT COALESCE(vb.error_code,'unknown')) as error_types
FROM vm_backups vb
JOIN sessions s ON vb.session_id = s.id
WHERE vb.status = 'failed'
AND s.start_time >= date('now', '-7 days')
GROUP BY vm_name
ORDER BY failures DESC;"
Wrapper function: sqlite_query_recent_failures 7
Aggregated stats for any period — good for weekly/monthly dashboards.
sqlite3 -header -column "$DB" "
SELECT
COUNT(DISTINCT s.id) as sessions,
SUM(CASE WHEN vb.status='success' THEN 1 ELSE 0 END) as successful,
SUM(CASE WHEN vb.status='failed' THEN 1 ELSE 0 END) as failed,
SUM(vb.bytes_written) as total_bytes,
ROUND(AVG(vb.duration_sec),1) as avg_duration_sec
FROM vm_backups vb
JOIN sessions s ON vb.session_id = s.id
WHERE date(s.start_time) BETWEEN '2026-02-01' AND '2026-02-28';"
Current session replication results, one row per endpoint.
sqlite3 -header -column "$DB" "
SELECT rr.endpoint_name, rr.endpoint_type, rr.transport,
rr.status, rr.bytes_transferred, rr.files_transferred,
rr.duration_sec
FROM replication_runs rr
JOIN sessions s ON rr.session_id = s.id
WHERE date(s.start_time) = date('now')
ORDER BY rr.endpoint_type, rr.endpoint_name;"
Wrapper function: sqlite_query_today_replications
Restorable chains per VM — critical for restoration planning.
sqlite3 -header -column "$DB" "
SELECT vm_name, period_id, chain_status, total_checkpoints,
restorable_count, chain_location, last_backup
FROM chain_health
WHERE chain_status IN ('active','archived')
ORDER BY vm_name, period_id;"
Wrapper function: sqlite_get_restorable_chains "web-server"
Per-day storage written.
sqlite3 -header -column "$DB" "
SELECT date(s.start_time) as day,
COUNT(*) as backups,
SUM(vb.bytes_written) as bytes_written,
ROUND(AVG(vb.duration_sec),1) as avg_sec
FROM vm_backups vb
JOIN sessions s ON vb.session_id = s.id
WHERE s.start_time >= date('now','-30 days')
GROUP BY day
ORDER BY day;"
Cleanup events: what was deleted and why.
sqlite3 -header -column "$DB" "
SELECT timestamp, vm_name, action, target_type,
target_path, freed_bytes, triggered_by
FROM retention_events
WHERE timestamp >= date('now','-7 days')
ORDER BY timestamp DESC;"
Quick health check: shows the last successful backup for every VM.
sqlite3 -header -column "$DB" "
SELECT vb.vm_name,
MAX(s.start_time) as last_success,
vb.backup_type,
vb.bytes_written
FROM vm_backups vb
JOIN sessions s ON vb.session_id = s.id
WHERE vb.status = 'success'
GROUP BY vb.vm_name
ORDER BY last_success DESC;"
The session summary accurately tracks VMs in four categories:
| Category | Symbol | Meaning |
|---|---|---|
| Backed Up | ✓ | VM was actually backed up |
| Excluded | ○ | VM excluded by policy (never) or pattern |
| Skipped | ◇ | Offline VM with unchanged disks |
| Failed | ✗ | Backup attempt failed |
| Code | Meaning |
|---|---|
| 0 | Backup completed successfully (or offline unchanged skip) |
| 1 | Backup failed with error |
| 2 | VM excluded by policy (BACKUP_RC_EXCLUDED, the only named constant) |
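A caller consuming these codes might branch on them as in the sketch below. BACKUP_RC_EXCLUDED is the constant named in the table; the function classify_backup_rc and its labels are hypothetical, and treating any code other than 0 or 2 as a failure is an assumption.

```shell
#!/usr/bin/env bash
# BACKUP_RC_EXCLUDED is named in the text; everything else here is hypothetical.
BACKUP_RC_EXCLUDED=2

# Map a per-VM return code to a coarse label.
classify_backup_rc() {
    case $1 in
        0)                     echo ok ;;        # backed up, or offline/unchanged skip
        "$BACKUP_RC_EXCLUDED") echo excluded ;;
        *)                     echo failed ;;    # 1 per the table; others assumed failures
    esac
}
```

Because code 0 covers both a real backup and an offline-unchanged skip, a caller needs additional state to tell the Backed Up and Skipped categories apart.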
╔══════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║ VM BACKUP SESSION SUMMARY ║
╠══════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ Total VMs: 10
║
║ ✓ Backed Up: 3 (daily: 0, weekly: 0, monthly: 3, accumulate: 0)
║ ○ Excluded: 5 (policy=never or pattern match)
║ ◇ Skipped: 2 (offline/unchanged)
║ ✗ Failed: 0
╠══════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ VM NAME │ STATUS │ TYPE │ POLICY │ DURATION │ CHKPTS │ SIZE │ ERROR
╠══════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ web-server │ SUCCESS │ auto │ monthly │ 00:05:12 │ 6 │ 1.2GiB │
║ database │ SUCCESS │ auto │ daily │ 00:12:34 │ 15 │ 4.5GiB │
║ dev-vm │ SUCCESS │ full │ monthly │ 00:23:45 │ 1 │ 8.9GiB │
║ template-win11 │ EXCLUDED │ n/a │ never │ 00:00:00 │ 0 │ N/A │
║ offline-vm │ SKIPPED │ n/a │ monthly │ 00:00:00 │ 3 │ N/A │
╚══════════════════════════════════════════════════════════════════════════════════════════════════════════╝
When running in --replicate-only mode, the session summary uses a simplified format with no VM table:
╔═══════════════════════════════════════════════════╗
║ REPLICATION-ONLY SESSION SUMMARY ║
╠═══════════════════════════════════════════════════╣
║ Mode: replicate-only (both)
║ Host: my-host
║ Config: default
║ Duration: 3m 22s
║
║ Local Replication: OK (2 destinations)
║ Cloud Replication: FAILED (1 destination)
║
║ Status: FAILED
╚═══════════════════════════════════════════════════╝
The email report module (email_report_module.sh) generates plaintext reports and sends them via msmtp. It sources backup, replication and chain data primarily from the SQLite database using session-scoped queries, supplemented by direct filesystem reads for storage figures (df/du) and the session-filtered log attachment.
Data sources:
| Data | Source | Function |
|---|---|---|
| VM backup details | SQLite vm_backups table | sqlite_query_session_vm_backups() |
| Subject line counts | SQLite sessions table | sqlite_query_session_summary() |
| Replication details | SQLite replication_runs table | sqlite_query_session_replication() |
| Chain health | SQLite chain_health table | sqlite_chain_health_summary() |
| Storage info | Filesystem (df/du) | Direct |
| Log attachment | vmbackup.log (session-filtered) | get_todays_log() |
Subject Line:
- Success: VM Backup - 3 backed up, 5 excluded - OK
- Failure: VM Backup - 3 backed up, 5 excluded, 1 FAILED
Body Structure:
VM Backup Report
================
Host: <hostname>
Date: YYYY-MM-DD
Started: HH:MM:SS
Finished: HH:MM:SS
Duration: Xh Ym Zs
Total VMs: 10
✓ Backed Up: 3
○ Excluded: 5 (policy=never)
◇ Skipped: 2 (offline unchanged)
✗ Failed: 0
Total Written: X.X GiB
────────────────────────────────
VM: <vm-name>
────────────────────────────────
Status: SUCCESS
OS: <detected OS>
VM State: running
Backup Type: incremental
Consistency: Agent
Duration: Xm Ys
...
| Variable | Description |
|---|---|
| EMAIL_RECIPIENT | Destination email address |
| EMAIL_SENDER | From address |
| EMAIL_HOSTNAME | Host identifier in reports |
When running in --replicate-only mode, the email report uses a different subject and simplified body:
Subject Line:
- Success: Replication Only — hostname — OK
- Failure: Replication Only — hostname — FAILED
Body: Contains only the replication results sections (local and/or cloud). No VM backup details. A footer line reads "No backups were performed — this was a replication-only session."
| Cause | Detection | Recovery |
|---|---|---|
| systemd timeout | SIGTERM | Next run cleans up |
| User kill (Ctrl+C) | SIGINT | Next run cleans up |
| OOM killer | Process death | Next run cleans up |
| System reboot | No signal | Next run cleans up |
- Stale lock files (vmbackup-<vm>.lock)
- Orphaned QEMU checkpoints
- Partial .data files
- Orphaned metadata (*.copy.qcow.json)
SIGINT/SIGTERM during fstrim or VSS pauses previously triggered the interrupt handler, which would falsely mark the backup chain as broken even though no backup was in progress. A _BACKUP_IN_PROGRESS flag now guards against this:
- Set to 1 when perform_backup() begins
- Cleared to 0 when perform_backup() completes
- _log_interrupted_chain() only marks chain broken when _BACKUP_IN_PROGRESS=1
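The guard can be sketched as follows. The function and flag names match the text, but the bodies are stand-ins: CHAIN_BROKEN is a hypothetical variable replacing whatever the real handler records, and perform_backup here simply runs the command it is given.

```shell
#!/usr/bin/env bash
# Sketch only: bodies are stand-ins and CHAIN_BROKEN is a hypothetical variable.
_BACKUP_IN_PROGRESS=0
CHAIN_BROKEN=0

_log_interrupted_chain() {
    # Only an interrupt that lands mid-backup can have damaged the chain.
    if (( _BACKUP_IN_PROGRESS == 1 )); then
        CHAIN_BROKEN=1
    fi
}

perform_backup() {
    local rc=0
    _BACKUP_IN_PROGRESS=1
    "$@" || rc=$?            # stand-in for the real backup work
    _BACKUP_IN_PROGRESS=0
    return "$rc"
}

# An interrupt during fstrim or a VSS pause arrives while the flag is 0,
# so the handler runs but leaves the chain state untouched.
trap '_log_interrupted_chain; exit 1' INT TERM
```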
VMs without a QEMU agent are paused before backup and resumed after. If the process is interrupted between pause and resume, the VM remains paused. This applies to all interrupt types (SIGTERM, SIGINT, SIGKILL, OOM, reboot). The VM must be resumed manually with virsh resume <vm>.
On next run:
1. cleanup_stale_locks() - Remove locks with dead PIDs
2. cleanup_orphaned_checkpoints() - Remove checkpoints without data
3. validate_backup_state() - Detect incomplete state
4. prepare_backup_directory() - Clean partial files
5. Proceed with FULL backup
All paths below are relative to $BACKUP_PATH/ (instance-specific backup directory).
| File | Location | Purpose |
|---|---|---|
| vda.full.data | <vm>/ | Full backup image |
| vda.copy.data | <vm>/ | Offline copy backup |
| vda.inc.virtnbdbackup.N.data | <vm>/ | Incremental N |
| <vm>.cpt | <vm>/ | Checkpoint list (JSON) |
| virtnbdbackup.N.xml | <vm>/checkpoints/ | Checkpoint metadata |
| File | Location | Purpose |
|---|---|---|
| .agent-status | <vm>/ | Cached agent availability |
| .full-backup-month | <vm>/ | Month of last full |
| vmbackup.db | _state/ | SQLite logging database |
| .last_month | _state/ | Month boundary detection file |
| cloud_replication_state.txt | _state/ | Cloud replication tracking (invalidated at session start, regenerated when cloud replication runs) |
| local_replication_state.txt | _state/ | Local replication tracking (invalidated at session start, regenerated when local replication runs) |
| vmbackup-<vm>.lock | _state/locks/ | Per-VM lock file |
vmbackup creates extensive logs across several directories. Per-VM and replication logs are automatically cleaned up by cleanup_old_logs() after LOG_KEEP_DAYS (default 30). The central vmbackup.log and vmprune.log files are rotated when they exceed LOG_MAX_BYTES (default 50 MiB) — the oversized file is renamed to <name>.<epoch> and a fresh empty file is created. Rotated files then age out via the same LOG_KEEP_DAYS rule.
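The size-based rotation step could look like the sketch below. LOG_MAX_BYTES and its default come from the text; the function name rotate_if_oversized is hypothetical, and the group ownership change the real code applies is omitted because it needs root.

```shell
#!/usr/bin/env bash
LOG_MAX_BYTES=${LOG_MAX_BYTES:-52428800}    # 50 MiB, default from the text

# rotate_if_oversized FILE   (hypothetical name)
rotate_if_oversized() {
    local file=$1 size
    [[ -f $file ]] || return 0
    size=$(wc -c < "$file")
    if (( size > LOG_MAX_BYTES )); then
        mv -- "$file" "$file.$(date +%s)"   # becomes <name>.<epoch>
        : > "$file"                         # fresh empty file
        chmod 0640 "$file"
    fi
}
```

The renamed file keeps its modification time, so an age-based sweep keyed on LOG_KEEP_DAYS would later remove it without extra bookkeeping.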
| File | Location | Purpose | Accumulation |
|---|---|---|---|
| vmbackup.log | _state/logs/ | Main session log (appended each run) | Rotated at LOG_MAX_BYTES; rotated files aged out via LOG_KEEP_DAYS |
| vmbackup.log.<epoch> | _state/logs/ | Previously-rotated session log | Deleted after LOG_KEEP_DAYS |
virtnbdbackup output is captured per-VM per-backup:
| Pattern | Location | Purpose |
|---|---|---|
| backup_<vm>_<epoch>.log | _state/logs/ | virtnbdbackup stdout/stderr for each backup |
Example: backup_my-vm_1739030406.log — log from backup run at Unix epoch 1739030406.
Each replication operation creates a timestamped log:
| Pattern | Location | Purpose |
|---|---|---|
| <endpoint>_<date>_<time>.log | _state/replication_logs/cloud/ | Cloud endpoint replication logs |
| <transport>_<date>_<time>.log | _state/replication_logs/local/ | Local transport logs |
Examples:
- `cloudprovider_20260212_020134.log`: Cloud upload log
- `local_20260212_020134.log`: Local rsync log
Debug output from email report generation:
| Pattern | Location | Purpose |
|---|---|---|
| `debug_<epoch>.log` | `_state/email/` | Email report generation debug |
| `db_query_<epoch>.log` | `_state/email/` | Database query debug output |
Daily state snapshots provide disaster recovery for the _state/ directory:
| Pattern | Location | Purpose | Rotation |
|---|---|---|---|
| `state-<YYYYMMDD>.tar.gz` | `_state/backups/` | Daily snapshot of critical state files | `STATE_BACKUP_KEEP_DAYS` (default: 90) |
Contents of state backups:
- `vmbackup.db`: SQLite database
- `.last_month`: Month boundary file
- `cloud_replication_state.txt`: Cloud tracking
- `local_replication_state.txt`: Local tracking
- `locks/`: Lock files (usually empty at backup time)
- `logs/`: All log files (vmbackup.log + per-VM logs)
- `replication_logs/`: Cloud and local replication logs
- `email/`: Email debug logs
Exclusions: backups/ directory and *.lock files are excluded from the archive.
Cleanup: State backups older than STATE_BACKUP_KEEP_DAYS are automatically purged.
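The archive step with its exclusions could be sketched as follows. This is a minimal sketch assuming GNU tar; the function name `make_state_backup` and its arguments are hypothetical and are not taken from vmbackup's source.

```bash
#!/usr/bin/env bash
# Hypothetical helper: archive a state directory into <state_dir>/backups/,
# skipping the backups/ directory itself and any *.lock files.
make_state_backup() {
    local state_dir="$1"
    local stamp="${2:-$(date +%Y%m%d)}"
    local archive="${state_dir}/backups/state-${stamp}.tar.gz"

    mkdir -p "${state_dir}/backups"
    # Member names are relative to the -C directory, so './backups' matches
    # the top-level backups/ directory and everything below it.
    tar -czf "$archive" -C "$state_dir" \
        --exclude='./backups' --exclude='*.lock' .
    printf '%s\n' "$archive"
}

# Example (not executed here): make_state_backup /mnt/backup/vms/_state
```

Writing the archive into the excluded `backups/` directory keeps tar from trying to include its own output.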
Log files are automatically cleaned up after being captured in state backups:
| Setting | Default | Purpose |
|---|---|---|
| `STATE_BACKUP_KEEP_DAYS` | 90 | Days to keep `state-*.tar.gz` archives |
| `LOG_KEEP_DAYS` | 30 | Days to keep live log files |
| `LOG_MAX_BYTES` | 52428800 (50 MiB) | Size cap for central vmbackup.log / vmprune.log before rotation |
Cleanup behavior (cleanup_old_logs()):
- Invoked once per calendar day at session start (gated by the `${STATE_DIR}/.last_rotation` sentinel)
- Size-based pre-rotation: if `vmbackup.log` or `vmprune.log` exceeds `LOG_MAX_BYTES`, the file is renamed to `<name>.<epoch>` and a fresh empty file is created (root:backup, mode 0640)
- Daily state backup created (captures all current logs)
- Old state archives deleted (> `STATE_BACKUP_KEEP_DAYS`)
- Old log files deleted (> `LOG_KEEP_DAYS`): includes per-VM logs, replication logs, email debug logs, and rotated `vmbackup.log.<epoch>` / `vmprune.log.<epoch>` files
| Directory | Contents | Rotation |
|---|---|---|
| `_state/logs/` | vmbackup.log + vmprune.log + per-VM logs | `LOG_MAX_BYTES` (central, size-based) + `LOG_KEEP_DAYS` (per-VM + rotated central) |
| `_state/replication_logs/cloud/` | Cloud endpoint logs | `LOG_KEEP_DAYS` |
| `_state/replication_logs/local/` | Local transport logs | `LOG_KEEP_DAYS` |
| `_state/email/` | Email debug files | `LOG_KEEP_DAYS` |
| `_state/backups/` | State backup archives | `STATE_BACKUP_KEEP_DAYS` |
Note: Central `vmbackup.log` and `vmprune.log` are held open by the running session via append-redirect. When rotation triggers, the running session continues writing to the renamed `.epoch` file; the next session opens the fresh empty file. This is race-free by construction (no `copytruncate` window).
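The rename-based rotation described above can be illustrated with a short sketch. The helper name `rotate_if_oversized` and its arguments are hypothetical; ownership handling (root:backup) is left out because it needs root.

```bash
#!/usr/bin/env bash
# Hypothetical helper: rotate a log by rename once it exceeds max_bytes.
# Prints the rotated file name when a rotation happened.
rotate_if_oversized() {
    local file="$1" max_bytes="$2"
    local epoch="${3:-$(date +%s)}"
    [ -f "$file" ] || return 0

    local size
    size=$(wc -c < "$file")
    if [ "$size" -gt "$max_bytes" ]; then
        # A rename keeps the inode, so a session that already has the file
        # open for append keeps writing to the renamed file.
        mv -- "$file" "${file}.${epoch}"
        : > "$file"            # fresh empty file for the next session
        chmod 0640 "$file"
        printf '%s\n' "${file}.${epoch}"
    fi
}

# Example (not executed here):
#   rotate_if_oversized "$STATE_DIR/logs/vmbackup.log" "$LOG_MAX_BYTES"
```

Because nothing is copied or truncated, no log lines can be lost between the two steps, which is the property the note above relies on.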
Replication runs after backup completes, copying backups to secondary storage. Two replication systems operate independently:
- Local Replication: rsync to any locally accessible path (local disk, NFS, virtiofs, pre-mounted CIFS, or any mounted filesystem)
- Cloud Replication: rclone to cloud storage (ships with SharePoint; extensible to any rclone-supported backend)
Both systems use a pluggable transport architecture. New endpoints can be added by implementing the corresponding transport contract — see Transport Function Contract for local transports and Cloud Transport Metrics Contract for cloud transports.
The REPLICATION_ORDER setting in vmbackup.conf controls execution:
| Value | Behavior | Use Case |
|---|---|---|
| `simultaneous` | Run local and cloud in parallel | Fastest, default |
| `local_first` | Complete local before starting cloud | Ensure local copy first |
| `cloud_first` | Complete cloud before starting local | Rare, offsite priority |
# In vmbackup.conf
REPLICATION_ORDER="simultaneous"

Note: The old `CLOUD_REPLICATION_WAIT_FOR_LOCAL` in replication_cloud.conf is deprecated. Use `REPLICATION_ORDER` in vmbackup.conf instead.
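One way to picture the three `REPLICATION_ORDER` values is a small dispatcher. This is a sketch only: `run_local_replication` and `run_cloud_replication` are stand-in names, and continuing with the second system after the first one fails is an assumption, not documented behaviour.

```bash
#!/usr/bin/env bash
# Stand-ins for the real replication entry points (hypothetical names).
run_local_replication() { echo "local"; }
run_cloud_replication() { echo "cloud"; }

# Dispatch on the REPLICATION_ORDER value; returns 1 if either side failed.
run_replication() {
    local order="${1:-simultaneous}" rc=0 local_pid cloud_pid
    case "$order" in
        simultaneous)
            run_local_replication &
            local_pid=$!
            run_cloud_replication &
            cloud_pid=$!
            wait "$local_pid" || rc=1
            wait "$cloud_pid" || rc=1
            ;;
        local_first)
            run_local_replication || rc=1
            run_cloud_replication || rc=1
            ;;
        cloud_first)
            run_cloud_replication || rc=1
            run_local_replication || rc=1
            ;;
        *)
            echo "unknown REPLICATION_ORDER: $order" >&2
            return 2
            ;;
    esac
    return "$rc"
}
```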
Syncs backups to secondary local/NAS storage via rsync.
Configuration: config/<instance>/replication_local.conf
Key Settings:
| Setting | Values | Description |
|---|---|---|
| `REPLICATION_ENABLED` | yes/no | Master enable |
| `REPLICATION_ON_FAILURE` | continue/abort | Error handling |
| `REPLICATION_SPACE_CHECK` | skip/warn/disabled | Pre-replication space check |
| `REPLICATION_MIN_FREE_PERCENT` | 0-100 | Minimum free space threshold |
| `DEST_N_TRANSPORT` | local (others pluggable) | Transport driver for destination N. Currently ships with local (rsync to any mounted path). Additional transports can be added; see Implementing a New Transport. |
| `DEST_N_SYNC_MODE` | mirror/accumulate | Delete behavior |
| `DEST_N_VERIFY` | none/size/checksum | Post-sync verification |
Space check behaviour: For mirror mode, the space check calculates the effective delta (source size minus existing destination data) rather than the total source size, since mirror only transfers changes. For accumulate mode, the full source size is used.
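The arithmetic behind the space check can be sketched as below. The function name `required_bytes` is hypothetical, and clamping a negative delta to zero is an assumption about how a destination larger than the source would be treated.

```bash
#!/usr/bin/env bash
# Hypothetical helper: bytes that must be free at the destination.
#   mirror     -> source size minus data already at the destination
#   accumulate -> full source size
required_bytes() {
    local mode="$1" source_bytes="$2" dest_bytes="$3"
    local delta
    case "$mode" in
        mirror)
            delta=$(( source_bytes - dest_bytes ))
            if (( delta < 0 )); then delta=0; fi   # assumption: never negative
            echo "$delta"
            ;;
        accumulate)
            echo "$source_bytes"
            ;;
        *)
            return 1
            ;;
    esac
}

# 100 GiB source, 80 GiB already mirrored: only 20 GiB must be free.
required_bytes mirror $(( 100 * 1024**3 )) $(( 80 * 1024**3 ))
```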
Key Functions:
| Function | Purpose | Returns |
|---|---|---|
| `load_local_replication_module()` | Load config, discover destinations | Sets `REPLICATION_ENABLED` |
| `replicate_batch()` | Sync all backups to all destinations | Status per destination |
| `get_replication_summary()` | Format summary for email | Text summary |
Uploads backups to cloud storage via rclone.
Configuration: config/<instance>/replication_cloud.conf
Authentication: For SharePoint, see Cloud Authentication (SharePoint) for setup instructions and the three-layer configurability model (site URL, document library, folder are all independently configurable).
Global Settings Reference:
| Setting | Values | Default | Description |
|---|---|---|---|
| `CLOUD_REPLICATION_ENABLED` | yes/no | no | Master enable |
| `CLOUD_REPLICATION_SCOPE` | everything/archives-only/monthly | everything | What to upload |
| `CLOUD_REPLICATION_SYNC_MODE` | mirror/accumulate-all/accumulate-valid | mirror | Delete behavior |
| `CLOUD_REPLICATION_POST_VERIFY` | none/size/checksum | checksum | Post-upload verification |
| `CLOUD_REPLICATION_ON_FAILURE` | continue/abort | continue | Error handling |
| `CLOUD_REPLICATION_DEFAULT_BWLIMIT` | KB/s | 0 | Bandwidth limit |
| `CLOUD_REPLICATION_DRY_RUN` | yes/no | no | Test mode |
Per-Destination Overrides:
| Setting | Purpose |
|---|---|
| `CLOUD_DEST_N_ENABLED` | Enable this destination |
| `CLOUD_DEST_N_NAME` | Human-readable name |
| `CLOUD_DEST_N_PROVIDER` | sharepoint/backblaze/wasabi |
| `CLOUD_DEST_N_REMOTE` | rclone remote name |
| `CLOUD_DEST_N_PATH` | Path within remote |
| `CLOUD_DEST_N_SCOPE` | Override global scope |
| `CLOUD_DEST_N_SYNC_MODE` | Override global sync mode |
| `CLOUD_DEST_N_BWLIMIT` | Destination-specific limit |
| `CLOUD_DEST_N_VERIFY` | Override global verification |
| `CLOUD_DEST_N_MAX_SIZE` | Provider's max file size limit |
| `CLOUD_DEST_N_SECRET_EXPIRY` | Credential expiry date (YYYY-MM-DD) |
Pluggable drivers for local replication. The replication module loads the appropriate driver based on DEST_N_TRANSPORT and calls a standard set of functions. See Implementing a New Transport below for the full contract.
Current status: Only `transport_local.sh` is production-ready. The SSH and SMB transport files are scaffolds with the correct function signatures and metrics contract but no implementation; they return `exit 1` if loaded. They exist as starting points for anyone who needs those transports. The local transport covers any destination that appears as a mounted filesystem, which in practice includes NFS, virtiofs, and pre-mounted CIFS/SMB shares.
| Driver | Status | Purpose |
|---|---|---|
| `transport_local.sh` | Production | Any mounted filesystem (local disk, NFS, virtiofs, pre-mounted CIFS) |
| `transport_ssh.sh` | Scaffold | Remote SSH/rsync (not yet implemented) |
| `transport_smb.sh` | Scaffold | SMB/CIFS shares (not yet implemented) |
Pluggable drivers for cloud replication. Cloud transports use a different function naming convention (cloud_transport_<provider>_*) and metrics contract.
| Driver | Status | Purpose |
|---|---|---|
| `cloud_transport_sharepoint.sh` | Production | SharePoint Online via rclone |
All local transport drivers (transports/transport_*.sh) must export the following functions. The replication module (replication_local_module.sh) calls these by name after sourcing the driver via _load_transport().
| Function | Signature | Purpose | Return |
|---|---|---|---|
| `transport_init()` | `transport_init <dest_path> <dest_name>` | Verify destination is accessible and writable. Populate space metrics. | 0 = ready, 1 = failed |
| `transport_sync()` | `transport_sync <source_path> <dest_path> <sync_mode> <bwlimit> <dry_run> <dest_name>` | Perform rsync to destination. Must populate all metrics globals on completion. | 0 = success, 1 = failed |
| `transport_verify()` | `transport_verify <source_path> <dest_path> <verify_mode>` | Post-sync verification. `verify_mode` is `none`, `size`, or `checksum`. | 0 = verified, 1 = mismatch |
| `transport_cleanup()` | `transport_cleanup [<dest_path>]` | Cleanup after sync (unmount, close connections, etc.). No-op is acceptable. | 0 always |
| `transport_get_free_space()` | `transport_get_free_space <dest_path>` | Return available bytes at destination (echo to stdout). | Echoes integer bytes, 0 if unknown |
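A driver can be checked against this contract right after it is sourced. The sketch below uses bash's `declare -F` to report missing functions; the helper name `missing_transport_funcs` is hypothetical, and vmbackup itself may or may not perform such a check.

```bash
#!/usr/bin/env bash
# The five functions every local transport driver is expected to define.
REQUIRED_TRANSPORT_FUNCS=(
    transport_init
    transport_sync
    transport_verify
    transport_cleanup
    transport_get_free_space
)

# Hypothetical helper: print the name of each contract function that is
# not currently defined in the shell. Prints nothing when all are present.
missing_transport_funcs() {
    local fn
    for fn in "${REQUIRED_TRANSPORT_FUNCS[@]}"; do
        declare -F "$fn" >/dev/null || printf '%s\n' "$fn"
    done
}

# Example (not executed here):
#   source transports/transport_ssh.sh
#   [ -z "$(missing_transport_funcs)" ] || echo "driver incomplete" >&2
```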
Every transport must declare and set these identification globals:
TRANSPORT_NAME="mytransport"    # Short name matching the filename suffix
TRANSPORT_VERSION="1.0"         # Semver string

All local transports must have these globals set by the time transport_sync() returns, so the replication module can pass them to sqlite_log_replication_run() for audit logging.
| Global | Type | Description |
|---|---|---|
| `TRANSPORT_BYTES_TRANSFERRED` | integer | Bytes transferred (0 if none) |
| `TRANSPORT_SYNC_DURATION` | integer | Elapsed seconds |
| `TRANSPORT_DEST_AVAIL_BYTES` | integer | Free bytes at destination (0 if unknown) |
| `TRANSPORT_DEST_TOTAL_BYTES` | integer | Total bytes at destination (0 if unknown) |
| `TRANSPORT_DEST_SPACE_KNOWN` | `0`/`1` | Whether space metrics are reliable |
| `TRANSPORT_THROTTLE_COUNT` | integer | Throttle events during sync (-1 if not applicable) |
| `TRANSPORT_BWLIMIT_FINAL` | string | Final bandwidth limit after adjustments ("" if none) |
Sentinel values:
| Value | Meaning |
|---|---|
| `-1` | Metric is structurally not applicable to this transport type |
| `0` | Applicable but none occurred / not available |
| `""` | No value (string fields) |
Per-transport defaults to set at initialisation:
| Transport | `TRANSPORT_THROTTLE_COUNT` | `TRANSPORT_DEST_SPACE_KNOWN` | Rationale |
|---|---|---|---|
| local | `-1` | `1` | rsync doesn't throttle; local df is always reliable |
| ssh | `-1` | `0` or `1` | rsync over SSH doesn't throttle; remote df may or may not work |
| smb | `-1` | `1` when mounted | rsync doesn't throttle; mounted CIFS has reliable df |
Cloud transports use a separate set of globals prefixed with CLOUD_TRANSPORT_:
| Global | Type | Description |
|---|---|---|
| `CLOUD_TRANSPORT_THROTTLE_COUNT` | integer | Throttle events (e.g., HTTP 429 responses) |
| `CLOUD_TRANSPORT_BWLIMIT_FINAL` | string | Final bandwidth limit after adjustments |
| `CLOUD_TRANSPORT_DEST_AVAIL_BYTES` | integer | Free bytes at destination (0 if unknown) |
| `CLOUD_TRANSPORT_DEST_TOTAL_BYTES` | integer | Total bytes at destination (0 if unknown) |
| `CLOUD_TRANSPORT_DEST_SPACE_KNOWN` | `0`/`1` | Whether space metrics are reliable |
Cloud transport function names follow the pattern cloud_transport_<provider>_<action>() (e.g., cloud_transport_sharepoint_upload()). Unlike local transports, cloud drivers are called by the cloud replication module directly by provider-specific function names.
To add a new local transport (e.g., SSH):
1. Create the driver file:
# transports/transport_ssh.sh
TRANSPORT_NAME="ssh"
TRANSPORT_VERSION="1.0"

2. Initialise metrics globals at the top of the file (before any function):
TRANSPORT_BYTES_TRANSFERRED=0
TRANSPORT_SYNC_DURATION=0
TRANSPORT_DEST_AVAIL_BYTES=0
TRANSPORT_DEST_TOTAL_BYTES=0
TRANSPORT_DEST_SPACE_KNOWN=0
TRANSPORT_THROTTLE_COUNT=-1 # rsync doesn't throttle
TRANSPORT_BWLIMIT_FINAL=""

3. Load shared utilities:

source "${SCRIPT_DIR}/lib/transfer_utils.sh"

transfer_utils.sh provides: tu_format_bytes(), tu_format_progress(), tu_get_dir_size(), tu_rsync_exit_message(), tu_get_replication_log_path().
4. Implement all 5 required functions. The replication module calls them in this order:
transport_init() → transport_sync() → transport_verify() → transport_cleanup()
                                       ↑ only if verify_mode ≠ "none"
transport_get_free_space() is called independently for space checks.
5. Use the logging wrappers from the parent scope (log_info, log_warn, log_error, log_debug). Format: log_info "transport_ssh" "function_name" "message".
6. Set all metrics globals before returning from transport_sync(). The replication module reads them immediately after the call returns.
7. File naming and loading: The file must be named transport_<type>.sh where <type> matches the DEST_N_TRANSPORT config value. The replication module loads it via source "$script_dir/transports/transport_${transport_type}.sh".
8. Permissions: The Makefile installs transports with 750 (root:backup, executable). Add your file to the Makefile install target.
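The call order from step 4 can be sketched from the caller's side. `replicate_one` is a hypothetical name, the empty `bwlimit` and `"no"` dry-run values are placeholders, and running `transport_cleanup` even after a failed sync is an assumption rather than documented behaviour.

```bash
#!/usr/bin/env bash
# Hypothetical caller: drive one destination through the transport contract.
# Expects the five transport_* functions to be defined by a sourced driver.
replicate_one() {
    local src="$1" dest="$2" name="$3" sync_mode="$4" verify_mode="$5"
    local rc=0

    transport_init "$dest" "$name" || return 1

    # Arguments follow the documented signature:
    #   <source_path> <dest_path> <sync_mode> <bwlimit> <dry_run> <dest_name>
    if transport_sync "$src" "$dest" "$sync_mode" "" "no" "$name"; then
        # Verification is skipped entirely when verify_mode is "none".
        if [ "$verify_mode" != "none" ]; then
            transport_verify "$src" "$dest" "$verify_mode" || rc=1
        fi
    else
        rc=1
    fi

    transport_cleanup "$dest"
    return "$rc"
}
```

The metrics globals would be read at the point where `transport_sync` returns, before `transport_cleanup` has a chance to change anything.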
This section shows how the backup directory structure evolves over three months with multiple policy changes, using a VM named prod-webserver.
BACKUP_PATH="/mnt/backup/vms/"
DEFAULT_ROTATION_POLICY="monthly"
RETENTION_MONTHS=3

After a month of daily backups, the chain contains a full backup plus 27 incrementals (28 restore points):
/mnt/backup/vms/prod-webserver/
└── 202602/ ← ACTIVE (monthly: YYYYMM)
├── vda.full.data # 11.4 GiB (checkpoint 0)
├── vda.inc.virtnbdbackup.1-27.data # ~6.5 GiB total
└── checkpoints/ # 28 checkpoint files
New month triggers automatic archival of the February chain and a fresh full backup:
/mnt/backup/vms/prod-webserver/
├── 202602/.archives/chain-2026-03-01/ # Archived Feb chain (~18 GiB)
└── 202603/ ← NEW ACTIVE
└── vda.full.data # 12.0 GiB
A policy change forces immediate archival of the current chain, regardless of chain age. The period directory format changes from YYYYMM to YYYY-Www:
/mnt/backup/vms/prod-webserver/
├── 202602/.archives/chain-2026-03-01/ # Feb monthly (~18 GiB)
├── 202603/.archives/chain-2026-03-10/ # Partial March, archived on policy change
└── 2026-W11/ ← NEW ACTIVE (weekly)
└── vda.full.data
After three weekly boundaries, each week has been archived:
/mnt/backup/vms/prod-webserver/
├── 202602/.archives/chain-2026-03-01/ # 18 GiB
├── 202603/.archives/chain-2026-03-10/ # 14 GiB
├── 2026-W11/.archives/chain-2026-03-17/ # 14 GiB
├── 2026-W12/.archives/chain-2026-03-24/ # 14 GiB
├── 2026-W13/.archives/chain-2026-03-31/ # 14 GiB
└── 2026-W14/ ← ACTIVE (~86 GiB total)
Daily policy means one full backup per day — storage-intensive. Each day boundary archives a single-full chain and starts a new one:
/mnt/backup/vms/prod-webserver/
├── ...previous periods...
├── 2026-W14/.archives/chain-2026-04-01/ # Archived weekly chain
├── 20260401/.archives/chain-2026-04-02/ # 12.0 GiB each
├── 20260402/.archives/chain-2026-04-03/
├── ...
└── 20260407/ ← ACTIVE (daily: YYYYMMDD)
With RETENTION_DAYS=7 and 8 daily directories, the oldest (20260401) is purged:
├── 20260402/.archives/... ← Oldest remaining
├── ...
└── 20260408/ ← ACTIVE (~96 GiB total)
Another policy change archives the remaining daily chain and starts a monthly period:
/mnt/backup/vms/prod-webserver/
├── 202602/.archives/... # Feb monthly
├── 202603/.archives/... # Partial March
├── 2026-W11 - W14/.archives/... # Weekly archives
├── 20260409-15/.archives/... # Daily archives (7 kept by retention)
└── 202604/ ← ACTIVE MONTHLY
└── vda.full.data # ~178 GiB total
| Date | Policy | Active Chain | Event | Cumulative Storage |
|---|---|---|---|---|
| Feb 28 | monthly | 202602 (28 pts) | — | ~18 GiB |
| Mar 1 | monthly | 202603 (1 pt) | Period boundary | ~30 GiB |
| Mar 10 | weekly | 2026-W11 (1 pt) | Policy change | ~44 GiB |
| Mar 31 | weekly | 2026-W14 (2 pts) | 3 weekly boundaries | ~86 GiB |
| Apr 1 | daily | 20260401 (1 pt) | Policy change | ~100 GiB |
| Apr 8 | daily | 20260408 (1 pt) | Retention purge | ~96 GiB |
| Apr 15 | monthly | 202604 (1 pt) | Policy change | ~178 GiB |
- Policy change = forced archive: every policy change triggers immediate archival regardless of chain age
- Period boundary = automatic archive: crossing a day/week/month boundary archives the entire chain
- Daily policy is expensive: one full backup per day causes rapid storage growth
- Archive naming: the date in `chain-YYYY-MM-DD[.N]` is when the chain was archived, not when it started
- Cross-policy orphans: old-format directories from previous policies are handled by Tier 2 orphan retention
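The three period directory formats used throughout this walkthrough (YYYYMM, YYYY-Www, YYYYMMDD) can be reproduced with GNU `date`. The helper `period_id` is hypothetical, and the use of ISO 8601 week numbering (`%G-W%V`) is an assumption that happens to match the `2026-W11` and `2026-W14` names shown above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a rotation policy and a date to the name of the
# period directory. Requires GNU date for the -d option.
period_id() {
    local policy="$1" day="${2:-today}"
    case "$policy" in
        monthly) date -d "$day" +%Y%m ;;
        weekly)  date -d "$day" +%G-W%V ;;   # ISO week: an assumption
        daily)   date -d "$day" +%Y%m%d ;;
        *)       return 1 ;;
    esac
}

period_id monthly 2026-02-28   # 202602
period_id weekly  2026-03-10   # 2026-W11
period_id daily   2026-04-07   # 20260407
```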
| Issue | Description | Mitigation |
|---|---|---|
| #267 | Checkpoint bitmap corruption | 20% disk threshold, monthly rotation |
| #226 | Destination full causes corruption | Enhanced disk space checks |
| #223 | "bitmap not found in backing chain" | Auto-recovery option |
| #102 | FSFREEZE timeout on some guests | Timeout guard |
Affects: ALL Windows VMs using VirtIO disks (virtio-blk or virtio-scsi). Not setup-specific — this is a universal QEMU upstream default.
Symptom: guest-fstrim takes 3–15 minutes on a Windows VM with VirtIO disks, vs 1–3 seconds on Linux or SATA. The fstrim module logs a timeout warning or the backup stall exceeds FSTRIM_WINDOWS_TIMEOUT.
Root Cause: QEMU's discard_granularity property for all block device types defaults to 4294967295 (0xFFFFFFFF) — a sentinel value meaning "unset". When virtio-blk reports this to the guest, it resolves to the logical block size (typically 512 bytes). Windows then issues millions of individual 512-byte TRIM operations, each traversing the virtio ring and punching a separate hole in the qcow2 file. This is catastrophically slow.
Linux guests are unaffected because the Linux virtio-blk driver coalesces discard requests at the block layer regardless of the reported granularity.
SATA (ide-hd) guests work acceptably because AHCI handles TRIM via ATA Data Set Management commands at sector granularity through a different code path.
QEMU evidence (all versions including 10.0.7):
$ qemu-system-x86_64 -device virtio-blk-pci,help | grep discard_granularity
discard_granularity=<size> - (default: 4294967295)
Performance impact (from Red Hat Bugzilla 2020998, 20 GB disk):
| discard_granularity | virtio-blk time | virtio-scsi time |
|---|---|---|
| 4K (effective default) | 615s (~10 min) | 575s (~10 min) |
| 32K | 78s | 149s |
| 256K | 15s | 15s |
| 2M | 4s | 3s |
| 16M | 1.4s | 1.4s |
| 32M (recommended) | 1.2s | 1.7s |
Fix — libvirt XML (qemu:override):
Add the qemu: XML namespace and override block to the domain XML. The alias must match each disk's device alias (visible in virsh dumpxml).
Important: Every VirtIO disk with discard='unmap' needs its own override entry. Missing even one disk causes slow TRIM for that volume. The pre-flight check in check_discard_granularity() will warn you about any missing overrides at backup time.
Step 1 — Find your VirtIO disk aliases:
# List all VirtIO disk aliases with discard enabled
virsh dumpxml <vm-name> | awk '
/<disk / { in_disk=1; d=0; v=0; a="" }
in_disk && /discard=.unmap/ { d=1 }
in_disk && /bus=.virtio/ { v=1 }
in_disk && /alias name=/ { s=$0; sub(/.*alias name=./, "", s); sub(/[^a-zA-Z0-9_-].*/, "", s); a=s }
/<\/disk>/ { if (d && v && a != "") print a; in_disk=0 }
'

Example output for a Windows VM with two VirtIO disks and one SATA disk:
virtio-disk0 ← vda (OS disk)
virtio-disk1 ← vdb (data disk)
sata0-0-0 is not listed (SATA — not affected)
Step 2 — Add the override block to the domain XML:
Single-disk example (one VirtIO disk):
<domain type="kvm" xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0">
<!-- ... existing domain XML ... -->
<qemu:override>
<qemu:device alias="virtio-disk0">
<qemu:frontend>
<qemu:property name="discard_granularity" type="unsigned" value="33554432"/>
</qemu:frontend>
</qemu:device>
</qemu:override>
</domain>

Multi-disk example (two VirtIO disks: vda and vdb):
<domain type="kvm" xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0">
<!-- ... existing domain XML ... -->
<qemu:override>
<qemu:device alias="virtio-disk0">
<qemu:frontend>
<qemu:property name="discard_granularity" type="unsigned" value="33554432"/>
</qemu:frontend>
</qemu:device>
<qemu:device alias="virtio-disk1">
<qemu:frontend>
<qemu:property name="discard_granularity" type="unsigned" value="33554432"/>
</qemu:frontend>
</qemu:device>
</qemu:override>
</domain>

Pattern: One `<qemu:device alias="...">` block per VirtIO disk. The alias names increment automatically (`virtio-disk0`, `virtio-disk1`, etc.). SATA disks (`sata0-0-0`) do not need this fix.
The xmlns:qemu namespace declaration must be on the <domain> tag. The value 33554432 is 32 MiB in bytes — matching Microsoft Hyper-V's default. After editing, apply with virsh define and power-cycle the VM (reboot alone is not sufficient — the QEMU process must restart).
Step 3 — Verify the fix:
# Quick TRIM timing test (VM must be running with guest agent)
virsh qemu-agent-command <vm-name> --timeout 60 '{"execute":"guest-fstrim"}'
# Should complete in 1-3 seconds instead of minutes

Fix (QEMU command line, non-libvirt):
-device virtio-blk-pci,...,discard_granularity=32M
# or for virtio-scsi:
-device scsi-hd,...,discard_granularity=32M

vmbackup pre-flight detection: The check_discard_granularity() function runs automatically before FSTRIM for Windows VMs. It parses virsh dumpxml, identifies all VirtIO disks with discard='unmap', cross-references the <qemu:override> section, and logs a warning with the exact XML fix for any disk missing the override. This runs once per VM per session.
vmbackup timeout safety net: The fstrim module detects Windows guests via guest-get-osinfo and applies FSTRIM_WINDOWS_TIMEOUT=600s (configurable) as a fallback. This allows unfixed VMs to complete trim within the timeout rather than being killed at 300s. However, applying the XML fix above is strongly recommended — it reduces trim from minutes to seconds.
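The cross-referencing idea behind the pre-flight check can be sketched by reusing the awk filter from Step 1. This is not the actual `check_discard_granularity()` implementation: the helper name `missing_discard_overrides` is hypothetical, and it only checks that an override block exists for each alias, not that the block sets `discard_granularity`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read domain XML on stdin and print the alias of every
# VirtIO disk with discard=unmap that has no <qemu:device> override entry.
missing_discard_overrides() {
    local xml disks overrides
    xml=$(cat)

    # VirtIO disks with discard enabled (same awk logic as Step 1).
    disks=$(printf '%s\n' "$xml" | awk '
        /<disk /                    { in_disk=1; d=0; v=0; a="" }
        in_disk && /discard=.unmap/ { d=1 }
        in_disk && /bus=.virtio/    { v=1 }
        in_disk && /alias name=/    { s=$0; sub(/.*alias name=./, "", s); sub(/[^a-zA-Z0-9_-].*/, "", s); a=s }
        /<\/disk>/                  { if (d && v && a != "") print a; in_disk=0 }
    ' | sort -u)
    [ -n "$disks" ] || return 0

    # Aliases that already have an override block.
    overrides=$(printf '%s\n' "$xml" \
        | sed -n "s/.*<qemu:device alias=[\"']\([^\"']*\)[\"'].*/\1/p" \
        | sort -u)

    # Keep lines that appear only in the first list: disks without override.
    comm -23 <(printf '%s\n' "$disks") <(printf '%s\n' "$overrides")
}

# Example (not executed here):
#   virsh dumpxml my-vm | missing_discard_overrides
```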
References:
- Red Hat Bugzilla 2020998 — "Windows 10 TRIM/Discard causes all data to be rewritten" (CLOSED ERRATA)
- RHBA-2023:2451 — virtio-win driver fix (viostor build 228+)
- kvm-guest-drivers-windows #666 — upstream issue
- kvm-guest-drivers-windows PR #824 — viostor driver fix
- QEMU source: `hw/block/virtio-blk.c` → `virtio_blk_update_config()`: sentinel fallback to `blk_size`
Note: The virtio-win driver fix (RHBA-2023:2451) addresses the "Slab size is too small" error but does NOT fix the performance issue. Even with updated drivers, the QEMU `discard_granularity` override is required for fast TRIM.
Affects: All guests (Windows and Linux) running the QEMU guest agent. Observed more frequently on Windows due to longer TRIM times on unfixed VMs.
Symptom: After interrupting a running guest-fstrim command (Ctrl+C during manual testing, or SIGTERM/SIGINT during a backup run), all subsequent virsh qemu-agent-command calls to that VM fail with:
error: Guest agent is not responding: QEMU guest agent is not connected
The VM itself continues to run normally — only the guest agent communication channel is broken.
Root Cause: The QEMU guest agent processes guest-fstrim synchronously. When the host-side virsh command is killed, the agent's internal state machine does not clean up properly. The virtio-serial channel enters a wedged state where the agent process is still running inside the guest but cannot accept new commands.
Recovery:
- Immediate: Power-cycle the VM (`virsh destroy` + `virsh start`). A simple `virsh reboot` is NOT sufficient: the QEMU process and virtio-serial channel must be fully torn down and recreated.
- Inside the guest (if accessible): Restart the QEMU guest agent service:
  - Linux: `systemctl restart qemu-guest-agent`
  - Windows: Restart the "QEMU Guest Agent" service in `services.msc`, or: `net stop QEMU-GA && net start QEMU-GA`
Prevention:
- Do NOT manually Ctrl+C a `guest-fstrim` command while it is running
- Apply the `discard_granularity` fix (see above) to ensure TRIM completes in seconds rather than minutes, reducing the window where interruption is likely
- vmbackup's interrupt handler avoids killing in-progress FSTRIM: the `_BACKUP_IN_PROGRESS` flag gates interrupt handling to prevent false chain-break markers
Note: This is a known QEMU guest agent limitation, not a vmbackup bug. There is no upstream fix as of QEMU 10.0. The agent protocol does not support cancellation of in-progress commands.
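A wedged agent can be detected before issuing further agent commands by sending the lightweight `guest-ping` command. The sketch below is illustrative: `agent_responds` and the `VIRSH` override variable are hypothetical names, and the 5-second default timeout is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Hypothetical probe: succeed only if the guest agent answers guest-ping
# within the timeout. VIRSH may be overridden, for example with a stub in
# tests; it defaults to the real virsh binary.
agent_responds() {
    local vm="$1" timeout="${2:-5}"
    "${VIRSH:-virsh}" qemu-agent-command "$vm" --timeout "$timeout" \
        '{"execute":"guest-ping"}' >/dev/null 2>&1
}

# Example (not executed here):
#   if ! agent_responds my-vm; then
#       echo "guest agent not responding; see Recovery above" >&2
#   fi
```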
# Auto-recovery on checkpoint corruption
ENABLE_AUTO_RECOVERY_ON_CHECKPOINT_CORRUPTION="yes" # yes|warn|no
# AUTO→FULL conversion on failure
CHECKPOINT_RETRY_AUTO_TO_FULL="yes"
CHECKPOINT_MAX_RETRIES_AUTO=1

vmbackup returns categorised exit codes so monitoring systems can distinguish why a run failed without scraping logs. The scheme is backward-compatible: `if vmbackup; then` and `(( $? != 0 ))` patterns continue to work.
| Code | Meaning | Example trigger |
|---|---|---|
| 0 | Success | Normal completion, --help, --version, dry-run, --cancel-replication |
| 1 | General error | Backup session ended with one or more VM failures |
| 2 | Configuration error | Missing config file, named instance not found, not running as root |
| 3 | Lock conflict | Another vmbackup session already running |
| 4 | Storage error | Backup destination unreachable, disk full, scratch path missing |
| 5 | VM problem | Targeted VM not found in libvirt |
| 6 | External tool failure | Dependency check on external tool failed; cloud upload errors. Note: virtnbdbackup failures during a backup pipeline are reported as 1 (see SQLite session row and email report for tool-level detail). |
| 7 | CLI usage error | Unknown flag, missing required argument, mutually exclusive flags combined |
| 8 | Missing dependency | Required tool not installed, integration module not found |
| 130 | SIGINT (Ctrl-C) | Shell convention 128 + 2 |
| 143 | SIGTERM | Shell convention 128 + 15 |
Symmetric with vmrestore — the same number means the same category in both tools, so monitoring rules can be written once and applied to either binary.
Monitoring integration patterns: page on 4 or 6 (storage/tool — usually needs immediate attention), log-only on 3 (lock conflicts often resolve on next cron run), alert on 8 (deployment problem).
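The patterns above could be wired into a cron wrapper along these lines. The function `classify_exit` and the action labels are hypothetical; the fallback to `review` for codes the text does not mention is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical mapping from a vmbackup exit code to a monitoring action.
classify_exit() {
    case "$1" in
        0)       echo "ok" ;;
        4|6)     echo "page" ;;         # storage or external tool failure
        3)       echo "log-only" ;;     # lock conflict, often self-resolving
        8)       echo "alert" ;;        # missing dependency: deployment problem
        130|143) echo "interrupted" ;;  # SIGINT / SIGTERM
        *)       echo "review" ;;       # assumption: anything else needs a look
    esac
}

# Example (not executed here):
#   vmbackup
#   action=$(classify_exit "$?")
```

Because vmrestore uses the same categories, the same function could in principle classify its exit codes as well.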