CALM-DACVAE

An attempt to reproduce CALM (Continuous Audio Language Models) using DACVAE as the audio VAE.

Overview

This repository implements the CALM architecture for text-to-speech, which models continuous audio latents autoregressively with a flow head trained via flow matching and Lagrangian Self-Distillation (LSD) for few-step inference.

Key design choices:

Audio VAE: Uses facebook/dacvae-watermarked, operating on its 128-dimensional latent space.
LSD training: Lagrangian Self-Distillation, a flow-map training method that generalizes consistency models for few-step generation, is implemented following the official flow-maps implementation.
Model architecture: Based on the paper and the publicly released pocket-tts weights. There may be minor differences from the original architecture.
Inference: Reference implementation based on pocket-tts.
Components not described in the paper (e.g. EOS head for end-of-sequence prediction) are implemented with reasonable heuristics.

Current Status

This is a work-in-progress reproduction and does not yet produce usable speech.

Training runs without issues and loss decreases steadily.
The model has not yet achieved intelligible speech generation.
It is unclear whether this is due to bugs in the training/inference code, insufficient model capacity, or insufficient training data/compute.
- DACVAE's latent dimension (128) is comparable to that of the music VAE in the CALM paper. The paper's music setup uses a much larger model (1.35B backbone, 601M flow head) than its TTS setup, which works on a 32-dimensional latent. The current configs may need to be scaled up accordingly.

Contributions, bug reports, and discussions are welcome.

Installation

Requires Python >= 3.11.

git clone https://github.com/Aratako/CALM-DACVAE.git
cd CALM-DACVAE
uv sync

Usage

1. Prepare Manifest (Precompute DACVAE Latents)

Encodes audio from a Hugging Face dataset into DACVAE latents and produces a JSONL manifest for training.

uv run python -m calm_dacvae.prepare_manifest \
  --dataset myorg/my_dataset \
  --split train \
  --audio-column audio \
  --text-column text \
  --output-manifest data/train_manifest.jsonl \
  --latent-dir data/latents \
  --device cuda

To include speaker_id in the manifest (for speaker-conditioned training):

uv run python -m calm_dacvae.prepare_manifest \
  --dataset myorg/my_dataset \
  --split train \
  --audio-column audio \
  --text-column text \
  --speaker-column speaker \
  --output-manifest data/train_manifest.jsonl \
  --latent-dir data/latents \
  --device cuda

Multi-GPU encoding is supported via --num-gpus or torchrun.
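
Because the manifest is plain JSONL (one JSON object per line), it can be sanity-checked before training. The helper below is a minimal sketch and not part of this repository: `summarize_manifest` is a hypothetical name, and it makes no assumption about which fields the manifest contains. It only counts rows and reports how often each field name appears.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_manifest(path):
    """Return (row_count, Counter of field names) for a JSONL file.

    Hypothetical helper: it inspects whatever keys are present and does
    not assume any particular manifest schema.
    """
    rows = 0
    fields = Counter()
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            rows += 1
            fields.update(record.keys())
    return rows, fields
```

For example, `rows, fields = summarize_manifest("data/train_manifest.jsonl")` followed by `print(rows, dict(fields))` shows whether every row carries the same fields, which is a quick way to confirm that an optional column such as the speaker ID was actually written.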

2. Compute Latent Stats (Optional)

Computes the global mean/std used for latent normalization. If this step is skipped, training computes the stats automatically on startup.

uv run python -m calm_dacvae.compute_stats \
  --manifest data/train_manifest.jsonl \
  --output data/latent_stats.pt
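
To illustrate what global mean/std normalization involves, here is a minimal sketch in pure Python. It is not the code in `calm_dacvae.compute_stats`: the function names are hypothetical, and it assumes per-dimension statistics accumulated over all frames, whereas the actual script may aggregate differently (for example, a single scalar mean/std).

```python
import math


def running_mean_std(frame_batches, dim):
    """Accumulate per-dimension mean and std over batches of latent frames.

    Each batch is a list of frames; each frame is a list of `dim` floats.
    Sums are accumulated in a streaming fashion, so the whole dataset
    never has to be held in memory at once.
    """
    count = 0
    total = [0.0] * dim
    total_sq = [0.0] * dim
    for frames in frame_batches:
        for frame in frames:
            count += 1
            for i, value in enumerate(frame):
                total[i] += value
                total_sq[i] += value * value
    if count == 0:
        raise ValueError("no frames to compute statistics from")
    mean = [s / count for s in total]
    # Population variance E[x^2] - E[x]^2, clamped at zero against rounding.
    std = [math.sqrt(max(sq / count - m * m, 0.0)) for sq, m in zip(total_sq, mean)]
    return mean, std


def normalize(frame, mean, std, eps=1e-5):
    """Standardize one frame; eps (an assumed value) guards against zero std."""
    return [(v - m) / (s + eps) for v, m, s in zip(frame, mean, std)]
```

The sum-of-squares form is simple but can lose precision on very large datasets; Welford's algorithm is a common, more stable alternative.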

3. Train

uv run python -m calm_dacvae.train \
  --manifest data/train_manifest.jsonl \
  --out-dir outputs/run1 \
  --config configs/train_pocket_tts_teacher_like.yaml \
  --tokenizer-name llm-jp/llm-jp-3-440m

Two config presets are provided:

Config	Description
`configs/train_pocket_tts_teacher_like.yaml`	CALM TTS teacher-scale model (24 layers, 1024 dim)
`configs/train_pocket_tts_student_like.yaml`	Pocket-TTS student-scale model (6 layers, 1024 dim)

Multi-GPU training with DDP:

uv run torchrun --standalone --nproc_per_node=4 -m calm_dacvae.train \
  --manifest data/train_manifest.jsonl \
  --out-dir outputs/run1 \
  --config configs/train_pocket_tts_teacher_like.yaml \
  --tokenizer-name llm-jp/llm-jp-3-440m \
  --device cuda

W&B logging:

uv run python -m calm_dacvae.train \
  --manifest data/train_manifest.jsonl \
  --out-dir outputs/run1 \
  --config configs/train_pocket_tts_teacher_like.yaml \
  --tokenizer-name llm-jp/llm-jp-3-440m \
  --wandb --wandb-project calm-dacvae

4. Inference

uv run python -m calm_dacvae.infer \
  --checkpoint outputs/run1/checkpoint_latest.pt \
  --text "Hello world, this is a test." \
  --output-wav outputs/sample.wav \
  --seconds 6 \
  --lsd-steps 1 \
  --temperature 0.9 \
  --cfg-text-scale 1.5 \
  --device cuda

Voice cloning with a reference latent (requires a model trained with --speaker-pair-mode condition_prefix):

uv run python -m calm_dacvae.infer \
  --checkpoint outputs/run1/checkpoint_latest.pt \
  --text "Hello world." \
  --ref-latent data/ref_latent.pt \
  --output-wav outputs/clone.wav \
  --seconds 4 \
  --device cuda
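
The `--cfg-text-scale` flag controls classifier-free guidance. As a rough illustration, the standard CFG rule extrapolates from an unconditional prediction toward a conditional one. The sketch below shows only that generic single-condition rule; `cfg_combine` is a hypothetical name, and how this repository combines text and audio-prefix guidance is not shown here.

```python
def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance on two prediction vectors.

    scale = 0 returns the unconditional prediction, scale = 1 returns the
    conditional one, and scale > 1 pushes further along the same direction.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]
```

With `scale=1.5`, as in the command above, each output component moves 50% past the conditional prediction along the direction from unconditional to conditional.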

Implementation Notes

Flow matching: Trigonometric noising schedule (x_t = sin(theta)*x + cos(theta)*eps where theta = (pi/2)*t), with adaptive loss weighting per the paper's appendix.
LSD (Lagrangian Self-Distillation): Self-distillation using current parameters (not EMA). The d/dt X_{s,t} term is computed via autograd JVP.
Head batch multiplier: Multiple flow head samples per backbone forward pass (default: 8).
CFG: Supports text-only, audio-prefix-only, and joint (text + audio) classifier-free guidance at inference.
EOS: A lightweight EOS head predicts end-of-sequence to enable variable-length generation.
Speaker conditioning: condition_prefix mode injects reference latent as a prefix to the long branch (pocket-tts style).
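
The trigonometric schedule from the flow matching note can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's training code: the function names are hypothetical, vectors are plain Python lists, and the velocity shown is simply the time derivative of x_t. Whether the flow head regresses onto exactly this quantity, and how the adaptive loss weighting is applied, is not covered.

```python
import math


def trig_interpolate(x, eps, t):
    """x_t = sin(theta) * x + cos(theta) * eps, with theta = (pi / 2) * t.

    t = 0 gives pure noise (eps); t = 1 gives the clean latent (x).
    """
    theta = 0.5 * math.pi * t
    s, c = math.sin(theta), math.cos(theta)
    return [s * xi + c * ei for xi, ei in zip(x, eps)]


def trig_velocity(x, eps, t):
    """Time derivative of x_t: (pi / 2) * (cos(theta) * x - sin(theta) * eps)."""
    theta = 0.5 * math.pi * t
    s, c = math.sin(theta), math.cos(theta)
    k = 0.5 * math.pi
    return [k * (c * xi - s * ei) for xi, ei in zip(x, eps)]
```

Since sin^2(theta) + cos^2(theta) = 1, this path keeps unit variance at every t when x and eps are independent with unit variance, which is one reason normalizing the latents beforehand matters.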

References

License

MIT
