An attempt to reproduce CALM (Continuous Audio Language Models) using DACVAE as the audio VAE.
This repository implements the CALM architecture for text-to-speech, which models continuous audio latents autoregressively with a flow head trained via flow matching and Lagrangian Self-Distillation (LSD) for few-step inference.
Key design choices:
- Audio VAE: Uses facebook/dacvae-watermarked, operating on its 128-dimensional latent space.
- LSD training: Lagrangian Self-Distillation (a flow-map training method that generalizes consistency models for few-step generation) follows the official flow-maps implementation.
- Model architecture: Based on the paper and the publicly released pocket-tts weights. There may be minor differences from the original architecture.
- Inference: Reference implementation based on pocket-tts.
- Components not described in the paper (e.g. EOS head for end-of-sequence prediction) are implemented with reasonable heuristics.
This is a work-in-progress reproduction and does not yet produce usable speech.
- Training runs without issues and loss decreases steadily.
- The model has not yet achieved intelligible speech generation.
- It is unclear whether this is due to bugs in the training/inference code, insufficient model capacity, or insufficient training data/compute.
- DACVAE's latent dimension (128) is comparable to that of the Music VAE in the CALM paper. The paper's music setup uses a much larger model (backbone 1.35B / flow head 601M) than its TTS setup (32-dim latent), so the current configs may need to be scaled up accordingly.
Contributions, bug reports, and discussions are welcome.
Requires Python >= 3.11.
git clone https://github.com/Aratako/CALM-DACVAE.git
cd CALM-DACVAE
uv sync
Encodes audio from a Hugging Face dataset into DACVAE latents and produces a JSONL manifest for training.
uv run python -m calm_dacvae.prepare_manifest \
--dataset myorg/my_dataset \
--split train \
--audio-column audio \
--text-column text \
--output-manifest data/train_manifest.jsonl \
--latent-dir data/latents \
--device cuda
To include speaker_id in the manifest (for speaker-conditioned training):
uv run python -m calm_dacvae.prepare_manifest \
--dataset myorg/my_dataset \
--split train \
--audio-column audio \
--text-column text \
--speaker-column speaker \
--output-manifest data/train_manifest.jsonl \
--latent-dir data/latents \
--device cuda
Multi-GPU encoding is supported via --num-gpus or torchrun.
Computes global mean/std for latent normalization. If omitted, training will compute this automatically on startup.
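As a rough illustration of what such statistics involve, the sketch below accumulates a per-dimension mean and standard deviation over a stream of latent arrays without loading them all at once. The function names, the `(frames, dim)` array layout, and the choice of per-dimension (rather than scalar) statistics are assumptions made for this example; they are not taken from calm_dacvae.compute_stats.

```python
import numpy as np


def accumulate_latent_stats(latents):
    """Return per-dimension (mean, std) over all frames of all clips.

    `latents` is any iterable of arrays shaped (frames, dim).
    This layout is an assumption for illustration only.
    """
    total = None
    total_sq = None
    count = 0
    for z in latents:
        z = np.asarray(z, dtype=np.float64)
        if total is None:
            total = np.zeros(z.shape[1])
            total_sq = np.zeros(z.shape[1])
        total += z.sum(axis=0)          # running sum per dimension
        total_sq += (z * z).sum(axis=0)  # running sum of squares
        count += z.shape[0]
    if count == 0:
        raise ValueError("no latent frames were provided")
    mean = total / count
    # Clamp at zero to guard against tiny negative values from rounding.
    var = np.maximum(total_sq / count - mean**2, 0.0)
    return mean, np.sqrt(var)


def normalize(z, mean, std, eps=1e-5):
    """Standardize latents with the global statistics (hypothetical helper)."""
    return (np.asarray(z) - mean) / (std + eps)


# Example with random stand-in latents of different lengths (128-dim).
rng = np.random.default_rng(0)
clips = [rng.standard_normal((n, 128)) * 3.0 + 1.0 for n in (50, 80, 20)]
mean, std = accumulate_latent_stats(clips)
normalized = normalize(clips[0], mean, std)
```

Accumulating sums and sums of squares keeps memory use constant in the number of clips, which matters when the latents are stored as many separate files.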
uv run python -m calm_dacvae.compute_stats \
--manifest data/train_manifest.jsonl \
--output data/latent_stats.pt
uv run python -m calm_dacvae.train \
--manifest data/train_manifest.jsonl \
--out-dir outputs/run1 \
--config configs/train_pocket_tts_teacher_like.yaml \
--tokenizer-name llm-jp/llm-jp-3-440m
Two config presets are provided:
| Config | Description |
|---|---|
| configs/train_pocket_tts_teacher_like.yaml | CALM TTS teacher-scale model (24 layers, 1024 dim) |
| configs/train_pocket_tts_student_like.yaml | Pocket-TTS student-scale model (6 layers, 1024 dim) |
Multi-GPU training with DDP:
uv run torchrun --standalone --nproc_per_node=4 -m calm_dacvae.train \
--manifest data/train_manifest.jsonl \
--out-dir outputs/run1 \
--config configs/train_pocket_tts_teacher_like.yaml \
--tokenizer-name llm-jp/llm-jp-3-440m \
--device cuda
W&B logging:
uv run python -m calm_dacvae.train \
--manifest data/train_manifest.jsonl \
--out-dir outputs/run1 \
--config configs/train_pocket_tts_teacher_like.yaml \
--tokenizer-name llm-jp/llm-jp-3-440m \
--wandb --wandb-project calm-dacvae
uv run python -m calm_dacvae.infer \
--checkpoint outputs/run1/checkpoint_latest.pt \
--text "Hello world, this is a test." \
--output-wav outputs/sample.wav \
--seconds 6 \
--lsd-steps 1 \
--temperature 0.9 \
--cfg-text-scale 1.5 \
--device cuda
Voice cloning with a reference latent (requires a model trained with --speaker-pair-mode condition_prefix):
uv run python -m calm_dacvae.infer \
--checkpoint outputs/run1/checkpoint_latest.pt \
--text "Hello world." \
--ref-latent data/ref_latent.pt \
--output-wav outputs/clone.wav \
--seconds 4 \
--device cuda
- Flow matching: Trigonometric noising schedule (x_t = sin(theta)*x + cos(theta)*eps, where theta = (pi/2)*t), with adaptive loss weighting per the paper's appendix.
- LSD (Lagrangian Self-Distillation): Self-distillation using current parameters (not EMA). The d/dt X_{s,t} term is computed via autograd JVP.
- Head batch multiplier: Multiple flow head samples per backbone forward pass (default: 8).
- CFG: Supports text-only, audio-prefix-only, and joint (text + audio) classifier-free guidance at inference.
- EOS: A lightweight EOS head predicts end-of-sequence to enable variable-length generation.
- Speaker conditioning: condition_prefix mode injects reference latent as a prefix to the long branch (pocket-tts style).
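The trigonometric noising schedule listed above can be sketched in a few lines of NumPy. The schedule itself comes from the formula in the list; the velocity target shown here is simply its time derivative, and whether the training code regresses exactly this quantity (and how the adaptive loss weighting is applied) is an assumption not covered by this sketch. The function names are hypothetical.

```python
import numpy as np


def noised_sample(x, eps, t):
    """x_t = sin(theta) * x + cos(theta) * eps, with theta = (pi / 2) * t.

    At t = 0 this is pure noise (eps); at t = 1 it is the clean latent (x).
    """
    theta = 0.5 * np.pi * t
    return np.sin(theta) * x + np.cos(theta) * eps


def velocity_target(x, eps, t):
    """Time derivative of the schedule: d/dt x_t.

    Using this as the regression target is an assumption for illustration.
    """
    theta = 0.5 * np.pi * t
    return 0.5 * np.pi * (np.cos(theta) * x - np.sin(theta) * eps)


# Example with random stand-in data: 4 frames of a 128-dim latent.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 128))
eps = rng.standard_normal((4, 128))
x_mid = noised_sample(x, eps, 0.5)
v_mid = velocity_target(x, eps, 0.5)
```

Because sin^2 + cos^2 = 1, this schedule keeps the variance of x_t constant when x and eps are independent with unit variance, which is one common motivation for preferring it over a linear interpolation.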
Licensed under the MIT License.