Aratako/CALM-DACVAE

CALM-DACVAE

An attempt to reproduce CALM (Continuous Audio Language Models) using DACVAE as the audio VAE.

Overview

This repository implements the CALM architecture for text-to-speech, which models continuous audio latents autoregressively with a flow head trained via flow matching and Lagrangian Self-Distillation (LSD) for few-step inference.

Key design choices:

  • Audio VAE: Uses facebook/dacvae-watermarked, operating on its 128-dimensional latent space.
  • LSD training: Lagrangian Self-Distillation, a flow-map training method that generalizes consistency models for few-step generation, is implemented following the official flow-maps implementation.
  • Model architecture: Based on the paper and the publicly released pocket-tts weights. There may be minor differences from the original architecture.
  • Inference: Reference implementation based on pocket-tts.
  • Components not described in the paper (e.g. EOS head for end-of-sequence prediction) are implemented with reasonable heuristics.

Current Status

This is a work-in-progress reproduction and does not yet produce usable speech.

  • Training runs without issues and loss decreases steadily.
  • The model has not yet achieved intelligible speech generation.
  • It is unclear whether this is due to bugs in the training/inference code, insufficient model capacity, or insufficient training data/compute.
    • DACVAE's 128-dimensional latent is comparable to that of the Music VAE in the CALM paper. The paper's music setup uses a much larger model (backbone 1.35B / flow head 601M) than its TTS setup, which operates on a 32-dimensional latent. The current configs may need to be scaled up accordingly.

Contributions, bug reports, and discussions are welcome.

Installation

Requires Python >= 3.11.

git clone https://github.com/Aratako/CALM-DACVAE.git
cd CALM-DACVAE
uv sync

Usage

1. Prepare Manifest (Precompute DACVAE Latents)

Encodes audio from a Hugging Face dataset into DACVAE latents and produces a JSONL manifest for training.

uv run python -m calm_dacvae.prepare_manifest \
  --dataset myorg/my_dataset \
  --split train \
  --audio-column audio \
  --text-column text \
  --output-manifest data/train_manifest.jsonl \
  --latent-dir data/latents \
  --device cuda

To include speaker_id in the manifest (for speaker-conditioned training):

uv run python -m calm_dacvae.prepare_manifest \
  --dataset myorg/my_dataset \
  --split train \
  --audio-column audio \
  --text-column text \
  --speaker-column speaker \
  --output-manifest data/train_manifest.jsonl \
  --latent-dir data/latents \
  --device cuda

Multi-GPU encoding is supported via --num-gpus or torchrun.
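
Before training, it can help to sanity-check the generated manifest. The sketch below assumes only that the manifest is JSONL with one JSON object per line, as described above. It makes no assumption about field names, which are not documented here, and simply reports the record count and how often each top-level key occurs. The helper name summarize_manifest is hypothetical and not part of this repository.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_manifest(path):
    """Return (record_count, {key: occurrences}) for a JSONL file.

    Hypothetical helper for inspection only; it does not depend on the
    manifest schema used by calm_dacvae.prepare_manifest.
    """
    key_counts = Counter()
    num_records = 0
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            num_records += 1
            key_counts.update(record.keys())
    return num_records, dict(key_counts)


if __name__ == "__main__":
    manifest = Path("data/train_manifest.jsonl")
    if manifest.exists():
        count, keys = summarize_manifest(manifest)
        print(f"{count} records")
        for key, n in sorted(keys.items()):
            print(f"  {key}: present in {n} records")
```

A key that appears in fewer records than the total (for example a speaker field) is a quick hint that some rows were written without it.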

2. Compute Latent Stats (Optional)

Computes global mean/std for latent normalization. If omitted, training will compute this automatically on startup.

uv run python -m calm_dacvae.compute_stats \
  --manifest data/train_manifest.jsonl \
  --output data/latent_stats.pt
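
As an illustration of what "global mean/std for latent normalization" can mean, here is a minimal sketch in plain Python. It assumes per-dimension statistics accumulated over every frame of every utterance. Whether calm_dacvae.compute_stats uses per-dimension or scalar statistics is not stated above, so treat the function names, the eps value, and the data layout as assumptions.

```python
import math


def latent_mean_std(latents, eps=1e-8):
    """Per-dimension mean and std over all frames of all utterances.

    `latents` is an iterable of utterances; each utterance is a list of
    frames; each frame is a list of floats of equal length.
    (Assumed layout, for illustration only.)
    """
    total = None
    total_sq = None
    count = 0
    for utterance in latents:
        for frame in utterance:
            if total is None:
                total = [0.0] * len(frame)
                total_sq = [0.0] * len(frame)
            for d, value in enumerate(frame):
                total[d] += value
                total_sq[d] += value * value
            count += 1
    if count == 0:
        raise ValueError("no frames found")
    mean = [s / count for s in total]
    # Population variance, clamped at zero to absorb rounding error;
    # eps keeps the later division well defined for constant dimensions.
    std = [
        math.sqrt(max(sq / count - m * m, 0.0)) + eps
        for sq, m in zip(total_sq, mean)
    ]
    return mean, std


def normalize(frame, mean, std):
    """Shift and scale one frame with the global statistics."""
    return [(v - m) / s for v, m, s in zip(frame, mean, std)]
```

Accumulating sums and squared sums keeps memory constant regardless of dataset size, which matters when latents are streamed from disk.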

3. Train

uv run python -m calm_dacvae.train \
  --manifest data/train_manifest.jsonl \
  --out-dir outputs/run1 \
  --config configs/train_pocket_tts_teacher_like.yaml \
  --tokenizer-name llm-jp/llm-jp-3-440m

Two config presets are provided:

| Config | Description |
| --- | --- |
| configs/train_pocket_tts_teacher_like.yaml | CALM TTS teacher-scale model (24 layers, 1024 dim) |
| configs/train_pocket_tts_student_like.yaml | Pocket-TTS student-scale model (6 layers, 1024 dim) |

Multi-GPU training with DDP:

uv run torchrun --standalone --nproc_per_node=4 -m calm_dacvae.train \
  --manifest data/train_manifest.jsonl \
  --out-dir outputs/run1 \
  --config configs/train_pocket_tts_teacher_like.yaml \
  --tokenizer-name llm-jp/llm-jp-3-440m \
  --device cuda

W&B logging:

uv run python -m calm_dacvae.train \
  --manifest data/train_manifest.jsonl \
  --out-dir outputs/run1 \
  --config configs/train_pocket_tts_teacher_like.yaml \
  --tokenizer-name llm-jp/llm-jp-3-440m \
  --wandb --wandb-project calm-dacvae

4. Inference

uv run python -m calm_dacvae.infer \
  --checkpoint outputs/run1/checkpoint_latest.pt \
  --text "Hello world, this is a test." \
  --output-wav outputs/sample.wav \
  --seconds 6 \
  --lsd-steps 1 \
  --temperature 0.9 \
  --cfg-text-scale 1.5 \
  --device cuda
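
For intuition about --cfg-text-scale, the usual classifier-free guidance rule extrapolates from the unconditional prediction toward the conditional one. The sketch below shows that standard rule on plain lists. It is an assumption that calm_dacvae.infer combines predictions this way, and the text above does not say which tensor the guidance is applied to. The function name apply_cfg is hypothetical.

```python
def apply_cfg(cond, uncond, scale):
    """Standard classifier-free guidance combination.

    cond   : prediction with the text condition
    uncond : prediction with the condition dropped
    scale  : 1.0 returns `cond` unchanged; larger values push further
             away from `uncond`
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    cond = [1.0, 2.0]
    uncond = [0.0, 1.0]
    print(apply_cfg(cond, uncond, 1.0))
    print(apply_cfg(cond, uncond, 1.5))
```

With this rule, a scale of 1.5 as in the command above moves the output half a step beyond the conditional prediction along the direction (cond - uncond).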

Voice cloning with a reference latent (requires a model trained with --speaker-pair-mode condition_prefix):

uv run python -m calm_dacvae.infer \
  --checkpoint outputs/run1/checkpoint_latest.pt \
  --text "Hello world." \
  --ref-latent data/ref_latent.pt \
  --output-wav outputs/clone.wav \
  --seconds 4 \
  --device cuda

Implementation Notes

  • Flow matching: Trigonometric noising schedule (x_t = sin(theta)*x + cos(theta)*eps where theta = (pi/2)*t), with adaptive loss weighting per the paper's appendix.
  • LSD (Lagrangian Self-Distillation): Self-distillation using current parameters (not EMA). The d/dt X_{s,t} term is computed via autograd JVP.
  • Head batch multiplier: Multiple flow head samples per backbone forward pass (default: 8).
  • CFG: Supports text-only, audio-prefix-only, and joint (text + audio) classifier-free guidance at inference.
  • EOS: A lightweight EOS head predicts end-of-sequence to enable variable-length generation.
  • Speaker conditioning: condition_prefix mode injects the reference latent as a prefix to the long branch (pocket-tts style).
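
The trigonometric noising schedule from the flow matching note can be sketched in a few lines of plain Python. This is a minimal illustration of x_t = sin(theta)*x + cos(theta)*eps with theta = (pi/2)*t, not the repository's training code. The function names are hypothetical, and the time derivative shown is simply the derivative of that path. Whether the training loss regresses on exactly this quantity is not stated above.

```python
import math
import random


def noised_sample(x, t, rng=random):
    """Return (x_t, eps) for a clean latent frame x at time t in [0, 1].

    t = 0 gives pure noise and t = 1 gives the clean frame, because
    sin(0) = 0 and sin(pi/2) = 1.
    """
    theta = 0.5 * math.pi * t
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    x_t = [math.sin(theta) * xi + math.cos(theta) * ei for xi, ei in zip(x, eps)]
    return x_t, eps


def path_velocity(x, eps, t):
    """Time derivative of the path: (pi/2) * (cos(theta)*x - sin(theta)*eps)."""
    theta = 0.5 * math.pi * t
    return [
        0.5 * math.pi * (math.cos(theta) * xi - math.sin(theta) * ei)
        for xi, ei in zip(x, eps)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    frame = [0.3, -1.2, 0.8]
    for t in (0.0, 0.5, 1.0):
        x_t, eps = noised_sample(frame, t, rng)
        print(t, [round(v, 3) for v in x_t])
```

Since sin(theta)^2 + cos(theta)^2 = 1, the mix keeps unit variance when both the normalized latent and the noise have unit variance, which is one reason latent normalization matters for this schedule.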

License

MIT