An offline desktop GUI application built on top of OpenBMB's state-of-the-art VoxCPM2 — a 2-billion-parameter open-source text-to-speech large language model. It bundles voice design, voice cloning, ultimate cloning, and a built-in live recording studio into one tool, so you can experience high-quality AI speech synthesis on your Windows PC without writing a single line of code.
- Dynamic Microphone Selection — Automatically detects and lists every available audio input device on your system.
- Real-Time Waveform Visualization — A built-in waveform monitor shows pink ripple animations while recording, giving you instant visual confirmation that audio is being captured.
- One-Click Apply — After recording, apply the WAV file and its transcript directly to the Voice Clone or Ultimate Clone tab with a single click.
- Create from Scratch — No reference audio needed. Simply prepend a parenthesized English description to your text (e.g., `(A gentle young female voice, smiling)`) and the model will generate a unique voice that matches your description.
- Quick Preset Panel — Choose from 10+ curated voice presets covering Mandarin, Japanese, English, Korean, and more; one click to apply.
- 📋 Batch Synthesis (Audiobook Mode) — Paste a long article or novel line-by-line (one sentence per line) and the system synthesizes each line into a separate audio file in the background, preventing memory overflow or context-length crashes.
- Ultra-Short Reference Audio — Only 3–10 seconds of clean vocal recording is needed to quickly replicate a speaker's tonal characteristics.
- Randomized Takes — The random seed is deliberately left unfixed by default: the timbre stays consistent, but the pacing, pauses, and intonation vary with every generation, letting you cherry-pick the most natural-sounding take.
- Seamless Continuation — Supply both a reference audio clip and its exact transcript. The model treats the reference as a timeline prefix and seamlessly continues speaking your target text, perfectly inheriting the speaker's breathing, pitch contour, ambient noise, and emotion.
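The voice-design prefix and the line-by-line batch mode described above can be sketched in a few lines of Python. This is only an illustration, not the application's actual code: the function names, the `line_0001.wav` naming scheme, and the choice to put no space between the closing parenthesis and the text are all assumptions.

```python
from pathlib import Path


def build_design_prompt(description: str, text: str) -> str:
    """Prefix the text with a parenthesized English voice description."""
    # Tolerate descriptions that were already typed with parentheses.
    description = description.strip().strip("()")
    return f"({description}){text}"


def plan_batch(article: str, out_dir: str, description: str = "") -> list:
    """Turn pasted text into (prompt, output_path) jobs, one per non-empty line."""
    lines = [ln.strip() for ln in article.splitlines() if ln.strip()]
    jobs = []
    for index, line in enumerate(lines, start=1):
        prompt = build_design_prompt(description, line) if description else line
        # Zero-padded names keep the files in reading order when sorted.
        jobs.append((prompt, Path(out_dir) / f"line_{index:04d}.wav"))
    return jobs
```

Each job is small and independent, which is the point of the batch mode: one short sentence is synthesized at a time instead of one very long input.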
| Item | Recommendation |
|---|---|
| OS | Windows 10 / 11 (64-bit) |
| Python | 3.8 – 3.11 |
| GPU | NVIDIA GPU with ≥ 6 GB VRAM + CUDA for near-instant generation |
| Model Size | ~2 B parameters; weights ≈ 4.63 GB on disk |
Note: If no GPU is available, the application falls back to CPU mode automatically (expect ~10–30× slower generation).
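To check a machine against the requirements above before installing, something like the following could be used. It is a minimal sketch that assumes the application runs on PyTorch (the README does not say so explicitly); `python_supported` and `pick_device` are hypothetical helper names, not part of the project.

```python
import sys

SUPPORTED_MIN = (3, 8)   # lower bound from the requirements table
SUPPORTED_MAX = (3, 11)  # upper bound from the requirements table


def python_supported(version=None) -> bool:
    """Return True if the (major, minor) version is within 3.8 to 3.11."""
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


def pick_device() -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch  # assumption: the app is PyTorch-based
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Python version supported:", python_supported())
    print("Device:", pick_device())
```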
    git clone https://github.com/begin0808/VoxCPM2.git
    cd VoxCPM2

A clean virtual environment (Anaconda or venv) is strongly recommended.

    pip install -r requirements.txt

    python Studio0808_VoxCPM.py

This workstation is fully offline: no audio or text is ever uploaded to the cloud.
- Open the app and navigate to the System Settings tab.
- Select a download mirror:
  - Hugging Face (recommended for most regions; highest bandwidth)
  - ModelScope / Mirror (recommended for users in mainland China)
- Click "Start Download / Check Model".
- The downloader supports HTTP Range requests (resume) and auto-retry. If the connection drops, simply click again to resume from where it left off. Once downloaded, the app can run entirely offline.
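The resume-and-retry behavior can be illustrated with the standard library alone. This is a sketch of the general HTTP Range technique, not the application's downloader; the function names, retry count, and back-off delays are assumptions.

```python
import os
import time
import urllib.error
import urllib.request


def build_resume_request(url: str, dest_path: str):
    """Return (request, bytes_on_disk); the request asks only for missing bytes."""
    on_disk = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    request = urllib.request.Request(url)
    if on_disk:
        request.add_header("Range", f"bytes={on_disk}-")
    return request, on_disk


def download_with_resume(url: str, dest_path: str, retries: int = 3) -> None:
    """Download url to dest_path, continuing a partial file when possible."""
    for attempt in range(1, retries + 1):
        request, _ = build_resume_request(url, dest_path)
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                # 206 means the server honored the Range header, so append;
                # a plain 200 means it sent the whole file, so start over.
                mode = "ab" if response.status == 206 else "wb"
                with open(dest_path, mode) as out:
                    while chunk := response.read(1 << 20):
                        out.write(chunk)
            return
        except urllib.error.HTTPError as err:
            if err.code == 416:
                # Range starts at end of file: most likely nothing left to fetch.
                return
            if attempt == retries:
                raise
        except (urllib.error.URLError, OSError):
            if attempt == retries:
                raise
        time.sleep(2 ** attempt)  # simple exponential back-off before retrying
```

A more careful version would also compare the final file size against the server's reported length or a checksum, since a connection that closes early can leave a truncated file without raising an error.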
- Homophone Substitution — Since the input text is only used to drive speech synthesis (listeners hear the audio, not the text), you can freely swap in a homophone that the model reads correctly.
  - Example:
    - If the model reads 連假 with the wrong tone, replace it with a homophone like 連架 or 連價.
    - If the rare character 飆 in 狂飆 is mispronounced, swap the word for a common homophone like 狂標.
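If the same words keep being misread, the substitutions can be kept in a small table and applied to the text before it is pasted into the app. A minimal sketch follows; `apply_homophones` and the table are illustrative and not a feature of the application.

```python
# Word the model misreads -> homophone it reads correctly (examples from above).
SUBSTITUTIONS = {
    "連假": "連架",
    "狂飆": "狂標",
}


def apply_homophones(text: str, substitutions: dict) -> str:
    """Replace each listed word with its homophone before synthesis."""
    # Longest keys first, so a longer phrase wins over a shorter one inside it.
    for original in sorted(substitutions, key=len, reverse=True):
        text = text.replace(original, substitutions[original])
    return text
```

The replacements run one after another, so a replacement string should not itself contain another key from the table.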
- 3–10 seconds is the sweet spot. Longer reference audio (e.g., > 30 seconds) consumes a disproportionate share of the autoregressive model's attention window, often causing repetition loops, hallucinations, or premature cutoffs in the generated output. Aim for 5–8 seconds of clean, background-music-free speech.
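A reference clip's length can be checked before use with Python's built-in `wave` module. This is a sketch only: the thresholds come from the guidance above, while the category labels and function names are assumptions, and `wave` reads only uncompressed PCM WAV files.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def classify_reference(seconds: float) -> str:
    """Label a clip length against the 3 to 10 second recommendation."""
    if seconds < 3:
        return "too short"
    if seconds <= 10:
        return "ok"
    if seconds <= 30:
        return "longer than recommended"
    return "too long"
```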
- This project is built on top of OpenBMB's open-source VoxCPM2 model, released under the Apache License 2.0.
- This tool is intended for academic research, testing, and technical evaluation only. Please comply with all applicable laws and regulations — do not use synthesized speech for illegal or unauthorized purposes.
- Copyright © 2026 Studio0808.