🎙️ Studio0808 VoxCPM — AI Text-to-Speech Workstation

An offline desktop GUI application built on top of OpenBMB's state-of-the-art VoxCPM2 — a 2-billion-parameter open-source text-to-speech large language model. It bundles voice design, voice cloning, ultimate cloning, and a built-in live recording studio into one tool, so you can experience high-quality AI speech synthesis on your Windows PC without writing a single line of code.

🌟 Key Features

1. 🎤 Live Voice Recording (NEW!)

Dynamic Microphone Selection — Automatically detects and lists every available audio input device on your system.
Real-Time Waveform Visualization — A built-in waveform monitor shows pink ripple animations while recording, giving you instant visual confirmation that audio is being captured.
One-Click Apply — After recording, apply the WAV file and its transcript directly to the Voice Clone or Ultimate Clone tab with a single click.

2. ✨ Voice Design

Create from Scratch — No reference audio needed. Simply prepend a parenthesized English description to your text (e.g., (A gentle young female voice, smiling)) and the model will generate a unique voice that matches your description.
Quick Preset Panel — Choose from 10+ curated voice presets covering Mandarin, Japanese, English, Korean, and more — one click to apply.
📋 Batch Synthesis (Audiobook Mode) — Paste a long article or novel line-by-line (one sentence per line) and the system synthesizes each line into a separate audio file in the background, preventing memory overflow or context-length crashes.
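The description prefix and the one-line-per-file batch mode can be sketched in a few lines of Python. This is only an illustration of the idea, not the application's own code: the function names `build_design_prompt` and `plan_batch`, the output file naming, and the single space between the description and the text are all assumptions.

```python
from typing import List, Tuple


def build_design_prompt(description: str, text: str) -> str:
    """Prepend a parenthesized voice description to the text to be spoken."""
    # Assumption: one space separates the description from the text.
    return f"({description.strip()}) {text.strip()}"


def plan_batch(article: str, description: str, prefix: str = "line") -> List[Tuple[str, str]]:
    """Return (output_filename, prompt) pairs, one per non-empty input line."""
    lines = [ln.strip() for ln in article.splitlines() if ln.strip()]
    return [
        (f"{prefix}_{i:04d}.wav", build_design_prompt(description, ln))
        for i, ln in enumerate(lines, start=1)
    ]


# Hypothetical usage: two sentences become two independent synthesis jobs.
jobs = plan_batch(
    "Hello there.\n\nHow are you?",
    "A gentle young female voice, smiling",
)
for filename, prompt in jobs:
    print(filename, "<-", prompt)
```

Keeping each line as its own job means every synthesis call sees only a short input, which is the reason batch mode avoids the memory and context-length problems mentioned above.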

3. 👥 Voice Clone

Ultra-Short Reference Audio — Only 3–10 seconds of clean vocal recording is needed to quickly replicate a speaker's tonal characteristics.
Randomized Takes — The random seed is deliberately left unfixed by default: the timbre stays consistent, but the pacing, pauses, and intonation vary with every generation, letting you cherry-pick the most natural-sounding take.
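To illustrate what an unfixed seed means in practice, here is a minimal sketch using only Python's standard library. How the application actually seeds the model is not shown in this README, so `take_seeds` and its parameters are hypothetical.

```python
import random
from typing import List, Optional


def take_seeds(n_takes: int, fixed_seed: Optional[int] = None) -> List[int]:
    """Return one seed per take.

    With fixed_seed=None every call draws fresh seeds, so each take differs.
    Passing a fixed seed makes the whole list reproducible.
    """
    rng = random.Random(fixed_seed)  # None -> seeded from OS entropy
    return [rng.randrange(2**32) for _ in range(n_takes)]


# Unfixed: a different list on every run.
print(take_seeds(3))
# Fixed: the same list on every run, useful for regenerating a take you liked.
print(take_seeds(3, fixed_seed=1234) == take_seeds(3, fixed_seed=1234))
```

Recording the seed alongside each generated file is one way to regenerate a favorite take later, assuming the synthesis backend accepts a seed.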

4. 👑 Ultimate Clone

Seamless Continuation — Supply both a reference audio clip and its exact transcript. The model treats the reference as a timeline prefix and seamlessly continues speaking your target text, perfectly inheriting the speaker's breathing, pitch contour, ambient noise, and emotion.

⚙️ System Requirements

Item	Recommendation
OS	Windows 10 / 11 (64-bit)
Python	3.8 – 3.11
GPU	NVIDIA GPU with ≥ 6 GB VRAM + CUDA for near-instant generation
Model Size	~2 B parameters; weights ≈ 4.63 GB on disk

Note: If no GPU is available, the application falls back to CPU mode automatically (expect roughly 10–30× slower generation).
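The fallback can be pictured with the following sketch. It assumes a PyTorch backend, which this README does not state explicitly, and `pick_device` is a hypothetical helper rather than a function from the application.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment; assume CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

Running this snippet inside the same virtual environment as the app is a quick way to check whether CUDA is visible before you start a long generation.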

🚀 Quick Start

1. Clone the Repository

git clone https://github.com/begin0808/VoxCPM2.git
cd VoxCPM2

2. Install Dependencies

A clean virtual environment (Anaconda or venv) is strongly recommended.

pip install -r requirements.txt

3. Launch

python Studio0808_VoxCPM.py

📂 Model Deployment (First Run)

After the one-time model download described below, this workstation runs fully offline. No audio or text is ever uploaded to the cloud.

1. Open the app and navigate to the System Settings tab.
2. Select a download mirror:
   - Hugging Face (recommended for most regions; highest bandwidth)
   - ModelScope / Mirror (recommended for users in mainland China)
3. Click "Start Download / Check Model".
The downloader supports HTTP Range requests (resume) and auto-retry. If the connection drops, simply click again to resume from where it left off. Once downloaded, the app can run entirely offline.
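For readers curious how resume-and-retry over HTTP Range requests works in general, here is a minimal standard-library sketch. It is not the application's downloader: the function names, the `.part` suffix, the retry count, and the back-off are all assumptions made for illustration.

```python
import os
import time
import urllib.error
import urllib.request
from typing import Dict


def resume_headers(part_path: str) -> Dict[str, str]:
    """Build a Range header that asks only for the bytes not yet on disk."""
    done = os.path.getsize(part_path) if os.path.exists(part_path) else 0
    return {"Range": f"bytes={done}-"} if done else {}


def download_with_resume(url: str, dest: str, retries: int = 3, chunk: int = 1 << 20) -> None:
    """Download url to dest, appending to a partial file and retrying on errors."""
    part = dest + ".part"
    for attempt in range(1, retries + 1):
        try:
            req = urllib.request.Request(url, headers=resume_headers(part))
            with urllib.request.urlopen(req, timeout=30) as resp:
                # 206 means the server honoured the Range header; a plain 200
                # means it is sending the whole file, so start over.
                mode = "ab" if resp.status == 206 else "wb"
                expected = resp.headers.get("Content-Length")
                written = 0
                with open(part, mode) as fh:
                    while True:
                        block = resp.read(chunk)
                        if not block:
                            break
                        fh.write(block)
                        written += len(block)
                if expected is not None and written != int(expected):
                    raise OSError("connection closed before the body was complete")
            os.replace(part, dest)  # promote the finished file
            return
        except (urllib.error.URLError, OSError):
            if attempt == retries:
                raise
            time.sleep(2 * attempt)  # simple linear back-off
```

Because progress is kept in the `.part` file, calling the function again after a failure picks up from the last byte written, which mirrors the "click again to resume" behaviour described above. A production downloader would typically also verify a checksum once the file is complete.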

❓ Tips & FAQ

Q: What if the model mispronounces certain characters (e.g., polyphonic or rare characters)?

Homophone Substitution — Since the input text is only used to drive speech synthesis (listeners hear the audio, not the text), you can freely swap in a homophone that the model reads correctly.
Example:
- If the model reads 連假 with the wrong tone, replace it with a homophone like 連架 or 連價.
- If 狂飆 is mispronounced because of the rare character 飆, swap it for a homophone written with a common character, such as 狂標.
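If you run into the same mispronunciations repeatedly, the substitution can be automated before pasting text into the app. The sketch below uses the two pairs from the examples above; `apply_homophones` and the `HOMOPHONES` table are hypothetical names, not part of the application.

```python
from typing import Dict


def apply_homophones(text: str, table: Dict[str, str]) -> str:
    """Replace each mispronounced word with a homophone the model reads correctly."""
    # Replace longer keys first so a short key never clobbers part of a longer one.
    for wrong in sorted(table, key=len, reverse=True):
        text = text.replace(wrong, table[wrong])
    return text


# Substitution pairs taken from the examples above.
HOMOPHONES = {"連假": "連架", "狂飆": "狂標"}

print(apply_homophones("連假期間車速狂飆", HOMOPHONES))
```

Keep the original text separately if you also need it for subtitles or transcripts, since the substituted version is only meant to be heard.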

Q: What is the ideal reference audio length for voice cloning?

3–10 seconds is the sweet spot. Longer reference audio (e.g., > 30 seconds) consumes a disproportionate share of the autoregressive model's attention window, often causing repetition loops, hallucinations, or premature cutoffs in the generated output. Aim for 5–8 seconds of clean, background-music-free speech.
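A quick way to check a clip before using it as a reference is to measure its duration with Python's built-in `wave` module. This is a minimal sketch for uncompressed PCM WAV files; the function names and the returned labels are illustrative assumptions.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def check_reference_length(path: str, lo: float = 3.0, hi: float = 10.0) -> str:
    """Classify a reference clip against the 3-10 second range suggested above."""
    seconds = wav_duration_seconds(path)
    if seconds < lo:
        return "too short"
    if seconds > hi:
        return "too long"
    return "ok"
```

For example, `check_reference_length("my_voice.wav")` (a hypothetical file name) returns `"ok"` for a 6-second clip and `"too long"` for a 30-second one, in which case trimming the clip is the simplest fix.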

📜 License & Disclaimer

This project is built on top of OpenBMB's open-source VoxCPM2 model, released under the Apache License 2.0.
This tool is intended for academic research, testing, and technical evaluation only. Please comply with all applicable laws and regulations — do not use synthesized speech for illegal or unauthorized purposes.
