🎙️ VoxCPM2 - Full Capability Tools

Complete toolkit for openbmb/VoxCPM2 — a 2B parameter tokenizer-free diffusion TTS model with voice cloning, voice design, and multilingual synthesis.

📦 What's Included

| File | Purpose |
| --- | --- |
| `voxcpm2_local_laptop.py` | Local inference script, optimized for a Ryzen 7 CPU with 16GB RAM |
| `VoxCPM2_Colab_Notebook.ipynb` | Google Colab notebook: free T4 GPU, all capabilities + Gradio UI |
| `README.md` | This file: full documentation |

🚀 Quick Start (Google Colab — FREE)

Open the notebook in Colab:
- Download: VoxCPM2_Colab_Notebook.ipynb
- Or open directly: Open in Colab
Run all cells top to bottom. The notebook installs voxcpm, downloads the ~4.6GB model, then demonstrates:
- 🔊 Basic TTS
- 🎨 Voice Design
- 🌐 Multilingual (30+ languages)
- 👤 Zero-Shot Voice Cloning (upload your voice!)
- 🎵 Hi-Fi Ultimate Cloning
- 📡 Streaming Generation
- 🖥️ Interactive Gradio UI with public URL
GPU memory use is tuned for the free T4 tier (about 8GB of the 16GB VRAM is used).

💻 Local Laptop (Ryzen 7 + 16GB RAM)

Install

pip install voxcpm soundfile torch

Run All Demos

python voxcpm2_local_laptop.py --demo

Run Specific Modes

# Basic TTS
python voxcpm2_local_laptop.py --text "Hello world"

# Voice Design (natural language voice control)
python voxcpm2_local_laptop.py --mode design \
  --description "warm female voice" \
  --text "Hello there"

# Voice Cloning (needs reference WAV file)
python voxcpm2_local_laptop.py --mode clone \
  --text "This is my cloned voice" \
  --reference my_voice.wav

# Multilingual demo
python voxcpm2_local_laptop.py --mode multilingual

# Speed mode (lower timesteps = faster)
python voxcpm2_local_laptop.py --text "Hello" --timesteps 5
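
The flags used above could be wired up roughly as follows. This is a minimal sketch and not the actual contents of `voxcpm2_local_laptop.py`: the `tts` default mode name, the default of 10 timesteps, and the `validate` helper are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the commands above."""
    parser = argparse.ArgumentParser(prog="voxcpm2_local_laptop.py")
    parser.add_argument("--demo", action="store_true", help="run every demo in sequence")
    # "tts" as the name of the default mode is an assumption.
    parser.add_argument("--mode", default="tts",
                        choices=["tts", "design", "clone", "multilingual"])
    parser.add_argument("--text", help="text to synthesize")
    parser.add_argument("--description", help="voice description for --mode design")
    parser.add_argument("--reference", help="reference WAV for --mode clone")
    # A default of 10 is an assumption, taken from the 'balanced' tier.
    parser.add_argument("--timesteps", type=int, default=10)
    return parser


def validate(args: argparse.Namespace) -> list:
    """Return a list of problems; an empty list means the flags are consistent."""
    problems = []
    if args.mode == "clone" and not args.reference:
        problems.append("--mode clone needs --reference")
    if args.mode == "design" and not args.description:
        problems.append("--mode design needs --description")
    if args.timesteps < 1:
        problems.append("--timesteps must be at least 1")
    return problems


if __name__ == "__main__":
    cli_args = build_parser().parse_args()
    for problem in validate(cli_args):
        print("error:", problem)
```

Checking flag combinations before the model loads avoids waiting on a multi-gigabyte download only to fail on a missing `--reference`.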

🎛️ All Capabilities Explained

1. 🔊 Basic TTS

Just text → audio. Fastest mode. 48kHz studio-quality output.

2. 🎨 Voice Design

Control voice characteristics with natural language descriptions:

"(A young woman, gentle and soothing voice) Hello!"
"(A deep male narrator, professional tone) Welcome."
"(A robot, monotone synthetic voice) System online."

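The examples above share one pattern: a description in parentheses followed by the text to speak. A small helper can build that string. `design_prompt` is a hypothetical name, and the sketch only assumes the parenthesized format shown in the examples.

```python
def design_prompt(description: str, text: str) -> str:
    """Prefix text with a parenthesized voice description, as in the examples above."""
    # Tolerate descriptions that already arrive wrapped in parentheses.
    description = description.strip().strip("()").strip()
    if not description:
        return text.strip()
    return f"({description}) {text.strip()}"


if __name__ == "__main__":
    print(design_prompt("A deep male narrator, professional tone", "Welcome."))
```
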
3. 👤 Zero-Shot Voice Cloning

Clone a voice from a 3-10 second audio sample: upload a WAV file and the model imitates the speaker without any fine-tuning.
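
Since the sample length matters, it can help to check a reference file before loading the model. The sketch below uses only the standard-library `wave` module, so it assumes an uncompressed PCM WAV. The function names are hypothetical, and the 3-10 second bounds come from the description above.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed from its frame count and sample rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_usable_reference(path: str, min_seconds: float = 3.0,
                        max_seconds: float = 10.0) -> bool:
    """True when the clip length falls inside the recommended 3-10 second window."""
    return min_seconds <= wav_duration_seconds(path) <= max_seconds
```

For example, `is_usable_reference("my_voice.wav")` could run before `--mode clone` so that a clip that is too short or too long is caught early.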

4. 🎵 Hi-Fi Ultimate Cloning

Best-quality cloning, combining:

- Prompt audio + transcript (for prosody/style)
- Reference audio (for timbre)
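
The difference between the two cloning modes comes down to which inputs are supplied. The sketch below models that with a small dataclass. `CloneInputs` and its field names are hypothetical and do not claim to match the `voxcpm` API; the only rule taken from the list above is that prompt audio comes paired with its transcript.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CloneInputs:
    """Hypothetical container for cloning inputs; field names are illustrative."""
    reference_wav: str                 # timbre source
    prompt_wav: Optional[str] = None   # prosody/style source
    prompt_text: Optional[str] = None  # transcript of prompt_wav

    def mode(self) -> str:
        """Return 'hifi' when a prompt pair is present, otherwise 'zero-shot'."""
        if (self.prompt_wav is None) != (self.prompt_text is None):
            raise ValueError("prompt audio and its transcript must be given together")
        return "hifi" if self.prompt_wav else "zero-shot"
```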

5. 🌐 Multilingual (30+ Languages)

No language tags needed. Just write in the target language:

English, Chinese, Spanish, French, German, Japanese, Korean
Arabic, Hindi, Portuguese, Russian, Italian, Dutch, Polish
Turkish, Vietnamese, Thai, Indonesian, and more

6. 📡 Streaming Generation

Generate long texts chunk-by-chunk. Memory-efficient for audiobooks.
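
One way to keep memory bounded on long inputs is to split the text at sentence boundaries and synthesize one chunk at a time. The sketch below covers only that text-side step and is not the model's streaming interface. `chunk_text` and the 200-character default are assumptions for illustration.

```python
import re

# Split after Latin sentence punctuation followed by whitespace,
# or directly after CJK sentence punctuation (which has no trailing space).
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def chunk_text(text: str, max_chars: int = 200) -> list:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact rather than cut
    mid-sentence. Sentences inside a chunk are joined with one space.
    """
    sentences = [s.strip() for s in _SENTENCE_BREAK.split(text.strip()) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized and written out before the next one starts, so peak memory depends on the chunk size rather than on the length of the whole book.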

⚙️ Speed vs Quality

| Timesteps | Quality | Speed | Best For |
| --- | --- | --- | --- |
| 4-5 | Draft | ⚡ Fast | Testing |
| 8-10 | Good | 🚀 Normal | Default, balanced |
| 15-20 | High | 🐢 Slow | Voice cloning |
| 25-30 | Best | 🐌 Very Slow | Audiobooks |

Parameter: inference_timesteps — lower = faster, higher = better quality.
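
The table can be turned into named presets so that callers pick a quality tier instead of a raw number. The tier names and ranges below are copied from the table. `timesteps_for` and its `prefer_speed` switch are hypothetical helpers.

```python
# (low, high) timestep range for each quality tier in the table above.
TIMESTEP_PRESETS = {
    "draft": (4, 5),
    "good": (8, 10),
    "high": (15, 20),
    "best": (25, 30),
}


def timesteps_for(quality: str, prefer_speed: bool = False) -> int:
    """Pick a value for inference_timesteps from a quality tier.

    Returns the top of the tier's range by default, or the bottom of the
    range when prefer_speed is True.
    """
    low, high = TIMESTEP_PRESETS[quality.lower()]
    return low if prefer_speed else high
```

With this, `timesteps_for("good")` gives the balanced default and `timesteps_for("high", prefer_speed=True)` gives the fastest setting still in the voice-cloning tier.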

🖥️ Hardware Requirements

| Setup | VRAM/RAM | Feasibility | Notes |
| --- | --- | --- | --- |
| Colab T4 (Free) | 16GB VRAM | ✅ Perfect | `load_denoiser=False` saves ~500MB |
| Ryzen 7 + 16GB RAM | 16GB RAM | ✅ Works | CPU mode, slower but functional |
| RTX 3060/4060 | 12GB VRAM | ✅ Good | Same settings as Colab |
| Apple Silicon M1-M3 | Unified memory | ✅ Works | `device="mps"` |

🔧 Memory Optimizations Applied

Both scripts use these settings to fit in 16GB:

- `load_denoiser=False`: skip ZipEnhancer (~500MB saved)
- `optimize=False` on CPU: skip torch.compile overhead
- `optimize=True` on GPU: enable torch.compile for speed
- `device="auto"` / `"cpu"` / `"cuda"`: proper device selection
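
The settings above can be derived from the selected device. This is a sketch, not the scripts' actual code: `resolve_device` and `load_settings` are hypothetical helpers, the availability flags are passed in so the logic runs without PyTorch installed, and enabling `optimize` only for `cuda` (not `mps`) is an assumption.

```python
def resolve_device(requested: str = "auto", cuda_available: bool = False,
                   mps_available: bool = False) -> str:
    """Map 'auto' to the best available backend; pass explicit choices through."""
    if requested != "auto":
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def load_settings(device: str) -> dict:
    """Settings from the list above, keyed by the names used in this README."""
    return {
        "device": device,
        "load_denoiser": False,       # skip ZipEnhancer to save ~500MB
        "optimize": device == "cuda"  # torch.compile only on CUDA (assumption)
    }
```

In a real script the two flags would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.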

📚 Model Info

Parameters: 2B
Architecture: MiniCPM-4 → LocEnc → TSLM → RALM → LocDiT → AudioVAE V2
Output: 48kHz WAV
License: Apache-2.0 (commercial use OK)
Paper: arXiv:2509.24650

📝 License

These scripts are provided as-is for personal/educational use. The VoxCPM2 model is Apache-2.0 licensed.

🔗 Links

Paper: VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning (arXiv:2509.24650, published Sep 29, 2025)