VoxCPM2 - Full Capability Tools
Complete toolkit for openbmb/VoxCPM2, a 2B-parameter, tokenizer-free diffusion TTS model with voice cloning, voice design, and multilingual synthesis.
What's Included
| File | Purpose |
|---|---|
| voxcpm2_local_laptop.py | Local inference script, optimized for CPU on a Ryzen 7 with 16GB RAM |
| VoxCPM2_Colab_Notebook.ipynb | Google Colab notebook: free T4 GPU, all capabilities + Gradio UI |
| README.md | This file: full documentation |
Quick Start (Google Colab, Free)
Open the notebook in Colab:
- Download: VoxCPM2_Colab_Notebook.ipynb
- Or open directly: Open in Colab
Run all cells top to bottom. This installs voxcpm, downloads the ~4.6GB model, then runs:
- Basic TTS
- Voice Design
- Multilingual (30+ languages)
- Zero-Shot Voice Cloning (upload your voice!)
- Hi-Fi Ultimate Cloning
- Streaming Generation
- Interactive Gradio UI with public URL
GPU memory is optimized for the free T4 tier (~8GB VRAM used out of 16GB).
Local Laptop (Ryzen 7 + 16GB RAM)
Install
pip install voxcpm soundfile torch
Run All Demos
python voxcpm2_local_laptop.py --demo
Run Specific Modes
# Basic TTS
python voxcpm2_local_laptop.py --text "Hello world"
# Voice Design (natural language voice control)
python voxcpm2_local_laptop.py --mode design \
--description "warm female voice" \
--text "Hello there"
# Voice Cloning (needs reference WAV file)
python voxcpm2_local_laptop.py --mode clone \
--text "This is my cloned voice" \
--reference my_voice.wav
# Multilingual demo
python voxcpm2_local_laptop.py --mode multilingual
# Speed mode (lower timesteps = faster)
python voxcpm2_local_laptop.py --text "Hello" --timesteps 5
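The flags used in the commands above could be declared with `argparse` roughly as follows. This is a sketch of the command-line surface only, not the script's actual source: the default mode name `tts`, the default of 10 timesteps, and the `validate` helper are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the commands above."""
    parser = argparse.ArgumentParser(prog="voxcpm2_local_laptop.py")
    parser.add_argument("--demo", action="store_true", help="run every demo")
    parser.add_argument(
        "--mode",
        choices=["tts", "design", "clone", "multilingual"],
        default="tts",  # assumed name for the basic TTS mode
    )
    parser.add_argument("--text", help="text to synthesize")
    parser.add_argument("--description", help="voice description (design mode)")
    parser.add_argument("--reference", help="reference WAV file (clone mode)")
    parser.add_argument("--timesteps", type=int, default=10)  # assumed default
    return parser


def validate(args: argparse.Namespace) -> list[str]:
    """Return human-readable errors for flags that a mode needs but lacks."""
    errors = []
    if args.mode == "design" and not args.description:
        errors.append("--mode design requires --description")
    if args.mode == "clone" and not args.reference:
        errors.append("--mode clone requires --reference")
    return errors
```

Calling `validate(build_parser().parse_args())` before loading the model lets the script reject an incomplete command without paying for the model load first.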
All Capabilities Explained
1. Basic TTS
Text in, audio out. The fastest mode, with 48kHz studio-quality output.
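From Python, basic synthesis might look like the sketch below. It assumes the `voxcpm` package exposes `VoxCPM.from_pretrained` and a `generate` method that returns a waveform array; those two names are assumptions here, while `load_denoiser`, `inference_timesteps`, and the 48kHz rate come from this README. Check the package documentation before relying on it.

```python
SAMPLE_RATE = 48_000  # output rate stated in this README


def synthesize(text: str, out_path: str = "output.wav", timesteps: int = 10) -> str:
    """Generate speech for `text` and write it to `out_path`.

    Assumed API: VoxCPM.from_pretrained(...) and model.generate(...).
    """
    # Imported here so this module can be imported without the heavy packages.
    import soundfile as sf
    from voxcpm import VoxCPM  # assumed entry point of the voxcpm package

    # load_denoiser=False is the memory-saving setting described in this README.
    model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)
    wav = model.generate(text=text, inference_timesteps=timesteps)
    sf.write(out_path, wav, SAMPLE_RATE)
    return out_path
```

A call such as `synthesize("Hello world")` would download the model on first use, so expect the first run to take considerably longer than later ones.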
2. Voice Design
Control voice characteristics with natural language descriptions:
"(A young woman, gentle and soothing voice) Hello!"
"(A deep male narrator, professional tone) Welcome."
"(A robot, monotone synthetic voice) System online."
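The examples above share one pattern: the voice description in parentheses, followed by the text to speak. A small helper can assemble that string. `design_prompt` is a hypothetical name used for illustration, not part of the toolkit.

```python
def design_prompt(description: str, text: str) -> str:
    """Prefix `text` with a parenthesized voice description.

    Surrounding whitespace and any parentheses already wrapped around the
    description are removed first, so the result is never doubly wrapped.
    """
    description = description.strip().strip("()").strip()
    text = text.strip()
    if not description:
        return text  # no description: plain TTS input
    return f"({description}) {text}"
```

For example, `design_prompt("A deep male narrator, professional tone", "Welcome.")` returns the second example string shown above.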
3. Zero-Shot Voice Cloning
Clone a voice from a 3-10 second audio sample. Provide a reference WAV and the model imitates that speaker's voice without any fine-tuning.
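Because the reference clip should be 3-10 seconds long, it can help to check its duration before cloning. The sketch below uses only the standard-library `wave` module, which reads uncompressed PCM WAV files; the helper names are hypothetical.

```python
import wave

MIN_SECONDS = 3.0   # lower bound recommended above
MAX_SECONDS = 10.0  # upper bound recommended above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_usable_reference(path: str) -> bool:
    """True if the clip length falls inside the recommended 3-10 s range."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

Files in other formats (MP3, FLAC, float WAV) would need a different reader, such as `soundfile`.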
4. Hi-Fi Ultimate Cloning
Best quality cloning combining:
- Prompt audio + transcript (for prosody/style)
- Reference audio (for timbre)
5. Multilingual (30+ Languages)
No language tags needed. Just write in the target language:
- English, Chinese, Spanish, French, German, Japanese, Korean
- Arabic, Hindi, Portuguese, Russian, Italian, Dutch, Polish
- Turkish, Vietnamese, Thai, Indonesian, and more
6. Streaming Generation
Generate long texts chunk-by-chunk. Memory-efficient for audiobooks.
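One way to feed a long text chunk-by-chunk is to split it at sentence boundaries and synthesize each piece in turn. The sketch below shows only the splitting step; it is an illustration, not necessarily how the scripts or the model implement streaming, and the 200-character limit is an arbitrary assumption.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept intact as its own
    chunk rather than being cut mid-sentence.
    """
    # Split after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized and written out before the next one starts, so only one chunk of audio is held in memory at a time.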
Speed vs Quality
| Timesteps | Quality | Speed | Best For |
|---|---|---|---|
| 4-5 | Draft | Fast | Testing |
| 8-10 | Good | Normal | Default, balanced |
| 15-20 | High | Slow | Voice cloning |
| 25-30 | Best | Very slow | Audiobooks |
Parameter: inference_timesteps. Lower values are faster; higher values give better quality.
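The table can be turned into named presets so callers do not need to remember the numbers. The preset names and the single value picked from each range are assumptions for illustration; any value inside the listed range would serve.

```python
# One value chosen from each timestep range in the table above.
TIMESTEP_PRESETS = {
    "draft": 5,   # 4-5: testing
    "good": 10,   # 8-10: default, balanced
    "high": 20,   # 15-20: voice cloning
    "best": 30,   # 25-30: audiobooks
}


def timesteps_for(quality: str) -> int:
    """Map a quality name (case-insensitive) to an inference_timesteps value."""
    try:
        return TIMESTEP_PRESETS[quality.lower()]
    except KeyError:
        valid = ", ".join(TIMESTEP_PRESETS)
        raise ValueError(f"unknown quality {quality!r}; expected one of: {valid}") from None
```

The returned integer is what would be passed as `inference_timesteps`, or as `--timesteps` on the command line.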
Hardware Requirements
| Setup | VRAM/RAM | Feasibility | Notes |
|---|---|---|---|
| Colab T4 (Free) | 16GB GPU | Perfect | load_denoiser=False saves ~500MB |
| Ryzen 7 + 16GB RAM | 16GB RAM | Works | CPU mode, slower but functional |
| RTX 3060/4060 | 12GB GPU | Good | Same settings as Colab |
| Apple Silicon M1-M3 | Unified | Works | device="mps" |
Memory Optimizations Applied
Both scripts use these settings to fit in 16GB:
- load_denoiser=False: skips ZipEnhancer (~500MB saved)
- optimize=False on CPU: skips torch.compile overhead
- optimize=True on GPU: enables torch.compile for speed
- device="auto"/"cpu"/"cuda": selects the device explicitly or automatically
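The settings above amount to a small decision rule: pick a device, then enable `optimize` only on a CUDA GPU. The sketch below expresses that rule as plain functions. The helper names are hypothetical, and treating Apple's `mps` backend like CPU for `optimize` is an assumption, since the list above only mentions CPU and GPU.

```python
def resolve_device(
    requested: str = "auto",
    cuda_available: bool = False,
    mps_available: bool = False,
) -> str:
    """Turn "auto" into a concrete device; pass explicit choices through."""
    if requested != "auto":
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def load_kwargs(device: str) -> dict:
    """Loading options that follow the memory settings listed above."""
    return {
        "load_denoiser": False,        # skip ZipEnhancer (~500MB saved)
        "optimize": device == "cuda",  # torch.compile only on a CUDA GPU (assumption for mps)
        "device": device,
    }
```

With PyTorch installed, the two availability flags can be filled from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.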
Model Info
- Parameters: 2B
- Architecture: MiniCPM-4 → LocEnc → TSLM → RALM → LocDiT → AudioVAE V2
- Output: 48kHz WAV
- License: Apache-2.0 (commercial use OK)
- Paper: arXiv:2509.24650
License
These scripts are provided as-is for personal/educational use. The VoxCPM2 model is Apache-2.0 licensed.