SRFD-VoxCPM2
SRFD-VoxCPM2 is an adapter-only release for openbmb/VoxCPM2. It keeps the VoxCPM2 base model unchanged and provides VoxCPM LoRA weights trained with Speech Representation Frechet Distance (SR-FD), a training-time distributional regularizer for true four-step TTS.
This repository does not contain the 2B VoxCPM2 base weights. Download
openbmb/VoxCPM2 separately and load these adapters on top of it.
Released Adapters
| Adapter | Path | Removed FD target | Step | Seed-TTS EN WER | UTMOS / DNSMOS OVRL / P808 |
|---|---|---|---|---|---|
| Compact 3-target SR-FD | . and adapters/compact3_balanced/ | none | 1600 | 167/11805 = 1.4147% | 3.7637 / 3.0711 / 3.6507 |
| Remove ASR-good Whisper | ablations/remove_asr_true4_good_whisper/ | asr_true4_good_whisper | 1600 | 182/11805 = 1.5417% | 3.7650 / 3.0754 / 3.6545 |
| Remove real CTC | ablations/remove_real_ctc_content/ | real_ctc_content | 1000 | 176/11805 = 1.4909% | 3.7609 / 3.0731 / 3.6535 |
| Remove teacher CTC | ablations/remove_teacher_t10_ctc_content/ | teacher_t10_ctc_content | 900 | 175/11805 = 1.4824% | 3.7604 / 3.0756 / 3.6541 |
The compact three-target model is the default adapter and is duplicated at the repository root for convenience.
Compact SR-FD Targets
The final compact model uses three content-centered FD targets:
- asr_true4_good_whisper: Whisper content statistics from ASR-reranked good true-four-step generations.
- teacher_t10_ctc_content: CTC posterior statistics from ten-step VoxCPM2 teacher generations.
- real_ctc_content: CTC posterior statistics from real LibriTTS voice-cloning speech.
The leave-one-out adapters remove one of these targets while keeping the rest of the compact recipe unchanged. They are intended for ablation and paper reproducibility, not as recommended deployment checkpoints.
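For readers unfamiliar with Frechet distance, the sketch below shows the standard squared Frechet distance between two Gaussians, simplified to diagonal covariances so it runs with the standard library only. It is a generic illustration of the distance family that FD targets belong to. This card does not state the exact form of the SR-FD loss, so the function name, the diagonal simplification, and the toy inputs are all assumptions.

```python
import math


def diagonal_frechet_distance(mean_a, var_a, mean_b, var_b):
    """Squared Frechet distance between two diagonal-covariance Gaussians.

    Illustrative assumption, not the SR-FD training loss. With diagonal
    covariances, the general formula
        ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2))
    reduces to a per-dimension sum.
    """
    if not (len(mean_a) == len(var_a) == len(mean_b) == len(var_b)):
        raise ValueError("all inputs must have the same length")
    # Distance between the two mean vectors.
    mean_term = sum((ma - mb) ** 2 for ma, mb in zip(mean_a, mean_b))
    # Distance between the two sets of per-dimension standard deviations.
    cov_term = sum(
        (math.sqrt(va) - math.sqrt(vb)) ** 2 for va, vb in zip(var_a, var_b)
    )
    return mean_term + cov_term


# Toy statistics (hypothetical): same variances, means shifted by (3, 4).
print(diagonal_frechet_distance([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0]))  # 25.0
```

In this picture, each FD target corresponds to one set of reference statistics, and removing a target drops one such distance term from the regularizer.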
Repository Layout
| Path | Description |
|---|---|
| lora_weights.safetensors | Default compact 3-target SR-FD adapter |
| lora_config.json | Custom VoxCPM LoRA config for the default adapter |
| training_state.json | Training step marker for the default adapter |
| adapters/compact3_balanced/ | Explicit copy of the default adapter |
| ablations/remove_asr_true4_good_whisper/ | Leave-one-out adapter without the Whisper low-step target |
| ablations/remove_real_ctc_content/ | Leave-one-out adapter without the real-speech CTC target |
| ablations/remove_teacher_t10_ctc_content/ | Leave-one-out adapter without the ten-step teacher CTC target |
| configs/ | Training configs used for the compact model and ablations |
| reports/ | Upstream WER, UTMOS, DNSMOS, and ablation summaries |
| metadata/adapter_index.json | Machine-readable adapter index with hashes and source checkpoints |
lora_config.json is a custom VoxCPM LoRA config. It is not a PEFT
adapter_config.json.
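A quick sanity check on a downloaded adapter folder can catch a wrong path before the base model is loaded. The sketch below assumes every adapter folder mirrors the root layout for lora_weights.safetensors and lora_config.json, and that the config keeps its settings under a "lora_config" key, as the Quick Start code in this card does. The helper names are hypothetical.

```python
import json
import os

# Assumption: each adapter folder contains these two files, like the root.
EXPECTED_FILES = ("lora_weights.safetensors", "lora_config.json")


def missing_adapter_files(adapter_dir):
    """Return the expected files that are absent from adapter_dir."""
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(adapter_dir, name))
    ]


def read_lora_settings(adapter_dir):
    """Load the "lora_config" entry from the custom VoxCPM LoRA config."""
    config_path = os.path.join(adapter_dir, "lora_config.json")
    with open(config_path, "r", encoding="utf-8") as f:
        adapter_info = json.load(f)
    if "lora_config" not in adapter_info:
        # A PEFT adapter_config.json would not carry this key.
        raise KeyError('lora_config.json has no "lora_config" entry')
    return adapter_info["lora_config"]
```

Calling missing_adapter_files on the snapshot directory and getting an empty list is a cheap signal that the folder is usable with the loader.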
Quick Start
Install VoxCPM and helper packages:
pip install voxcpm huggingface_hub soundfile
Load the base model and the default SR-FD adapter:
import json
import os
import soundfile as sf
from huggingface_hub import snapshot_download
from voxcpm import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig
base_model = "openbmb/VoxCPM2"
adapter_dir = snapshot_download("voidful/SRFD-VoxCPM2")
with open(os.path.join(adapter_dir, "lora_config.json"), "r", encoding="utf-8") as f:
    adapter_info = json.load(f)
lora_config = LoRAConfig(**adapter_info["lora_config"])
model = VoxCPM.from_pretrained(
    hf_model_id=base_model,
    load_denoiser=False,
    optimize=True,
    lora_config=lora_config,
    lora_weights_path=adapter_dir,
)
wav = model.generate(
    text="SR-FD improves true four-step VoxCPM2 synthesis.",
    cfg_value=2.35,
    inference_timesteps=4,
    normalize=True,
)
sf.write("srfd_voxcpm2.wav", wav, model.tts_model.sample_rate)
Use an ablation adapter by pointing the LoRA loader to an ablation subfolder:
ablation_dir = os.path.join(adapter_dir, "ablations", "remove_asr_true4_good_whisper")
model.load_lora(ablation_dir)
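To run all ablations in turn, it can help to enumerate the ablation subfolders rather than hard-coding each name. The helper below is a minimal sketch that only inspects the local snapshot directory. Its name is hypothetical, and it assumes the ablations/ layout described in the Repository Layout table.

```python
import os


def list_ablation_dirs(adapter_dir):
    """Return sorted paths of the subfolders under adapter_dir/ablations."""
    ablation_root = os.path.join(adapter_dir, "ablations")
    if not os.path.isdir(ablation_root):
        return []
    return sorted(
        os.path.join(ablation_root, name)
        for name in os.listdir(ablation_root)
        # Skip stray files; only folders are treated as adapters.
        if os.path.isdir(os.path.join(ablation_root, name))
    )
```

Each returned path can then be passed to model.load_lora in the same way as ablation_dir above.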
Evaluation Notes
The headline metric is upstream Seed-TTS English WER on 1,088 prompts with 11,805 paper-facing reference words. UTMOS and DNSMOS are objective proxies for perceived quality, not human MOS ratings. The compact 3-target adapter matches the 9-target SR-FD WER frontier while using a smaller set of FD targets that is simpler to describe and easier to reproduce.
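The WER percentages in the Released Adapters table can be re-derived from the listed counts. The sketch below assumes each numerator is the total word-error count for that adapter and the denominator is the 11,805 reference words. The function name and dictionary labels are illustrative.

```python
def wer_percent(word_errors, reference_words):
    """Word error rate as a percentage of reference words."""
    if reference_words <= 0:
        raise ValueError("reference_words must be positive")
    return 100.0 * word_errors / reference_words


REFERENCE_WORDS = 11805

# Error counts copied from the Released Adapters table.
error_counts = {
    "compact3_balanced": 167,
    "remove_asr_true4_good_whisper": 182,
    "remove_real_ctc_content": 176,
    "remove_teacher_t10_ctc_content": 175,
}

for name, errors in error_counts.items():
    print(f"{name}: {wer_percent(errors, REFERENCE_WORDS):.4f}%")
```

The gap between the default adapter and the weakest ablation is 15 word errors out of 11,805, which is about 0.13 percentage points.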
License
This adapter release follows the Apache-2.0 license terms of the VoxCPM2 base
model. See openbmb/VoxCPM2 for the original model card and usage restrictions.