SRFD-VoxCPM2

SRFD-VoxCPM2 is an adapter-only release for openbmb/VoxCPM2. It leaves the VoxCPM2 base model unchanged and provides VoxCPM LoRA weights trained with Speech Representation Fréchet Distance (SR-FD), a training-time distributional regularizer for true four-step TTS.

This repository does not contain the 2B VoxCPM2 base weights. Download openbmb/VoxCPM2 separately and load these adapters on top of it.

Released Adapters

Adapter	Path	Removed FD target	Training step	Seed-TTS EN WER	UTMOS / DNSMOS OVRL / DNSMOS P808
Compact 3-target SR-FD	`.` and `adapters/compact3_balanced/`	none	1600	`167/11805 = 1.4147%`	`3.7637 / 3.0711 / 3.6507`
Remove ASR-good Whisper	`ablations/remove_asr_true4_good_whisper/`	`asr_true4_good_whisper`	1600	`182/11805 = 1.5417%`	`3.7650 / 3.0754 / 3.6545`
Remove real CTC	`ablations/remove_real_ctc_content/`	`real_ctc_content`	1000	`176/11805 = 1.4909%`	`3.7609 / 3.0731 / 3.6535`
Remove teacher CTC	`ablations/remove_teacher_t10_ctc_content/`	`teacher_t10_ctc_content`	900	`175/11805 = 1.4824%`	`3.7604 / 3.0756 / 3.6541`

The compact three-target model is the default adapter and is duplicated at the repository root for convenience.

Compact SR-FD Targets

The final compact model uses three content-centered FD targets:

asr_true4_good_whisper: Whisper content statistics from ASR-reranked good true-four-step generations.
teacher_t10_ctc_content: CTC posterior statistics from ten-step VoxCPM2 teacher generations.
real_ctc_content: CTC posterior statistics from real LibriTTS voice-cloning speech.

The leave-one-out adapters remove one of these targets while keeping the rest of the compact recipe unchanged. They are intended for ablation and paper reproducibility, not as recommended deployment checkpoints.
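For orientation, the Fréchet distance between two Gaussians fitted to feature statistics is commonly written in the closed form below. This is the standard expression used by FID-style metrics. Whether SR-FD uses exactly this form, and how it estimates the covariances, is not stated in this card, so the symbols are assumptions: $\mu_g, \Sigma_g$ denote the mean and covariance of features from generated speech, and $\mu_t, \Sigma_t$ those of one target set such as real_ctc_content.

```latex
d^2\big(\mathcal{N}(\mu_g,\Sigma_g),\,\mathcal{N}(\mu_t,\Sigma_t)\big)
  = \lVert \mu_g - \mu_t \rVert_2^2
  + \operatorname{Tr}\!\big(\Sigma_g + \Sigma_t - 2\,(\Sigma_g \Sigma_t)^{1/2}\big)
```

Under this reading, each of the three targets above would contribute one pair $(\mu_t, \Sigma_t)$, and a leave-one-out adapter would drop one pair.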

Repository Layout

Path	Description
`lora_weights.safetensors`	Default compact 3-target SR-FD adapter
`lora_config.json`	Custom VoxCPM LoRA config for the default adapter
`training_state.json`	Training step marker for the default adapter
`adapters/compact3_balanced/`	Explicit copy of the default adapter
`ablations/remove_asr_true4_good_whisper/`	Leave-one-out adapter without the ASR-good Whisper target
`ablations/remove_real_ctc_content/`	Leave-one-out adapter without the real-speech CTC target
`ablations/remove_teacher_t10_ctc_content/`	Leave-one-out adapter without the ten-step teacher CTC target
`configs/`	Training configs used for the compact model and ablations
`reports/`	Upstream WER, UTMOS, DNSMOS, and ablation summaries
`metadata/adapter_index.json`	Machine-readable adapter index with hashes and source checkpoints

lora_config.json is a custom VoxCPM LoRA config. It is not a PEFT adapter_config.json.
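Because the file is not a PEFT config, it can help to check its shape before building a LoRAConfig from it. The sketch below assumes only what the Quick Start code relies on, namely that the LoRA settings sit under a top-level "lora_config" key. The helper name read_voxcpm_lora_config is hypothetical and is not part of the voxcpm package.

```python
import json
import os


def read_voxcpm_lora_config(adapter_dir: str) -> dict:
    """Return the LoRA settings stored under the "lora_config" key.

    Raises ValueError if the file lacks that key, as a flat PEFT-style
    config would.
    """
    path = os.path.join(adapter_dir, "lora_config.json")
    with open(path, "r", encoding="utf-8") as f:
        info = json.load(f)
    # The custom VoxCPM config nests the LoRA settings one level down.
    if not isinstance(info, dict) or "lora_config" not in info:
        raise ValueError(f"{path} has no 'lora_config' entry")
    return info["lora_config"]
```

The returned dictionary is what would be unpacked into LoRAConfig in the Quick Start section.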

Quick Start

Install VoxCPM and helper packages:

pip install voxcpm huggingface_hub soundfile

Load the base model and the default SR-FD adapter:

import json
import os

import soundfile as sf
from huggingface_hub import snapshot_download
from voxcpm import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

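base_model = "openbmb/VoxCPM2"
# Download the adapter repository only; it does not contain the base weights.
adapter_dir = snapshot_download("voidful/SRFD-VoxCPM2")

# lora_config.json is the custom VoxCPM config; the LoRA settings sit under "lora_config".
with open(os.path.join(adapter_dir, "lora_config.json"), "r", encoding="utf-8") as f:
    adapter_info = json.load(f)

lora_config = LoRAConfig(**adapter_info["lora_config"])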

model = VoxCPM.from_pretrained(
    hf_model_id=base_model,
    load_denoiser=False,
    optimize=True,
    lora_config=lora_config,
    lora_weights_path=adapter_dir,
)

wav = model.generate(
    text="SR-FD improves true four-step VoxCPM2 synthesis.",
    cfg_value=2.35,
    inference_timesteps=4,
    normalize=True,
)

sf.write("srfd_voxcpm2.wav", wav, model.tts_model.sample_rate)

Use an ablation adapter by pointing the LoRA loader to an ablation subfolder:

ablation_dir = os.path.join(adapter_dir, "ablations", "remove_asr_true4_good_whisper")
model.load_lora(ablation_dir)
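To switch between several adapters without retyping paths, the subfolders listed in the Released Adapters table can be kept in one lookup. This is a minimal sketch: the relative paths come from that table, while the short labels and the adapter_path helper are hypothetical names chosen for illustration.

```python
import os

# Relative paths copied from the Released Adapters table; the short keys are
# hypothetical labels chosen for this sketch.
ADAPTER_SUBDIRS = {
    "compact3": os.path.join("adapters", "compact3_balanced"),
    "no_whisper": os.path.join("ablations", "remove_asr_true4_good_whisper"),
    "no_real_ctc": os.path.join("ablations", "remove_real_ctc_content"),
    "no_teacher_ctc": os.path.join("ablations", "remove_teacher_t10_ctc_content"),
}


def adapter_path(snapshot_dir: str, name: str) -> str:
    """Map a short adapter label to its directory inside a downloaded snapshot."""
    try:
        subdir = ADAPTER_SUBDIRS[name]
    except KeyError:
        raise KeyError(
            f"unknown adapter {name!r}; choose from {sorted(ADAPTER_SUBDIRS)}"
        ) from None
    return os.path.join(snapshot_dir, subdir)
```

With the objects from the Quick Start code, the call would then read model.load_lora(adapter_path(adapter_dir, "no_real_ctc")).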

Evaluation Notes

The headline metric is English WER on the upstream Seed-TTS test set: 1,088 prompts with 11,805 reference words (the count used in the paper), which is the denominator of every WER in the Released Adapters table. UTMOS and DNSMOS are objective proxies, not human MOS. The compact 3-target adapter matches the WER frontier of the 9-target SR-FD configuration while using fewer FD targets, which makes the recipe simpler and easier to reproduce.
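The percentages in the Released Adapters table follow directly from the error counts and the 11,805-word denominator. The sketch below recomputes them. The error counts are copied from that table, and the function name wer_percent is a hypothetical helper for illustration.

```python
REFERENCE_WORDS = 11805  # Seed-TTS EN reference word count stated above

# Word-error counts copied from the Released Adapters table.
ERROR_COUNTS = {
    "compact3_balanced": 167,
    "remove_asr_true4_good_whisper": 182,
    "remove_real_ctc_content": 176,
    "remove_teacher_t10_ctc_content": 175,
}


def wer_percent(errors: int, reference_words: int = REFERENCE_WORDS) -> float:
    """Corpus-level WER as a percentage: total errors over total reference words."""
    return 100.0 * errors / reference_words


for name, errors in ERROR_COUNTS.items():
    print(f"{name}: {errors}/{REFERENCE_WORDS} = {wer_percent(errors):.4f}%")
```

The largest gap in the table, 182 versus 167 errors, is 15 words, or about 0.13 percentage points.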

License

This adapter release follows the Apache-2.0 license terms of the VoxCPM2 base model. See openbmb/VoxCPM2 for the original model card and usage restrictions.
