SRFD-VoxCPM2

SRFD-VoxCPM2 is an adapter-only release for openbmb/VoxCPM2. It keeps the VoxCPM2 base model unchanged and provides VoxCPM LoRA weights trained with Speech Representation Frechet Distance (SR-FD), a training-time distributional regularizer for true four-step TTS.

This repository does not contain the 2B VoxCPM2 base weights. Download openbmb/VoxCPM2 separately and load these adapters on top of it.

Released Adapters

| Adapter | Path | Removed FD target | Training step | Seed-TTS EN WER | UTMOS / DNSMOS OVRL / P808 |
| --- | --- | --- | --- | --- | --- |
| Compact 3-target SR-FD | `.` (repository root) and `adapters/compact3_balanced/` | none | 1600 | 167/11805 = 1.4147% | 3.7637 / 3.0711 / 3.6507 |
| Remove ASR-good Whisper | `ablations/remove_asr_true4_good_whisper/` | `asr_true4_good_whisper` | 1600 | 182/11805 = 1.5417% | 3.7650 / 3.0754 / 3.6545 |
| Remove real CTC | `ablations/remove_real_ctc_content/` | `real_ctc_content` | 1000 | 176/11805 = 1.4909% | 3.7609 / 3.0731 / 3.6535 |
| Remove teacher CTC | `ablations/remove_teacher_t10_ctc_content/` | `teacher_t10_ctc_content` | 900 | 175/11805 = 1.4824% | 3.7604 / 3.0756 / 3.6541 |

The compact three-target model is the default adapter and is duplicated at the repository root for convenience.

Compact SR-FD Targets

The final compact model uses three content-centered FD targets:

  1. asr_true4_good_whisper: Whisper content statistics from ASR-reranked good true-four-step generations.
  2. teacher_t10_ctc_content: CTC posterior statistics from ten-step VoxCPM2 teacher generations.
  3. real_ctc_content: CTC posterior statistics from real LibriTTS voice-cloning speech.
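For readers unfamiliar with Frechet-style distances, the generic form used by metrics such as FID is shown below as background. This is an assumption about the general family, not a statement of the exact SR-FD loss: this card does not specify how the statistics are computed or combined. Here $\mu_g, \Sigma_g$ stand for the mean and covariance of features from generated speech, and $\mu_t, \Sigma_t$ for the stored statistics of one target; both symbol names are illustrative.

```latex
% Generic Frechet distance between two Gaussian fits (background only;
% the exact SR-FD formulation is not given in this card).
\mathrm{FD}(g, t) = \lVert \mu_g - \mu_t \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_g + \Sigma_t - 2\,(\Sigma_g \Sigma_t)^{1/2} \right)
```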

The leave-one-out adapters remove one of these targets while keeping the rest of the compact recipe unchanged. They are intended for ablation and paper reproducibility, not as recommended deployment checkpoints.
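The leave-one-out design can be sketched in a few lines of Python. The target names and the `remove_<target>` naming come from the tables in this card; the helper `leave_one_out` is hypothetical and only illustrates how the three ablation sets relate to the compact recipe.

```python
# The three FD targets used by the compact model, as listed above.
TARGETS = [
    "asr_true4_good_whisper",
    "teacher_t10_ctc_content",
    "real_ctc_content",
]


def leave_one_out(targets):
    """Map each ablation name to the targets that remain after one removal."""
    runs = {}
    for removed in targets:
        kept = [t for t in targets if t != removed]
        # Ablation folders in this repository are named "remove_<target>".
        runs[f"remove_{removed}"] = kept
    return runs


ABLATIONS = leave_one_out(TARGETS)

for name, kept in ABLATIONS.items():
    print(f"ablations/{name}/ keeps {kept}")
```

Each ablation keeps exactly two of the three targets, so any WER difference from the compact model can be attributed to the single removed target.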

Repository Layout

| Path | Description |
| --- | --- |
| `lora_weights.safetensors` | Default compact 3-target SR-FD adapter |
| `lora_config.json` | Custom VoxCPM LoRA config for the default adapter |
| `training_state.json` | Training step marker for the default adapter |
| `adapters/compact3_balanced/` | Explicit copy of the default adapter |
| `ablations/remove_asr_true4_good_whisper/` | Leave-one-out adapter without the Whisper low-step target |
| `ablations/remove_real_ctc_content/` | Leave-one-out adapter without the real-speech CTC target |
| `ablations/remove_teacher_t10_ctc_content/` | Leave-one-out adapter without the ten-step teacher CTC target |
| `configs/` | Training configs used for the compact model and ablations |
| `reports/` | Upstream WER, UTMOS, DNSMOS, and ablation summaries |
| `metadata/adapter_index.json` | Machine-readable adapter index with hashes and source checkpoints |

lora_config.json is a custom VoxCPM LoRA config. It is not a PEFT adapter_config.json.
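Because `metadata/adapter_index.json` records hashes, a downloaded weight file can be checked before loading. The sketch below is a minimal example under two assumptions: the hashes are SHA-256 hex digests, and the expected value is read from the index by hand. The schema of the index is not documented in this card, so no field names are used, and both helper functions are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file digest with an expected hex string (assumed SHA-256)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_expected("lora_weights.safetensors", expected)` returns `True` only when the local file matches the digest copied from the index, assuming that digest is SHA-256.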

Quick Start

Install VoxCPM and helper packages:

```bash
pip install voxcpm huggingface_hub soundfile
```

Load the base model and the default SR-FD adapter:

```python
import json
import os

import soundfile as sf
from huggingface_hub import snapshot_download
from voxcpm import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

base_model = "openbmb/VoxCPM2"
# Download the adapter repository; the base weights are fetched separately below.
adapter_dir = snapshot_download("voidful/SRFD-VoxCPM2")

# lora_config.json is the custom VoxCPM LoRA config, not a PEFT adapter_config.json.
with open(os.path.join(adapter_dir, "lora_config.json"), "r", encoding="utf-8") as f:
    adapter_info = json.load(f)

lora_config = LoRAConfig(**adapter_info["lora_config"])

# Load the unchanged base model and apply the default adapter from the repository root.
model = VoxCPM.from_pretrained(
    hf_model_id=base_model,
    load_denoiser=False,
    optimize=True,
    lora_config=lora_config,
    lora_weights_path=adapter_dir,
)

# Generate with four inference steps, the setting these adapters target.
wav = model.generate(
    text="SR-FD improves true four-step VoxCPM2 synthesis.",
    cfg_value=2.35,
    inference_timesteps=4,
    normalize=True,
)

sf.write("srfd_voxcpm2.wav", wav, model.tts_model.sample_rate)
```

Use an ablation adapter by pointing the LoRA loader to an ablation subfolder:

```python
ablation_dir = os.path.join(adapter_dir, "ablations", "remove_asr_true4_good_whisper")
model.load_lora(ablation_dir)
```
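When switching between several adapters, a small lookup keeps the folder names in one place. The relative paths below are taken from the repository layout in this card; the short adapter names and the `resolve_adapter_dir` helper are hypothetical conveniences, not part of VoxCPM.

```python
import os

# Relative adapter folders, as listed in the repository layout.
ADAPTER_PATHS = {
    "compact3_balanced": os.path.join("adapters", "compact3_balanced"),
    "remove_asr_true4_good_whisper": os.path.join(
        "ablations", "remove_asr_true4_good_whisper"
    ),
    "remove_real_ctc_content": os.path.join("ablations", "remove_real_ctc_content"),
    "remove_teacher_t10_ctc_content": os.path.join(
        "ablations", "remove_teacher_t10_ctc_content"
    ),
}


def resolve_adapter_dir(snapshot_dir, name):
    """Return the local folder for a named adapter inside a downloaded snapshot."""
    if name not in ADAPTER_PATHS:
        known = ", ".join(sorted(ADAPTER_PATHS))
        raise ValueError(f"Unknown adapter {name!r}; expected one of: {known}")
    return os.path.join(snapshot_dir, ADAPTER_PATHS[name])
```

The returned path can then be passed to the loader, for example `model.load_lora(resolve_adapter_dir(adapter_dir, "remove_real_ctc_content"))`.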

Evaluation Notes

The headline metric is upstream Seed-TTS English WER, measured on 1,088 prompts containing 11,805 reference words (the paper-facing count). UTMOS and DNSMOS are objective proxies, not human MOS ratings. The compact 3-target adapter matches the WER frontier of the 9-target SR-FD configuration while using fewer FD targets, which makes the recipe simpler to describe and easier to reproduce.
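The WER percentages in the Released Adapters table can be reproduced from the error and word counts given there. The sketch below assumes each numerator is the total number of word errors over the whole test set; the `wer_percent` helper and the short adapter names are illustrative.

```python
# Word-error counts and the shared reference word count from the adapter table.
REFERENCE_WORDS = 11805
WORD_ERRORS = {
    "compact3_balanced": 167,
    "remove_asr_true4_good_whisper": 182,
    "remove_real_ctc_content": 176,
    "remove_teacher_t10_ctc_content": 175,
}


def wer_percent(errors, reference_words):
    """Corpus-level WER as a percentage: total errors over total reference words."""
    if reference_words <= 0:
        raise ValueError("reference_words must be positive")
    return 100.0 * errors / reference_words


for name, errors in WORD_ERRORS.items():
    print(f"{name}: {wer_percent(errors, REFERENCE_WORDS):.4f}%")
```

On these counts the adapters differ by at most 15 word errors out of 11,805, so the gaps between them are small in absolute terms.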

License

This adapter release follows the Apache-2.0 license terms of the VoxCPM2 base model. See openbmb/VoxCPM2 for the original model card and usage restrictions.
