VoxCPM 1.5 LoRA Fine-Tune — Tech Vocabulary

🔗 Recipe: voxcpm-recipe

LoRA adapter fine-tuned on top of openbmb/VoxCPM1.5, focused on getting the model to speak tech-heavy text correctly — code-ish phrases, symbols, abbreviations, numbers, and domain jargon that base TTS models often mangle.

Checkpoint Info


Field	Value
Run	`lora_run1`
Checkpoint	`latest` (step 2000)
Base model	`openbmb/VoxCPM1.5`
Method	LoRA (rank 16, alpha 32)
Dataset size	100 utterances

This is a full training checkpoint: it contains the LoRA adapter weights as well as the optimizer and scheduler state, so you can either run inference directly or resume training.

Files

lora_weights.safetensors — LoRA adapter weights (load this for inference)
lora_config.json — LoRA hyperparameters and target modules
optimizer.pth — optimizer state (for resuming training)
scheduler.pth — LR scheduler state (for resuming training)
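Before loading, it can help to confirm that the downloaded directory contains what you need. The following is a minimal sketch that assumes the four file names listed above; `missing_checkpoint_files` is a hypothetical helper and not part of VoxCPM.

```python
from pathlib import Path

# File names taken from the Files list above.
INFERENCE_FILES = ("lora_weights.safetensors", "lora_config.json")
RESUME_FILES = ("optimizer.pth", "scheduler.pth")


def missing_checkpoint_files(ckpt_dir, resume=False):
    """Return the expected file names that are absent from ckpt_dir.

    Hypothetical helper: checks only for presence, not file contents.
    """
    required = INFERENCE_FILES + (RESUME_FILES if resume else ())
    root = Path(ckpt_dir)
    return [name for name in required if not (root / name).is_file()]
```

An empty list means the directory is ready; pass `resume=True` to also require the optimizer and scheduler state.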

LoRA Config

Param	Value
rank (r)	16
alpha	32
dropout	0.1
enable_lm	true
enable_dit	true
enable_proj	false

Target modules (LM & DiT): q_proj, v_proj, k_proj, o_proj
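For orientation: in the standard LoRA formulation the low-rank update is scaled by alpha / r, so the settings above correspond to a scale of 2.0. The sketch below computes that scale and the number of trainable parameters added to a single projection. It assumes VoxCPM follows the standard formulation, and the 1024 layer width is purely illustrative, not taken from the model.

```python
def lora_scaling(rank, alpha):
    """Scale applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added to one linear layer.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * d_in + d_out * rank


print(lora_scaling(rank=16, alpha=32))        # 2.0
print(lora_param_count(1024, 1024, rank=16))  # 32768 (illustrative width)
```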

Inference

1. Install

pip install voxcpm
# or, from source: https://github.com/OpenBMB/VoxCPM

2. Download this LoRA checkpoint

from huggingface_hub import snapshot_download

lora_dir = snapshot_download("jeevav62/voxcpm-lora-finetune")
# contains: lora_weights.safetensors, lora_config.json, optimizer.pth, scheduler.pth
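If you only need inference, the optimizer and scheduler files can be skipped. This is a minimal sketch that assumes the file names listed under Files. `download_inference_files` is a hypothetical wrapper, while `allow_patterns` is a standard `snapshot_download` argument.

```python
# Files needed for inference only (names from the Files section).
INFERENCE_PATTERNS = ["lora_weights.safetensors", "lora_config.json"]


def download_inference_files(repo_id="jeevav62/voxcpm-lora-finetune"):
    """Download just the adapter weights and config; return the local directory."""
    # Imported lazily so this module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id, allow_patterns=INFERENCE_PATTERNS)
```

Calling `lora_dir = download_inference_files()` would then stand in for the full download above.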

3. Load base model + LoRA adapter and generate

import json
import soundfile as sf
from pathlib import Path
from voxcpm import VoxCPM
from voxcpm.modules.layers.lora import LoRAConfig

ckpt_dir = Path(lora_dir)

# Read the LoRA hyperparameters straight from lora_config.json
with open(ckpt_dir / "lora_config.json") as f:
    lora_info = json.load(f)

# Base model recorded at training time (openbmb/VoxCPM1.5 snapshot).
# Kept for reference only; the Hub ID is passed explicitly below.
base_model = lora_info["base_model"]
lora_cfg = LoRAConfig(**lora_info["lora_config"])

model = VoxCPM.from_pretrained(
    hf_model_id="openbmb/VoxCPM1.5",   # base weights
    load_denoiser=False,
    optimize=True,
    lora_config=lora_cfg,
    lora_weights_path=str(ckpt_dir),   # this repo's checkpoint dir
)

wav = model.generate(
    text="The sensor updates several times per second, e.g. 60.5 readings/sec on channel #3.",
    prompt_wav_path=None,     # optional reference WAV for voice cloning
    prompt_text=None,
    cfg_value=2.0,
    inference_timesteps=10,
    normalize=False,
    denoise=False,
)

sf.write("output.wav", wav, model.tts_model.sample_rate)

4. (Optional) A/B compare with/without the adapter

# With LoRA (default, as loaded above)
wav_lora = model.generate(text="...")

# Disable adapter to hear the base model
model.set_lora_enabled(False)
wav_base = model.generate(text="...")

What It Handles Well

Reads tech-flavored sentences naturally (sensor readings, training/session jargon, "per second", "hour", numeric phrases)
Pronounces recurring technical terms consistently across the dataset
Stable voice identity across long technical sentences

Training History — Mistakes & Recovery

This run was checked manually against generated samples partway through training:

Early checkpoints mispronounced or garbled symbols such as `#` and `.`, for example reading punctuation literally or dropping it inside technical phrases instead of treating it as a pause/silence or a known symbol name.
These mistakes were caught during a mid-run listening check, and training continued past them. By the later steps (close to step 2000, the `latest` checkpoint) the model had recovered and produced cleaner, more natural renderings of sentences containing `#` and `.`.
Takeaway: don't judge a LoRA TTS run from early checkpoints alone — symbol/punctuation handling can still be in flux and improve with more steps on the same small dataset.
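One way to make such listening checks repeatable is to regenerate the same symbol-heavy sentences at every checkpoint. The sketch below picks those sentences from a list of texts. `pick_probe_sentences` is a hypothetical helper, and the choice of `#` and an interior `.` (as in "60.5") simply mirrors the symbols mentioned above; it does not describe how this run was actually checked.

```python
import re

# A "." with non-space characters on both sides, e.g. "60.5" or "e.g".
# A sentence-final period does not match.
INTERIOR_DOT = re.compile(r"\S\.\S")


def pick_probe_sentences(texts, limit=5):
    """Return up to `limit` sentences containing '#' or an interior '.'."""
    probes = [t for t in texts if "#" in t or INTERIOR_DOT.search(t)]
    return probes[:limit]
```

Generating these probes with each saved checkpoint and listening to them side by side makes regressions and recoveries easier to spot.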

Dataset

100 short technical/conversational utterances, each paired with a WAV recording, e.g.:

{"audio": "audio1.wav", "text": "The training session lasted for one hour, and after an hour we reviewed the results together."}
{"audio": "audio2.wav", "text": "The sensor updates several times per second, and the readings per second must remain stable during testing."}

Resuming Training

Load optimizer.pth and scheduler.pth alongside lora_weights.safetensors in the VoxCPM LoRA fine-tuning script to continue from step 2000. If resuming, consider lowering the learning rate to stabilize further fine-tuning.
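As a rough illustration of lowering the learning rate after restoring state, the sketch below scales every parameter group of a PyTorch-style optimizer. It assumes the optimizer exposes `param_groups` as a list of dicts with an `lr` key, as PyTorch optimizers do. `scale_learning_rate` is a hypothetical helper and not part of the VoxCPM training script.

```python
def scale_learning_rate(optimizer, factor):
    """Multiply the learning rate of every parameter group by `factor`.

    Returns the new learning rates, one per parameter group.
    """
    for group in optimizer.param_groups:
        group["lr"] *= factor
    return [group["lr"] for group in optimizer.param_groups]
```

A scheduler restored from scheduler.pth may overwrite these values on its next step, so lowering the learning rate in the training configuration, if the script supports it, may be the simpler route.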

Credits

Base model: openbmb/VoxCPM1.5
Training/inference recipe: VoxCPM

