VoxCPM 1.5 LoRA Fine-Tune: Tech Vocabulary

GitHub Recipe

Recipe: voxcpm-recipe

LoRA adapter fine-tuned on top of openbmb/VoxCPM1.5, focused on getting the model to speak tech-heavy text correctly: code-ish phrases, symbols, abbreviations, numbers, and domain jargon that base TTS models often mangle.

Checkpoint Info

Run           lora_run1
Checkpoint    latest (step 2000)
Base model    openbmb/VoxCPM1.5
Method        LoRA (rank 16, alpha 32)
Dataset size  100 utterances

This is a full training checkpoint: it contains the LoRA adapter weights and the optimizer/scheduler state, so you can either run inference directly or resume training.

Files

  • lora_weights.safetensors: LoRA adapter weights (load this for inference)
  • lora_config.json: LoRA hyperparameters and target modules
  • optimizer.pth: optimizer state (for resuming training)
  • scheduler.pth: LR scheduler state (for resuming training)

LoRA Config

Param        Value
rank (r)     16
alpha        32
dropout      0.1
enable_lm    true
enable_dit   true
enable_proj  false

Target modules (LM & DiT): q_proj, v_proj, k_proj, o_proj
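
For orientation, rank and alpha combine in the usual LoRA way: the low-rank update is scaled by alpha / r, which is 32 / 16 = 2 for this adapter. The sketch below illustrates that textbook formulation with NumPy on a toy matrix. The dimensions and variable names are made up for illustration, and this is not VoxCPM's actual layer code.

```python
import numpy as np

rank, alpha = 16, 32          # values from the LoRA config above
scaling = alpha / rank        # standard LoRA scaling factor

d_in, d_out = 64, 64          # toy sizes, chosen only for illustration
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))         # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection, zero-initialised

def lora_forward(x):
    """Base projection plus the scaled low-rank update."""
    return x @ W.T + scaling * (x @ A.T @ B.T)

x = rng.standard_normal((2, d_in))
y = lora_forward(x)

# Trainable parameters added for one adapted projection in this toy setup
extra_params = A.size + B.size
```

Because B starts at zero in this formulation, the adapted layer initially reproduces the base projection, and only A and B receive gradients during fine-tuning.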

Inference

1. Install

pip install voxcpm
# or, from source: https://github.com/OpenBMB/VoxCPM

2. Download this LoRA checkpoint

from huggingface_hub import snapshot_download

lora_dir = snapshot_download("jeevav62/voxcpm-lora-finetune")
# contains: lora_weights.safetensors, lora_config.json, optimizer.pth, scheduler.pth
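
Before loading, it can help to confirm that the download actually contains the four files listed under Files. A minimal sketch using only the standard library; `missing_checkpoint_files` is a hypothetical helper, not part of voxcpm or huggingface_hub.

```python
from pathlib import Path

# File names taken from the Files section of this card
EXPECTED_FILES = [
    "lora_weights.safetensors",
    "lora_config.json",
    "optimizer.pth",
    "scheduler.pth",
]

def missing_checkpoint_files(ckpt_dir):
    """Return the expected file names that are absent from ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_checkpoint_files(lora_dir)` on a complete download should return an empty list. Only the first two files matter for inference.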

3. Load base model + LoRA adapter and generate

import json
import soundfile as sf
from pathlib import Path
from voxcpm import VoxCPM
from voxcpm.modules.layers.lora import LoRAConfig

ckpt_dir = Path(lora_dir)

# Read the LoRA hyperparameters straight from lora_config.json
with open(ckpt_dir / "lora_config.json") as f:
    lora_info = json.load(f)

# Recorded base model (openbmb/VoxCPM1.5 snapshot); kept for reference only,
# since the Hub ID is passed explicitly to from_pretrained below
base_model = lora_info["base_model"]
lora_cfg = LoRAConfig(**lora_info["lora_config"])

model = VoxCPM.from_pretrained(
    hf_model_id="openbmb/VoxCPM1.5",   # base weights
    load_denoiser=False,
    optimize=True,
    lora_config=lora_cfg,
    lora_weights_path=str(ckpt_dir),   # this repo's checkpoint dir
)

wav = model.generate(
    text="The sensor updates several times per second, e.g. 60.5 readings/sec on channel #3.",
    prompt_wav_path=None,     # optional reference WAV for voice cloning
    prompt_text=None,
    cfg_value=2.0,
    inference_timesteps=10,
    normalize=False,
    denoise=False,
)

sf.write("output.wav", wav, model.tts_model.sample_rate)

4. (Optional) A/B compare with/without the adapter

# With LoRA (default, as loaded above)
wav_lora = model.generate(text="...")

# Disable adapter to hear the base model
model.set_lora_enabled(False)
wav_base = model.generate(text="...")
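
When comparing the two outputs, a few basic numbers can catch obviously broken generations (truncated or near-silent audio) before listening. The sketch below assumes each output is a mono waveform that NumPy can treat as a 1-D float array; `describe_wav` is a hypothetical helper, and these statistics say nothing about pronunciation quality.

```python
import numpy as np

def describe_wav(wav, sample_rate):
    """Return duration in seconds, peak amplitude, and RMS level of a mono waveform."""
    samples = np.asarray(wav, dtype=np.float64).reshape(-1)
    if samples.size == 0:
        return {"duration_s": 0.0, "peak": 0.0, "rms": 0.0}
    return {
        "duration_s": samples.size / sample_rate,
        "peak": float(np.max(np.abs(samples))),
        "rms": float(np.sqrt(np.mean(samples ** 2))),
    }
```

For example, `describe_wav(wav_lora, model.tts_model.sample_rate)` and the same call on `wav_base` give a quick side-by-side. Writing both arrays with `sf.write` under different file names keeps them available for the actual listening comparison.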

What It Handles Well

  • Reads tech-flavored sentences naturally (sensor readings, training/session jargon, "per second", "hour", numeric phrases)
  • Pronounces recurring technical terms consistently across the dataset
  • Keeps a stable voice identity across long technical sentences

Training History: Mistakes & Recovery

This run was checked manually against generated samples partway through training:

  • Early checkpoints mispronounced or garbled symbols such as # and . (for example, reading punctuation literally, or dropping it inside technical phrases instead of treating it as a pause or a known symbol name).
  • These mistakes were caught during a mid-run listening check, and training continued past them. By the later steps (close to step 2000, the latest checkpoint) the model had recovered and rendered sentences containing the # and . symbols more cleanly and naturally.
  • Takeaway: don't judge a LoRA TTS run from early checkpoints alone. Symbol and punctuation handling can still be in flux and may improve with more steps on the same small dataset.
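
A listening check like the one described above is easier to repeat if the same probe sentences are rendered from each saved checkpoint in step order. The sketch below only handles the bookkeeping. It assumes intermediate checkpoints are saved as subdirectories named `step_<number>`, which is an assumption about the run layout rather than something this card documents. `checkpoints_by_step` and `PROBE_SENTENCES` are hypothetical names, and the second probe sentence is invented for illustration.

```python
import re
from pathlib import Path

STEP_DIR = re.compile(r"^step_(\d+)$")   # assumed naming; adjust to your run

def checkpoints_by_step(run_dir):
    """Return (step, path) pairs for step_* subdirectories, sorted by step."""
    found = []
    for child in Path(run_dir).iterdir():
        match = STEP_DIR.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    return sorted(found, key=lambda pair: pair[0])

# Probe sentences that exercise the symbols called out above
PROBE_SENTENCES = [
    "The sensor logs 60.5 readings/sec on channel #3.",
    "Issue #42 was fixed in version 1.5.",
]
```

Each returned path can then be passed as `lora_weights_path` when loading the model, and every probe sentence generated once per checkpoint for a side-by-side listen.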

Dataset

100 short technical/conversational utterances, each paired with a WAV recording, e.g.:

{"audio": "audio1.wav", "text": "The training session lasted for one hour, and after an hour we reviewed the results together."}
{"audio": "audio2.wav", "text": "The sensor updates several times per second, and the readings per second must remain stable during testing."}
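
With a dataset this small, a single bad entry matters, so a quick manifest check before training can be worthwhile. The sketch below assumes the manifest is a JSON Lines file with one object per line carrying `audio` and `text` keys, as in the examples above, and that audio paths are relative to a common folder. `validate_manifest` and its arguments are hypothetical.

```python
import json
from pathlib import Path

def validate_manifest(manifest_path, audio_root):
    """Return a list of (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    audio_root = Path(audio_root)
    with open(manifest_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                problems.append((line_no, "invalid JSON"))
                continue
            if not isinstance(entry, dict):
                problems.append((line_no, "entry is not a JSON object"))
                continue
            text = entry.get("text")
            audio = entry.get("audio")
            if not isinstance(text, str) or not text.strip():
                problems.append((line_no, "missing or empty text"))
            if not isinstance(audio, str) or not (audio_root / audio).is_file():
                problems.append((line_no, "missing audio file"))
    return problems
```

An empty result only means the entries are structurally sound. It does not verify that each transcript matches its recording.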

Resuming Training

Load optimizer.pth and scheduler.pth alongside lora_weights.safetensors in the VoxCPM LoRA fine-tuning script to continue from step 2000. If resuming, consider lowering the learning rate to stabilize further fine-tuning.
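
One way to lower the learning rate on resume is to scale it inside the saved optimizer state before handing that state back to the optimizer. The sketch below assumes optimizer.pth holds a standard PyTorch optimizer state dict (a mapping with a `param_groups` list), which this card does not confirm. `scale_learning_rate` is a hypothetical helper that works on the plain dictionary.

```python
import copy

def scale_learning_rate(optimizer_state, factor):
    """Return a copy of an optimizer state dict with each group's learning rate scaled."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    scaled = copy.deepcopy(optimizer_state)
    for group in scaled.get("param_groups", []):
        group["lr"] = group["lr"] * factor
        # Schedulers often derive the schedule from initial_lr, so keep it in step
        if "initial_lr" in group:
            group["initial_lr"] = group["initial_lr"] * factor
    return scaled
```

Under that assumption, the state would be read with `torch.load(ckpt_dir / "optimizer.pth", map_location="cpu")`, scaled, and passed to the optimizer's `load_state_dict`. A restored scheduler may recompute the rate from its own saved base values, so changing the learning rate in the training script's configuration may be the more reliable route.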
