VoxCPM 1.5 LoRA Fine-Tune: Tech Vocabulary
Recipe: voxcpm-recipe
LoRA adapter fine-tuned on top of openbmb/VoxCPM1.5, focused on getting the model to speak tech-heavy text correctly: code-ish phrases, symbols, abbreviations, numbers, and domain jargon that base TTS models often mangle.
Checkpoint Info
| Field | Value |
|---|---|
| Run | lora_run1 |
| Checkpoint | latest (step 2000) |
| Base model | openbmb/VoxCPM1.5 |
| Method | LoRA (rank 16, alpha 32) |
| Dataset size | 100 utterances |
This is a full training checkpoint: it contains the LoRA adapter weights and the optimizer/scheduler state, so you can either run inference directly or resume training.
Files
- lora_weights.safetensors: LoRA adapter weights (load this for inference)
- lora_config.json: LoRA hyperparameters and target modules
- optimizer.pth: optimizer state (for resuming training)
- scheduler.pth: LR scheduler state (for resuming training)
LoRA Config
| Param | Value |
|---|---|
| rank (r) | 16 |
| alpha | 32 |
| dropout | 0.1 |
| enable_lm | true |
| enable_dit | true |
| enable_proj | false |
Target modules (LM & DiT): q_proj, v_proj, k_proj, o_proj
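To put the rank and alpha values in context, the following is a minimal sketch of the arithmetic from the standard LoRA formulation, where the low-rank update is scaled by alpha / r. Whether VoxCPM's LoRA layers apply exactly this scaling is an assumption here, and `HYPOTHETICAL_HIDDEN` is an illustrative layer width, not a dimension taken from VoxCPM.

```python
def lora_scale(alpha: float, rank: int) -> float:
    """Scaling applied to the low-rank update in standard LoRA: alpha / r."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    return alpha / rank


def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added to one linear layer.

    Standard LoRA adds A with shape (rank, d_in) and B with shape (d_out, rank).
    """
    return rank * (d_in + d_out)


# Values from the LoRA Config table above.
scale = lora_scale(alpha=32, rank=16)

# Illustrative width only (assumption): not VoxCPM's actual hidden size.
HYPOTHETICAL_HIDDEN = 1024
per_layer = lora_extra_params(HYPOTHETICAL_HIDDEN, HYPOTHETICAL_HIDDEN, rank=16)
full_layer = HYPOTHETICAL_HIDDEN * HYPOTHETICAL_HIDDEN

print(f"scale = {scale}")
print(f"adapter params per projection: {per_layer} (full matrix: {full_layer})")
```

With rank 16 and alpha 32 the scale works out to 2.0, and the adapter for a single square projection of the assumed width is a small fraction of the full weight matrix.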
Inference
1. Install
pip install voxcpm
# or, from source: https://github.com/OpenBMB/VoxCPM
2. Download this LoRA checkpoint
from huggingface_hub import snapshot_download
lora_dir = snapshot_download("jeevav62/voxcpm-lora-finetune")
# contains: lora_weights.safetensors, lora_config.json, optimizer.pth, scheduler.pth
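Before loading anything, it can help to confirm that the download is complete. The sketch below checks for the four file names listed in this card; `missing_files` is a hypothetical helper written for this example, not part of huggingface_hub or VoxCPM.

```python
from pathlib import Path

# File names as listed in the Files section of this card.
EXPECTED_FILES = (
    "lora_weights.safetensors",
    "lora_config.json",
    "optimizer.pth",
    "scheduler.pth",
)


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]


# Usage, assuming lora_dir comes from the snapshot_download(...) call above:
# missing = missing_files(lora_dir)
# if missing:
#     raise FileNotFoundError(f"checkpoint is incomplete, missing: {missing}")
```

For inference only, the optimizer and scheduler files are not needed, so a narrower `expected` tuple can be passed in that case.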
3. Load base model + LoRA adapter and generate
import json
import soundfile as sf
from pathlib import Path
from voxcpm import VoxCPM
from voxcpm.modules.layers.lora import LoRAConfig
ckpt_dir = Path(lora_dir)
# Read base_model + lora hyperparameters straight from lora_config.json
with open(ckpt_dir / "lora_config.json") as f:
    lora_info = json.load(f)
base_model = lora_info["base_model"]  # openbmb/VoxCPM1.5 snapshot
lora_cfg = LoRAConfig(**lora_info["lora_config"])
model = VoxCPM.from_pretrained(
    hf_model_id="openbmb/VoxCPM1.5",  # base weights
    load_denoiser=False,
    optimize=True,
    lora_config=lora_cfg,
    lora_weights_path=str(ckpt_dir),  # this repo's checkpoint dir
)
wav = model.generate(
    text="The sensor updates several times per second, e.g. 60.5 readings/sec on channel #3.",
    prompt_wav_path=None,  # optional reference WAV for voice cloning
    prompt_text=None,
    cfg_value=2.0,
    inference_timesteps=10,
    normalize=False,
    denoise=False,
)
sf.write("output.wav", wav, model.tts_model.sample_rate)
4. (Optional) A/B compare with/without the adapter
# With LoRA (default, as loaded above)
wav_lora = model.generate(text="...")
# Disable adapter to hear the base model
model.set_lora_enabled(False)
wav_base = model.generate(text="...")
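Because the snippet above leaves the adapter switched off after the comparison, a small context manager can make the A/B check harder to get wrong. This is a sketch that relies only on the `set_lora_enabled` call shown above; it assumes that passing `True` re-enables the adapter and that the adapter was enabled beforehand. `lora_disabled` is a hypothetical helper, not part of VoxCPM.

```python
from contextlib import contextmanager


@contextmanager
def lora_disabled(model):
    """Temporarily disable the adapter, re-enabling it even if generation fails."""
    model.set_lora_enabled(False)
    try:
        yield model
    finally:
        # Assumption: True restores the adapter that was active before.
        model.set_lora_enabled(True)


# Usage with the `model` loaded in step 3:
# with lora_disabled(model):
#     wav_base = model.generate(text="...")
# wav_lora = model.generate(text="...")  # adapter is active again here
```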
What It Handles Well
- Reads tech-flavored sentences naturally (sensor readings, training/session jargon, "per second", "hour", numeric phrases)
- Pronounces recurring technical terms consistently across the dataset
- Stable voice identity across long technical sentences
Training History: Mistakes & Recovery
This run was checked manually against generated samples partway through training:
- Early checkpoints mispronounced or garbled symbols like "#" and "." (e.g., reading punctuation literally or dropping it inside technical phrases instead of treating it as a pause/silence or a known symbol name).
- These mistakes were caught during a mid-run listening check, and training continued past them. By the later steps (close to step 2000 / latest) the model had recovered and produced cleaner, more natural renderings of sentences containing "#" and ".".
- Takeaway: don't judge a LoRA TTS run from early checkpoints alone; symbol/punctuation handling can still be in flux and improve with more steps on the same small dataset.
Dataset
100 short technical/conversational utterances, each paired with a WAV recording, e.g.:
{"audio": "audio1.wav", "text": "The training session lasted for one hour, and after an hour we reviewed the results together."}
{"audio": "audio2.wav", "text": "The sensor updates several times per second, and the readings per second must remain stable during testing."}
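The records above look like JSON Lines, one object per line with "audio" and "text" keys. Assuming that layout (the manifest's actual file name is not given in this card, so `train.jsonl` below is a placeholder), a minimal sketch for parsing and filtering the manifest could look like this. `parse_manifest` and `containing` are hypothetical helpers for illustration.

```python
import json


def parse_manifest(lines):
    """Parse JSON Lines records into (audio, text) pairs."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        missing = {"audio", "text"} - entry.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        records.append((entry["audio"], entry["text"]))
    return records


def containing(records, needle):
    """Return the records whose text contains `needle`, e.g. a recurring term."""
    return [(audio, text) for audio, text in records if needle in text]


# Usage, with a placeholder file name:
# with open("train.jsonl", encoding="utf-8") as f:
#     records = parse_manifest(f)
# print(len(records), "utterances;", len(containing(records, "per second")), "mention 'per second'")
```

Filtering like this is one way to pick probe sentences for listening checks on terms that recur across the dataset.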
Resuming Training
Load optimizer.pth and scheduler.pth alongside lora_weights.safetensors in the VoxCPM LoRA fine-tuning script to continue from step 2000. If resuming, consider lowering the learning rate to stabilize further fine-tuning.
Credits
- Base model: openbmb/VoxCPM1.5
- Training/inference recipe: VoxCPM