ViZipVoice

Vietnamese zero-shot TTS / voice cloning fine-tuned from ZipVoice.

The wrapper automatically loads the checkpoint-<step>.pt file with the highest step number and normalizes Vietnamese text with soe-vinorm.
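As a minimal sketch of how picking the largest checkpoint could work, assuming the wrapper compares the numeric step in each file name. The function name `latest_checkpoint` is hypothetical and is not part of the wrapper's API:

```python
import re
from pathlib import Path
from typing import Optional

# Matches names like "checkpoint-920000.pt" and captures the step number.
_CKPT_RE = re.compile(r"^checkpoint-(\d+)\.pt$")


def latest_checkpoint(model_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step>.pt file with the highest step, or None.

    Hypothetical helper: the real wrapper may select checkpoints differently.
    """
    best_step = -1
    best_path = None
    for path in Path(model_dir).iterdir():
        match = _CKPT_RE.match(path.name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = path
    return best_path
```

Comparing the step as an integer rather than as a string matters when step counts have different numbers of digits: a string comparison would rank checkpoint-99999.pt above checkpoint-700000.pt.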

Audio Demo

Generated with checkpoint-700000.pt, the current wrapper flow, and the demo text in demo/demo_text.txt.

Đinh-Quyết

Open audio

Nhã-Uyên

Open audio

MC

Open audio

Install

git clone https://github.com/iamdinhthuan/ViZipvoice.git
cd ViZipvoice
pip install -r requirements.txt
export PYTHONPATH="$PWD:$PYTHONPATH"

CLI

python3 -m zipvoice.bin.infer_vizipvoice \
  --prompt-wav prompt.wav \
  --prompt-text "Xin chào, đây là giọng mẫu của tôi." \
  --text "ViZipVoice có thể tổng hợp giọng nói tiếng Việt từ một đoạn mẫu ngắn." \
  --res-wav-path output.wav

By default, the CLI downloads this model repo. To use a local copy instead, download the files and pass --model-dir models/ViZipvoice.

Python

from zipvoice.vizipvoice import ViZipVoiceTTS

tts = ViZipVoiceTTS()
metrics = tts.synthesize(
    prompt_wav="prompt.wav",
    prompt_text="Xin chào, đây là giọng mẫu của tôi.",
    text="Đây là câu tiếng Việt được sinh bởi ViZipVoice.",
    output_path="output.wav",
)
print(metrics)

Reference Audio

audio/ contains 30 reference prompts. Each audio file has a sidecar .txt transcript with the same basename:

audio/Đinh-Quyết.mp3
audio/Đinh-Quyết.txt

File names keep only the speaker name; the original lar_* prefix and Pro suffix are removed. The Gradio app reads this sidecar format automatically.

To run the Gradio app locally, download the model repo and launch the app:

huggingface-cli download contextboxai/ViZipvoice \
  --local-dir models/ViZipvoice \
  --local-dir-use-symlinks False

python3 egs/zipvoice/gradio_app.py --exp-dir models/ViZipvoice
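Outside the Gradio app, the sidecar layout in audio/ can be read with a few lines of standard-library Python. This is a sketch under stated assumptions: `load_reference_prompts` is a hypothetical helper, not a function from this repo, and the set of accepted audio suffixes is a guess (only .mp3 appears above).

```python
from pathlib import Path
from typing import Dict

# Assumption: only .mp3 is shown in the repo; the other suffixes are illustrative.
AUDIO_SUFFIXES = {".mp3", ".wav", ".flac"}


def load_reference_prompts(audio_dir: str) -> Dict[str, dict]:
    """Map each speaker name (file basename) to its audio path and transcript."""
    prompts = {}
    for audio_path in sorted(Path(audio_dir).iterdir()):
        if audio_path.suffix.lower() not in AUDIO_SUFFIXES:
            continue
        # The transcript shares the basename and uses the .txt extension.
        transcript_path = audio_path.with_suffix(".txt")
        if not transcript_path.exists():
            continue  # skip audio files without a sidecar transcript
        prompts[audio_path.stem] = {
            "prompt_wav": str(audio_path),
            "prompt_text": transcript_path.read_text(encoding="utf-8").strip(),
        }
    return prompts
```

Each entry carries the two values that `tts.synthesize` takes as `prompt_wav` and `prompt_text` in the Python example, assuming the wrapper accepts the prompt's audio format.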

Inference Flow

The CLI, Python wrapper, and Gradio app use the same default flow:

  • normalize Vietnamese text with soe-vinorm, then clean spaces around punctuation;
  • split long text into sentences;
  • for a 1-word sentence: use at least 24 steps and speed=0.6;
  • for a 2-4 word sentence: use speed=0.8;
  • generate each segment separately;
  • merge segments with silence, crossfade, fade in, and fade out.
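The sentence-splitting and short-sentence rules above can be sketched as follows. This is an illustration, not the wrapper's code: `split_sentences` and `segment_settings` are hypothetical names, the splitting regex is a simple stand-in, and the sketch assumes the listed speed values replace the caller's speed rather than scale it.

```python
import re
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Split after sentence-ending punctuation, keeping the punctuation."""
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    return [p for p in parts if p]


def segment_settings(sentence: str, num_steps: int, speed: float) -> Tuple[int, float]:
    """Return (steps, speed) for one sentence, given the caller's defaults."""
    n_words = len(sentence.split())
    if n_words == 1:
        # One-word sentence: at least 24 steps, slower speed.
        return max(num_steps, 24), 0.6
    if 2 <= n_words <= 4:
        # Short sentence: keep the step count, slow down slightly.
        return num_steps, 0.8
    return num_steps, speed


# Illustrative defaults only; the wrapper's actual defaults are not stated here.
for sentence in split_sentences("Vâng. Xin chào bạn. Đây là một câu dài hơn bốn từ."):
    print(sentence, segment_settings(sentence, num_steps=16, speed=1.0))
```

Each sentence would then be synthesized separately with its own settings before the segments are merged.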

Useful knobs:

--no-vietnamese-normalize
--no-split-sentences
--crossfade-ms 80
--silence-ms 180
--fade-in-ms 20
--fade-out-ms 80
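The millisecond values above translate to sample counts at the output sample rate. The sketch below shows that conversion and a plain linear crossfade between two segments. It assumes a 24000 Hz sample rate and a linear ramp; how the wrapper actually combines silence, crossfade, and fades is not specified here, and both function names are hypothetical.

```python
from typing import List, Sequence


def ms_to_samples(ms: float, sample_rate: int) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(ms * sample_rate / 1000))


def crossfade(a: Sequence[float], b: Sequence[float], overlap: int) -> List[float]:
    """Join two segments, linearly blending the last `overlap` samples of `a`
    with the first `overlap` samples of `b`."""
    overlap = min(overlap, len(a), len(b))
    if overlap <= 0:
        return list(a) + list(b)
    start = len(a) - overlap
    mixed = []
    for i in range(overlap):
        t = (i + 1) / (overlap + 1)  # weight of `b`, rising toward 1
        mixed.append(a[start + i] * (1 - t) + b[i] * t)
    return list(a[:start]) + mixed + list(b[overlap:])


SAMPLE_RATE = 24000  # assumption, not taken from the model config
print(ms_to_samples(80, SAMPLE_RATE))   # crossfade length in samples
print(ms_to_samples(180, SAMPLE_RATE))  # silence length in samples
```

A crossfade shortens the joined audio by the overlap length, which is one reason a separate silence setting is useful for keeping a natural pause between sentences.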

Files

  • checkpoint-920000.pt: latest FP16 checkpoint
  • checkpoint-700000.pt: earlier FP16 checkpoint, used to generate the current demo audio
  • config.json, model.json: model config
  • tokens.txt: Vietnamese character tokenizer
  • audio/: 30 reference audio files, each with a .txt transcript
  • demo/: regenerated audio demos and metadata.json
  • vizipvoice.py: wrapper mirrored from GitHub

Responsible Use

This model can clone voices from short audio prompts. Use only voices you own or have explicit permission to use. Do not use it for impersonation, fraud, harassment, misinformation, or other harmful content.

License

Apache License 2.0. Please also credit the original ZipVoice project.
