ViZipVoice

Vietnamese zero-shot TTS / voice cloning fine-tuned from ZipVoice.

GitHub: https://github.com/iamdinhthuan/ViZipvoice
Model repo: https://huggingface.co/contextboxai/ViZipvoice
Space: https://huggingface.co/spaces/dinhthuan/ViZipvoice
Latest checkpoint: checkpoint-920000.pt, FP16 inference state dict
Training data: about 7000 total hours, including roughly 6500 hours of Vietnamese and 500 hours of English
Tokenizer: SimpleTokenizer, character-level, 244 tokens
Sample rate: 24 kHz
Default vocoder: charactr/vocos-mel-24khz

The wrapper automatically loads the checkpoint-<step>.pt file with the highest step number and normalizes Vietnamese text with soe-vinorm.
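
As a rough illustration of the checkpoint selection, here is a minimal sketch. It assumes the checkpoints sit directly in the model directory and that "largest" means the highest step number in the file name. find_latest_checkpoint is a hypothetical helper, not the wrapper's actual function.

```python
import re
from pathlib import Path
from typing import Optional

# Matches file names such as "checkpoint-920000.pt" and captures the step.
_CKPT_RE = re.compile(r"^checkpoint-(\d+)\.pt$")


def find_latest_checkpoint(model_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None if absent.

    Hypothetical helper: the real wrapper may select checkpoints differently.
    """
    best_step = -1
    best_path = None
    for path in Path(model_dir).iterdir():
        match = _CKPT_RE.match(path.name)
        # Compare steps as integers so "99" does not outrank "920000".
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = path
    return best_path
```

Comparing the steps as integers matters here: a plain string sort would rank checkpoint-99.pt above checkpoint-920000.pt.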

Audio Demo

Generated with checkpoint-700000.pt, the current wrapper flow, and the demo text in demo/demo_text.txt.

Demo voices: Đinh-Quyết, Nhã-Uyên

Install

git clone https://github.com/iamdinhthuan/ViZipvoice.git
cd ViZipvoice
pip install -r requirements.txt
export PYTHONPATH="$PWD:$PYTHONPATH"

CLI

python3 -m zipvoice.bin.infer_vizipvoice \
  --prompt-wav prompt.wav \
  --prompt-text "Xin chào, đây là giọng mẫu của tôi." \
  --text "ViZipVoice có thể tổng hợp giọng nói tiếng Việt từ một đoạn mẫu ngắn." \
  --res-wav-path output.wav

The CLI downloads this model repo by default. To run from local files instead, download them first and pass --model-dir models/ViZipvoice.

Python

from zipvoice.vizipvoice import ViZipVoiceTTS

tts = ViZipVoiceTTS()
metrics = tts.synthesize(
    prompt_wav="prompt.wav",
    prompt_text="Xin chào, đây là giọng mẫu của tôi.",
    text="Đây là câu tiếng Việt được sinh bởi ViZipVoice.",
    output_path="output.wav",
)
print(metrics)

Reference Audio

audio/ contains 30 reference prompts. Each audio file has a sidecar .txt transcript with the same basename:

audio/Đinh-Quyết.mp3
audio/Đinh-Quyết.txt

File names keep only the speaker's name; the original lar_* prefix and Pro suffix have been removed. The Gradio app reads this sidecar format automatically.
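
The sidecar convention can be read with a few lines of Python. The sketch below is an assumption about how such pairing might be done, not the Gradio app's actual code: load_reference_prompts and the AUDIO_EXTENSIONS set are hypothetical, and only .mp3 is confirmed by the example above.

```python
from pathlib import Path
from typing import Dict, Tuple

# Assumed set of audio suffixes; the repo example only shows .mp3.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac"}


def load_reference_prompts(audio_dir: str) -> Dict[str, Tuple[Path, str]]:
    """Map each speaker name to (audio path, transcript text).

    Audio files without a matching .txt sidecar are skipped.
    """
    prompts: Dict[str, Tuple[Path, str]] = {}
    for audio_path in sorted(Path(audio_dir).iterdir()):
        if audio_path.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        # The transcript shares the basename and uses the .txt suffix.
        transcript_path = audio_path.with_suffix(".txt")
        if not transcript_path.is_file():
            continue
        text = transcript_path.read_text(encoding="utf-8").strip()
        prompts[audio_path.stem] = (audio_path, text)
    return prompts
```

With the layout above, load_reference_prompts("audio") would return one entry per speaker, keyed by the file's basename, whose values could be passed as prompt_wav and prompt_text.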

huggingface-cli download contextboxai/ViZipvoice \
  --local-dir models/ViZipvoice \
  --local-dir-use-symlinks False

python3 egs/zipvoice/gradio_app.py --exp-dir models/ViZipvoice

Inference Flow

The CLI, Python wrapper, and Gradio app use the same default flow:

normalize Vietnamese text with soe-vinorm, then clean spaces around punctuation;
split long text into sentences;
for a 1-word sentence: use at least 24 steps and speed=0.6;
for a 2-4 word sentence: use speed=0.8;
generate each segment separately;
merge segments with silence, crossfade, fade in, and fade out.
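
The text-side steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the wrapper's code: the function names are hypothetical, the regular expressions are a simple guess at "clean spaces around punctuation" and sentence splitting, and the default step count and default speed are caller-supplied placeholders because the source does not state them. soe-vinorm normalization is left out.

```python
import re
from typing import List, Tuple


def clean_punctuation_spacing(text: str) -> str:
    """Collapse repeated whitespace and drop spaces before punctuation."""
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s+([,.!?;:])", r"\1", text)


def split_sentences(text: str) -> List[str]:
    """Split after sentence-ending punctuation that is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part for part in parts if part]


def segment_settings(
    sentence: str, default_steps: int, default_speed: float = 1.0
) -> Tuple[int, float]:
    """Return (steps, speed) for one sentence, following the rules listed above.

    default_steps and default_speed are placeholders chosen by the caller.
    """
    word_count = len(sentence.split())
    if word_count == 1:
        # 1-word sentence: at least 24 steps and speed=0.6.
        return max(default_steps, 24), 0.6
    if 2 <= word_count <= 4:
        # 2-4 word sentence: speed=0.8, steps unchanged.
        return default_steps, 0.8
    return default_steps, default_speed
```

For example, cleaning and splitting "Xin chào , bạn .  Vâng !" yields the two sentences "Xin chào, bạn." and "Vâng!", and the one-word sentence is then assigned the slower speed and the higher step floor.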

Useful knobs:

--no-vietnamese-normalize
--no-split-sentences
--crossfade-ms 80
--silence-ms 180
--fade-in-ms 20
--fade-out-ms 80
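
To make the four timing knobs concrete, here is one possible reading of the merge step in plain Python. It is a sketch under assumptions: the source does not say how silence and crossfade are combined, so this version appends the silence gap to the audio so far, crossfades the next segment over the tail of that gap, and applies fade in and fade out once to the final result. All function names are hypothetical, and the default values are simply the ones shown in the flag list.

```python
from typing import List, Sequence


def ms_to_samples(ms: float, sample_rate: int) -> int:
    """Convert a duration in milliseconds to a sample count."""
    return int(sample_rate * ms / 1000)


def apply_fades(samples: Sequence[float], fade_in: int, fade_out: int) -> List[float]:
    """Linearly ramp the first fade_in and the last fade_out samples."""
    out = list(samples)
    n = len(out)
    fade_in = min(fade_in, n)
    fade_out = min(fade_out, n)
    for i in range(fade_in):
        out[i] *= i / fade_in
    for i in range(fade_out):
        out[n - 1 - i] *= i / fade_out
    return out


def crossfade_join(left: Sequence[float], right: Sequence[float], overlap: int) -> List[float]:
    """Join two signals, linearly blending the last/first `overlap` samples."""
    overlap = min(overlap, len(left), len(right))
    if overlap == 0:
        return list(left) + list(right)
    start = len(left) - overlap
    mixed = []
    for i in range(overlap):
        weight = (i + 1) / (overlap + 1)  # weight of the incoming signal
        mixed.append(left[start + i] * (1 - weight) + right[i] * weight)
    return list(left[:start]) + mixed + list(right[overlap:])


def merge_segments(
    segments: Sequence[Sequence[float]],
    sample_rate: int = 24000,
    silence_ms: float = 180,
    crossfade_ms: float = 80,
    fade_in_ms: float = 20,
    fade_out_ms: float = 80,
) -> List[float]:
    """Merge per-sentence segments into one signal (assumed ordering of steps)."""
    silence = [0.0] * ms_to_samples(silence_ms, sample_rate)
    overlap = ms_to_samples(crossfade_ms, sample_rate)
    merged: List[float] = []
    for index, segment in enumerate(segments):
        if index == 0:
            merged = list(segment)
        else:
            merged = crossfade_join(merged + silence, segment, overlap)
    return apply_fades(
        merged,
        ms_to_samples(fade_in_ms, sample_rate),
        ms_to_samples(fade_out_ms, sample_rate),
    )
```

Under this reading, each join adds the silence length minus the crossfade length to the total duration, so a longer --crossfade-ms shortens the audible pause set by --silence-ms.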

Files

checkpoint-920000.pt: latest FP16 checkpoint
checkpoint-700000.pt: previous FP16 checkpoint used for the current demo audios
config.json, model.json: model config
tokens.txt: Vietnamese character tokenizer
audio/: 30 reference audios plus .txt transcripts
demo/: regenerated audio demos and metadata.json
vizipvoice.py: wrapper mirrored from GitHub

Responsible Use

This model can clone voices from short audio prompts. Use only voices you own or have explicit permission to use. Do not use it for impersonation, fraud, harassment, misinformation, or other harmful content.

License

Apache License 2.0. Please also credit the original ZipVoice project.


Base model: k2-fsa/ZipVoice