ViZipVoice
Vietnamese zero-shot TTS / voice cloning fine-tuned from ZipVoice.
- GitHub: https://github.com/iamdinhthuan/ViZipvoice
- Model repo: https://huggingface.co/contextboxai/ViZipvoice
- Space: https://huggingface.co/spaces/dinhthuan/ViZipvoice
- Latest checkpoint: checkpoint-920000.pt, FP16 inference state dict
- Training data: about 7000 total hours, including roughly 6500 hours of Vietnamese and 500 hours of English
- Tokenizer: SimpleTokenizer, character-level, 244 tokens
- Sample rate: 24 kHz
- Default vocoder: charactr/vocos-mel-24khz
The wrapper automatically loads the checkpoint-<step>.pt file with the highest step number and uses soe-vinorm for Vietnamese text normalization.
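Picking a checkpoint by step number can be sketched as follows. This is a minimal illustration, not the wrapper's actual code: the function name `find_latest_checkpoint` is hypothetical, and the sketch assumes the checkpoints sit directly in the model directory. It compares steps as integers, since a plain string sort would rank `checkpoint-99.pt` above `checkpoint-920000.pt`.

```python
import re
from pathlib import Path
from typing import Optional

# Matches file names such as checkpoint-920000.pt and captures the step.
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)\.pt$")


def find_latest_checkpoint(model_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step>.pt file with the highest step, or None.

    Hypothetical helper for illustration; the real wrapper may differ.
    """
    best_step = -1
    best_path: Optional[Path] = None
    for path in Path(model_dir).iterdir():
        match = CHECKPOINT_RE.match(path.name)
        if match is None:
            continue  # not a step checkpoint (e.g. config.json)
        step = int(match.group(1))  # numeric comparison, not lexicographic
        if step > best_step:
            best_step = step
            best_path = path
    return best_path
```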
Audio Demo
Generated with checkpoint-700000.pt, the current wrapper flow, and the demo text in demo/demo_text.txt.
Đinh-Quyết
Nhã-Uyên
MC
Install
git clone https://github.com/iamdinhthuan/ViZipvoice.git
cd ViZipvoice
pip install -r requirements.txt
export PYTHONPATH="$PWD:$PYTHONPATH"
CLI
python3 -m zipvoice.bin.infer_vizipvoice \
--prompt-wav prompt.wav \
--prompt-text "Xin chào, đây là giọng mẫu của tôi." \
--text "ViZipVoice có thể tổng hợp giọng nói tiếng Việt từ một đoạn mẫu ngắn." \
--res-wav-path output.wav
The CLI downloads this model repo by default. To run from a local copy instead, download the files first and pass --model-dir models/ViZipvoice.
Python
from zipvoice.vizipvoice import ViZipVoiceTTS
tts = ViZipVoiceTTS()
metrics = tts.synthesize(
prompt_wav="prompt.wav",
prompt_text="Xin chào, đây là giọng mẫu của tôi.",
text="Đây là câu tiếng Việt được sinh bởi ViZipVoice.",
output_path="output.wav",
)
print(metrics)
Reference Audio
audio/ contains 30 reference prompts. Each audio file has a sidecar .txt transcript with the same basename:
audio/Đinh-Quyết.mp3
audio/Đinh-Quyết.txt
File names keep only the speaker name; the original lar_* prefix and Pro suffix have been removed. The Gradio app reads this sidecar format automatically.
huggingface-cli download contextboxai/ViZipvoice \
--local-dir models/ViZipvoice \
--local-dir-use-symlinks False
python3 egs/zipvoice/gradio_app.py --exp-dir models/ViZipvoice
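For use outside the Gradio app, the sidecar layout described above can be read with a few lines of Python. This is a sketch under assumptions: `load_reference_prompts` is a hypothetical helper, not part of the repository, and the set of audio extensions is a guess (the listing above shows only .mp3).

```python
from pathlib import Path
from typing import Dict, Tuple

# Assumption: only .mp3 appears in the model card; other formats are a guess.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac"}


def load_reference_prompts(audio_dir: str) -> Dict[str, Tuple[Path, str]]:
    """Map each speaker name to (audio path, transcript text).

    Hypothetical helper: pairs every audio file with the .txt file that
    shares its basename and skips audio files that have no transcript.
    """
    prompts: Dict[str, Tuple[Path, str]] = {}
    for audio_path in sorted(Path(audio_dir).iterdir()):
        if audio_path.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        transcript_path = audio_path.with_suffix(".txt")
        if not transcript_path.is_file():
            continue  # no sidecar transcript, so the prompt is unusable
        text = transcript_path.read_text(encoding="utf-8").strip()
        prompts[audio_path.stem] = (audio_path, text)
    return prompts
```

Each returned pair could then be passed as the prompt audio and prompt transcript of a synthesis call.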
Inference Flow
The CLI, Python wrapper, and Gradio app use the same default flow:
- normalize Vietnamese text with soe-vinorm, then clean spaces around punctuation;
- split long text into sentences;
- for a 1-word sentence: use at least 24 steps and speed=0.6;
- for a 2-4 word sentence: use speed=0.8;
- generate each segment separately;
- merge segments with silence, crossfade, fade in, and fade out.
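The sentence splitting and the short-sentence rules in the list above can be sketched as follows. The function names, the splitting regex, and the way words are counted (whitespace-separated tokens) are assumptions made for illustration; only the thresholds (24 steps, speed 0.6, speed 0.8) come from the list. The caller supplies the regular step count and speed, because their defaults are not stated here.

```python
import re
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Split after sentence-ending punctuation (assumed rule, not the wrapper's)."""
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    return [part for part in parts if part]


def segment_settings(sentence: str, num_steps: int, speed: float) -> Tuple[int, float]:
    """Return (steps, speed) for one sentence, slowing down very short ones."""
    n_words = len(sentence.split())
    if n_words == 1:
        # 1-word sentence: at least 24 steps and speed 0.6
        return max(num_steps, 24), 0.6
    if 2 <= n_words <= 4:
        # 2-4 word sentence: keep the step count, use speed 0.8
        return num_steps, 0.8
    return num_steps, speed


if __name__ == "__main__":
    # 16 steps and speed 1.0 are placeholder inputs, not documented defaults.
    for sentence in split_sentences("Vâng! Xin chào. Đây là một câu dài hơn nhiều."):
        print(sentence, segment_settings(sentence, num_steps=16, speed=1.0))
```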
Useful knobs:
--no-vietnamese-normalize
--no-split-sentences
--crossfade-ms 80
--silence-ms 180
--fade-in-ms 20
--fade-out-ms 80
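To give a feel for what the four timing knobs control, here is a pure-Python sketch of one way to merge segments at 24 kHz. How the wrapper actually combines silence and crossfade is not stated here, so the order used below (crossfade into a silent gap, crossfade out of it, then fade the whole result in and out) is an assumption, as are all function names. The millisecond values are the ones shown in the knob list.

```python
from typing import List, Sequence

SAMPLE_RATE = 24_000  # sample rate stated in the model card


def ms_to_samples(ms: float, sample_rate: int = SAMPLE_RATE) -> int:
    return int(round(sample_rate * ms / 1000.0))


def crossfade_append(left: List[float], right: Sequence[float], overlap: int) -> List[float]:
    """Append right to left, linearly blending `overlap` samples at the joint."""
    overlap = min(overlap, len(left), len(right))
    if overlap == 0:
        return left + list(right)
    start = len(left) - overlap
    mixed = []
    for i in range(overlap):
        weight = (i + 1) / (overlap + 1)  # ramps from near 0 to near 1
        mixed.append(left[start + i] * (1.0 - weight) + right[i] * weight)
    return left[:start] + mixed + list(right[overlap:])


def apply_fades(samples: Sequence[float], fade_in: int, fade_out: int) -> List[float]:
    """Apply a linear fade in at the start and a linear fade out at the end."""
    out = list(samples)
    n = len(out)
    fade_in, fade_out = min(fade_in, n), min(fade_out, n)
    for i in range(fade_in):
        out[i] *= i / fade_in
    for i in range(fade_out):
        out[n - 1 - i] *= i / fade_out
    return out


def merge_segments(
    segments: Sequence[Sequence[float]],
    crossfade_ms: float = 80,
    silence_ms: float = 180,
    fade_in_ms: float = 20,
    fade_out_ms: float = 80,
    sample_rate: int = SAMPLE_RATE,
) -> List[float]:
    """Join segments with a silent gap, crossfading at both edges of the gap."""
    if not segments:
        return []
    overlap = ms_to_samples(crossfade_ms, sample_rate)
    silence = [0.0] * ms_to_samples(silence_ms, sample_rate)
    merged = list(segments[0])
    for segment in segments[1:]:
        merged = crossfade_append(merged, silence, overlap)
        merged = crossfade_append(merged, segment, overlap)
    return apply_fades(
        merged,
        ms_to_samples(fade_in_ms, sample_rate),
        ms_to_samples(fade_out_ms, sample_rate),
    )
```

A real implementation would normally operate on NumPy arrays or tensors; plain lists are used here only to keep the sketch dependency-free.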
Files
- checkpoint-920000.pt: latest FP16 checkpoint
- checkpoint-700000.pt: previous FP16 checkpoint used for the current demo audio
- config.json, model.json: model config
- tokens.txt: Vietnamese character tokenizer
- audio/: 30 reference audio files plus .txt transcripts
- demo/: regenerated audio demos and metadata.json
- vizipvoice.py: wrapper mirrored from GitHub
Responsible Use
This model can clone voices from short audio prompts. Use only voices you own or have explicit permission to use. Do not use it for impersonation, fraud, harassment, misinformation, or other harmful content.
License
Apache License 2.0. Please also credit the original ZipVoice project.