🦜 VieNeu-TTS v3 Turbo
Overview
VieNeu-TTS v3 Turbo is the next generation of Vietnamese TTS — 48 kHz high-fidelity speech, instant voice cloning, built-in multi-speaker default voices, inline emotion cues, and seamless bilingual (En–Vi) code-switching. It is a pure-PyTorch engine running on both GPU and CPU, using the MOSS-Audio-Tokenizer-Nano codec.
Early access. v3 Turbo is released for preview. It is fast and natural, but some features (notably the emotion cues) are still experimental. The full v3 release is coming in the next few weeks.
What's new in v3:
- 48 kHz audio — a big jump in fidelity over v2 (24 kHz).
- Built-in default voices — each default speaker is addressed by a dedicated speaker token + fixed reference, so the voice is stable and consistent with no reference clip needed.
- Emotion / non-verbal cues (experimental) — drop [cười] (laugh), [thở dài] (sigh), or [hắng giọng] (throat clearing) straight into your text.
- Batched generation — synthesize many chunks at once (batch size up to 32), including a multi-speaker conversation mode that batches the whole script regardless of speaker.
- Instant Voice Cloning — clones a voice from just 3–5 seconds of audio (cloning is available from v3 onward; v1/v2 do not support it).
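Batched generation needs the text split into chunks and grouped into batches of at most 32. This README does not show the SDK's batching call, so the following is only a minimal sketch of the preparation step in plain Python. `split_into_chunks` and `batched` are hypothetical helper names, and the sentence-splitting rule is an assumption, not the engine's own chunker.

```python
import re
from typing import Iterator, List

MAX_BATCH_SIZE = 32  # upper bound stated for v3 Turbo batched generation


def split_into_chunks(text: str) -> List[str]:
    """Split text into sentence-like chunks (assumed rule: break after . ! ? or …)."""
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    return [part for part in parts if part]


def batched(chunks: List[str], batch_size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` chunks."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be between 1 and {MAX_BATCH_SIZE}")
    for start in range(0, len(chunks), batch_size):
        yield chunks[start:start + batch_size]


chunks = split_into_chunks("Xin chào! Hôm nay trời đẹp quá. Mình đi dạo nhé?")
batches = list(batched(chunks, batch_size=2))
```

Each batch could then be handed to whatever batched entry point the SDK exposes. That call is left out on purpose because it is not documented here.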
🏗️ Architecture & Credits
The VieNeu-TTS v3 Turbo architecture is an original design by the author, Phạm Nguyễn Ngọc Bảo, and is trained from scratch on ~10,000 hours of English–Vietnamese speech — it is not a fine-tune, distillation, or adaptation of any existing TTS model.
- Model architecture & training: designed and trained from scratch by Phạm Nguyễn Ngọc Bảo — https://github.com/pnnbao97
- Audio codec: MOSS-Audio-Tokenizer-Nano (OpenMOSS-Team) — 48 kHz neural audio codec.
- Phonemizer: sea-g2p — fast Vietnamese/English grapheme-to-phoneme, also by the author.
Author: Phạm Nguyễn Ngọc Bảo
☕ Support This Project
Training high-quality TTS models requires significant GPU resources. If you find this model useful, please consider supporting the development.
🔥 Quick Start (Web UI)
git clone https://github.com/pnnbao97/VieNeu-TTS.git
cd VieNeu-TTS
- Option 1: CPU (minimal, torch-free) — runs v3 Turbo via ONNX
uv sync
- Option 2: GPU — v3 Turbo (PyTorch) + VieNeu-TTS v2 (GPU)
uv sync --group gpu
Start the Web UI:
uv run vieneu-web
In the Web UI, pick "VieNeu-TTS-v3-Turbo (Thử nghiệm)" as the backbone ("Thử nghiệm" means "experimental"). You get a Default voice tab, a Voice Cloning tab, and a Conversation tab (batched multi-speaker podcasts).
📦 Using Python SDK (vieneu)
pip install vieneu
Full Features Guide
from vieneu import Vieneu
from time import time
# Default = v3 Turbo. CPU → ONNX (torch-free); GPU → PyTorch (auto-detected).
tts = Vieneu()
text = """[cười] Trời ơi, cái giọng nó tự nhiên mà nó mượt mà dã man, nghe không khác gì người thật luôn. Giờ thì tha hồ mà quẩy content với cả kho giọng nói đa dạng, đủ mọi sắc thái biểu cảm. Mọi người bật loa lên rồi cùng trải nghiệm thử với mình nhé!"""
start_time = time()
# 1. Default voice (Bình An) — 48 kHz, no reference needed
audio = tts.infer(text)
tts.save(audio, "output.wav")
end_time = time()
print(f"Time taken: {end_time - start_time} seconds")
# 2. Built-in voices by name
for label, voice_id in tts.list_preset_voices():
    print(label, voice_id)
audio = tts.infer("Mình là Xuân Vĩnh nè!", voice="Xuân Vĩnh")
tts.save(audio, "output_Xuân Vĩnh.wav")
# # 3. Emotion / non-verbal cues — EXPERIMENTAL: [cười] [thở dài] [hắng giọng]
# audio = tts.infer("Nghe hay quá đi [cười]. Để mình nói tiếp [hắng giọng].", voice="Ngọc Linh")
# # 4. Instant voice cloning from a 3–5s reference clip
# audio = tts.infer("Đây là giọng được nhân bản tức thì.", ref_audio="my_voice.wav")
A temperature around 0.8 gives the most stable result for v3 Turbo. Higher values add expressiveness but can be less stable.
🎭 Default Voices
Built-in voices — call them by name via voice="<name>", no reference audio required.
| Voice | Gender | Style |
|---|---|---|
| Ngọc Lan (default) | Female | Soft / gentle |
| Ngọc Linh | Female | Bright |
| Trúc Ly | Female | Youthful |
| Mỹ Duyên | Female | Smooth |
| Xuân Vĩnh | Male | Upbeat |
| Thái Sơn | Male | Firm |
| Gia Bảo | Male | Smooth |
| Đức Trí | Male | Clear |
| Trọng Hữu | Male | Knowledgeable |
| Bình An | Male | Even / calm |
For any other voice, use Voice Cloning with a short reference clip (ref_audio="...").
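The choice between a built-in voice and cloning can be made in one place. The following is a minimal sketch that builds keyword arguments for `tts.infer`, using only the `voice=` and `ref_audio=` parameters shown in this README. `voice_kwargs` is a hypothetical helper, and rejecting a call that sets both parameters is an assumption, since the README does not say how the SDK handles that case. The hard-coded list mirrors the table above; `tts.list_preset_voices()` remains the authoritative source.

```python
import unicodedata
from typing import Dict, Optional

# Names copied from the Default Voices table in this README.
PRESET_VOICES = frozenset(
    unicodedata.normalize("NFC", name)
    for name in (
        "Ngọc Lan", "Ngọc Linh", "Trúc Ly", "Mỹ Duyên", "Xuân Vĩnh",
        "Thái Sơn", "Gia Bảo", "Đức Trí", "Trọng Hữu", "Bình An",
    )
)


def voice_kwargs(voice: Optional[str] = None,
                 ref_audio: Optional[str] = None) -> Dict[str, str]:
    """Build keyword arguments selecting a preset voice or a cloning reference."""
    if voice and ref_audio:
        # Assumption: treat the combination as an error rather than guess precedence.
        raise ValueError("pass either voice or ref_audio, not both")
    if ref_audio:
        return {"ref_audio": ref_audio}
    if voice is None:
        return {}  # let the SDK use its default voice
    normalized = unicodedata.normalize("NFC", voice)
    if normalized not in PRESET_VOICES:
        raise ValueError(f"unknown preset voice: {voice!r}; use ref_audio to clone")
    return {"voice": normalized}


preset = voice_kwargs(voice="Thái Sơn")
cloned = voice_kwargs(ref_audio="my_voice.wav")
```

A call would then look like `tts.infer(text, **voice_kwargs(voice="Thái Sơn"))`, which fails early on a misspelled name instead of at synthesis time.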
🔬 Model Variants
| Model | Format | Device | Sample Rate | Quality | Features |
|---|---|---|---|---|---|
| VieNeu-TTS-v3-Turbo | PyTorch | GPU/CPU | 48 kHz | ⭐⭐⭐⭐⭐ | Default voices, Cloning, Emotion cues |
| VieNeu-TTS-v2 | PyTorch | GPU/CPU | 24 kHz | ⭐⭐⭐⭐⭐ | Podcast, En-Vi code-switching |
| VieNeu-TTS-v2 (GGUF) | GGUF Q4 | CPU | 24 kHz | ⭐⭐⭐⭐ | Fastest on CPU, Podcast |
| VieNeu-TTS-v1 | PyTorch | GPU | 24 kHz | ⭐⭐⭐⭐ | Stable (Vi only) |
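The table can also be read as a small lookup. The sketch below encodes only the Format, Device, and Sample Rate columns above and filters variants by device and minimum sample rate. `Variant` and `variants_for` are hypothetical names for illustration and are not part of the SDK.

```python
from typing import List, NamedTuple, Tuple


class Variant(NamedTuple):
    name: str
    fmt: str
    devices: Tuple[str, ...]
    sample_rate_hz: int


# Values transcribed from the Model Variants table in this README.
VARIANTS = [
    Variant("VieNeu-TTS-v3-Turbo", "PyTorch", ("GPU", "CPU"), 48_000),
    Variant("VieNeu-TTS-v2", "PyTorch", ("GPU", "CPU"), 24_000),
    Variant("VieNeu-TTS-v2 (GGUF)", "GGUF Q4", ("CPU",), 24_000),
    Variant("VieNeu-TTS-v1", "PyTorch", ("GPU",), 24_000),
]


def variants_for(device: str, min_sample_rate_hz: int = 0) -> List[str]:
    """Return variant names that run on `device` at or above the given sample rate."""
    wanted = device.upper()
    return [
        variant.name
        for variant in VARIANTS
        if wanted in variant.devices and variant.sample_rate_hz >= min_sample_rate_hz
    ]


cpu_options = variants_for("cpu")
hifi_gpu_options = variants_for("GPU", min_sample_rate_hz=48_000)
```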
📑 Citation
@misc{vieneutts2026,
  title = {VieNeu-TTS v3 Turbo: 48kHz Vietnamese Text-to-Speech with Instant Voice Cloning and Emotion Control},
  author = {Pham Nguyen Ngoc Bao},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/pnnbao-ump/VieNeu-TTS-v3-Turbo}}
}
Made with ❤️ for the Vietnamese TTS community