🦜 VieNeu-TTS v3 Turbo

Overview

VieNeu-TTS v3 Turbo is the next generation of Vietnamese TTS — 48 kHz high-fidelity speech, instant voice cloning, built-in multi-speaker default voices, inline emotion cues, and seamless bilingual (En–Vi) code-switching. It is a pure-PyTorch engine running on both GPU and CPU, using the MOSS-Audio-Tokenizer-Nano codec.

Early access. v3 Turbo is released for preview. It is fast and natural, but some features (notably the emotion cues) are still experimental. The full v3 release is coming in the next few weeks.

What's new in v3:

48 kHz audio: a big jump in fidelity over v2 (24 kHz).

Built-in default voices: each default speaker is addressed by a dedicated speaker token plus a fixed reference, so the voice stays stable and consistent and you do not need to supply a reference clip.

Emotion / non-verbal cues (experimental): drop [cười], [thở dài], [hắng giọng] straight into your text.

Batched generation: synthesize many chunks at once (batch size up to 32), including a multi-speaker conversation mode that batches the whole script regardless of speaker.

Instant Voice Cloning: clones a voice from just 3–5 seconds of audio (cloning is available from v3 onward; v1/v2 do not support it).
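
The batched conversation mode described above can be pictured with a small sketch. It does not call the vieneu API (the batch entry point is not documented in this card); `make_batches` and the `(speaker, text)` script layout are hypothetical, and only the limit of 32 comes from the text. The idea is that a script is flattened into one list of chunks, each tagged with its position, so a batch can mix speakers and the audio can be put back in order afterwards.

```python
from typing import Iterator

MAX_BATCH_SIZE = 32  # upper bound stated for v3 Turbo batched generation


def make_batches(
    script: list[tuple[str, str]], batch_size: int = MAX_BATCH_SIZE
) -> Iterator[list[tuple[int, str, str]]]:
    """Yield batches of (position, speaker, text), ignoring speaker boundaries.

    Hypothetical helper for illustration; not part of the vieneu package.
    """
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be between 1 and {MAX_BATCH_SIZE}")
    # Tag every line with its position so audio can be reassembled in order.
    indexed = [(i, speaker, text) for i, (speaker, text) in enumerate(script)]
    for start in range(0, len(indexed), batch_size):
        yield indexed[start:start + batch_size]


# A 70-line dialogue alternating between two built-in voices.
script = [("Ngọc Linh" if i % 2 == 0 else "Xuân Vĩnh", f"Line {i}") for i in range(70)]
batches = list(make_batches(script))
print([len(b) for b in batches])  # [32, 32, 6]
```

Each batch contains lines from both speakers, which is what "regardless of speaker" implies: the batch boundary depends only on position in the script.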

🏗️ Architecture & Credits

The VieNeu-TTS v3 Turbo architecture is an original design by the author, Phạm Nguyễn Ngọc Bảo, and is trained from scratch on ~10,000 hours of English–Vietnamese speech — it is not a fine-tune, distillation, or adaptation of any existing TTS model.

Model architecture & training: designed and trained from scratch by Phạm Nguyễn Ngọc Bảo — https://github.com/pnnbao97
Audio codec: MOSS-Audio-Tokenizer-Nano (OpenMOSS-Team) — 48 kHz neural audio codec.
Phonemizer: sea-g2p — fast Vietnamese/English grapheme-to-phoneme, also by the author.

Author: Phạm Nguyễn Ngọc Bảo

☕ Support This Project

Training high-quality TTS models requires significant GPU resources. If you find this model useful, please consider supporting its development.

🔥 Quick Start (Web UI)

git clone https://github.com/pnnbao97/VieNeu-TTS.git
cd VieNeu-TTS

Option 1: CPU (minimal, torch-free) — runs v3 Turbo via ONNX

uv sync

Option 2: GPU — v3 Turbo (PyTorch) + VieNeu-TTS v2 (GPU)

uv sync --group gpu

Start the Web UI:

uv run vieneu-web

In the Web UI, pick "VieNeu-TTS-v3-Turbo (Thử nghiệm)" as the backbone (the label "Thử nghiệm" means "Experimental"). You get a Default voice tab, a Voice Cloning tab, and a Conversation tab (batched multi-speaker podcasts).

📦 Using Python SDK (vieneu)

pip install vieneu

Full Features Guide

from time import time

from vieneu import Vieneu

# Default = v3 Turbo. CPU → ONNX (torch-free); GPU → PyTorch (auto-detected).
tts = Vieneu()

text = """[cười] Trời ơi, cái giọng nó tự nhiên mà nó mượt mà dã man, nghe không khác gì người thật luôn. Giờ thì tha hồ mà quẩy content với cả kho giọng nói đa dạng, đủ mọi sắc thái biểu cảm. Mọi người bật loa lên rồi cùng trải nghiệm thử với mình nhé!"""

# 1. Default voice (used when no voice= is given): 48 kHz, no reference needed
start_time = time()
audio = tts.infer(text)
tts.save(audio, "output.wav")
end_time = time()
print(f"Time taken: {end_time - start_time:.2f} seconds")

# 2. Built-in voices by name
for label, voice_id in tts.list_preset_voices():
    print(label, voice_id)
audio = tts.infer("Mình là Xuân Vĩnh nè!", voice="Xuân Vĩnh")
tts.save(audio, "output_Xuân Vĩnh.wav")

# 3. Emotion / non-verbal cues (EXPERIMENTAL): [cười] [thở dài] [hắng giọng]
# audio = tts.infer("Nghe hay quá đi [cười]. Để mình nói tiếp [hắng giọng].", voice="Ngọc Linh")

# 4. Instant voice cloning from a 3–5 s reference clip (uncomment once my_voice.wav exists)
# audio = tts.infer("Đây là giọng được nhân bản tức thì.", ref_audio="my_voice.wav")

A temperature around 0.8 gives the most stable result for v3 Turbo. Higher values add expressiveness but can be less stable.
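
To see why a higher temperature trades stability for variety, consider how temperature rescales a sampling distribution. This is a generic illustration of temperature-scaled softmax, not the model's actual sampler, and the logits are made up for the example.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 0.8, 1.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top candidate, which makes output more repeatable. Higher temperatures spread probability across candidates, which is where both the extra expressiveness and the occasional instability come from.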

🎭 Default Voices

Built-in voices — call them by name via voice="<name>", no reference audio required.

Voice	Gender	Style
Ngọc Lan (default)	Female	Soft / gentle
Ngọc Linh	Female	Bright
Trúc Ly	Female	Youthful
Mỹ Duyên	Female	Smooth
Xuân Vĩnh	Male	Upbeat
Thái Sơn	Male	Firm
Gia Bảo	Male	Smooth
Đức Trí	Male	Clear
Trọng Hữu	Male	Knowledgeable
Bình An	Male	Even / calm

For any other voice, use Voice Cloning with a short reference clip (ref_audio="...").
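
The choice between a built-in voice and cloning can be wrapped in a small helper. This is a sketch under stated assumptions: `PRESET_VOICES` is copied by hand from the table above, `infer_kwargs` is a hypothetical function, and only the `voice=` and `ref_audio=` keyword names come from this card. In real code, `tts.list_preset_voices()` is likely the more reliable source of names.

```python
from typing import Optional

# Built-in voice names, copied by hand from the table in this card.
PRESET_VOICES = {
    "Ngọc Lan", "Ngọc Linh", "Trúc Ly", "Mỹ Duyên", "Xuân Vĩnh",
    "Thái Sơn", "Gia Bảo", "Đức Trí", "Trọng Hữu", "Bình An",
}


def infer_kwargs(voice: Optional[str] = None, ref_audio: Optional[str] = None) -> dict:
    """Build keyword arguments for tts.infer (hypothetical helper)."""
    if ref_audio is not None:
        return {"ref_audio": ref_audio}  # voice cloning from a reference clip
    if voice is None:
        return {}  # no arguments: the default voice is used
    if voice not in PRESET_VOICES:
        raise ValueError(f"Unknown preset voice {voice!r}; pass ref_audio to clone instead")
    return {"voice": voice}


print(infer_kwargs(voice="Thái Sơn"))          # {'voice': 'Thái Sơn'}
print(infer_kwargs(ref_audio="my_voice.wav"))  # {'ref_audio': 'my_voice.wav'}
```

With a `tts` object from the SDK example, the result would be passed along as `tts.infer(text, **infer_kwargs(voice="Thái Sơn"))`.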

🔬 Model Variants

Model	Format	Device	Sample Rate	Quality	Features
VieNeu-TTS-v3-Turbo	PyTorch	GPU/CPU	48 kHz	⭐⭐⭐⭐⭐	Default voices, Cloning, Emotion cues
VieNeu-TTS-v2	PyTorch	GPU/CPU	24 kHz	⭐⭐⭐⭐⭐	Podcast, En-Vi code-switching
VieNeu-TTS-v2 (GGUF)	GGUF Q4	CPU	24 kHz	⭐⭐⭐⭐	Fastest on CPU, Podcast
VieNeu-TTS-v1	PyTorch	GPU	24 kHz	⭐⭐⭐⭐	Stable (Vi only)

📑 Citation

@misc{vieneutts2026,
  title        = {VieNeu-TTS v3 Turbo: 48kHz Vietnamese Text-to-Speech with Instant Voice Cloning and Emotion Control},
  author       = {Pham Nguyen Ngoc Bao},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/pnnbao-ump/VieNeu-TTS-v3-Turbo}}
}

Made with ❤️ for the Vietnamese TTS community

Model size: 0.1B params (Safetensors, tensor type BF16)
