🦜 VieNeu-TTS v3 Turbo


Overview

VieNeu-TTS v3 Turbo is the next generation of Vietnamese TTS — 48 kHz high-fidelity speech, instant voice cloning, built-in multi-speaker default voices, inline emotion cues, and seamless bilingual (En–Vi) code-switching. It is a pure-PyTorch engine running on both GPU and CPU, using the MOSS-Audio-Tokenizer-Nano codec.

Early access. v3 Turbo is released for preview. It is fast and natural, but some features (notably the emotion cues) are still experimental. The full v3 release is coming in the next few weeks.

What's new in v3:

  • 48 kHz audio — a big jump in fidelity over v2 (24 kHz).
  • Built-in default voices — each default speaker is addressed by a dedicated speaker token + fixed reference, so the voice is stable and consistent with no reference clip needed.
  • Emotion / non-verbal cues (experimental): drop [cười] (laugh), [thở dài] (sigh), or [hắng giọng] (throat clearing) straight into your text.
  • Batched generation — synthesize many chunks at once (batch size up to 32), including a multi-speaker conversation mode that batches the whole script regardless of speaker.
  • Instant Voice Cloning — still clones a voice from just 3–5 seconds of audio (cloning is available from v3 onward; v1/v2 do not support it).
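
The batch limit in the list above can be respected on the caller's side. The following is a minimal sketch, assuming you split long text into sentence-sized chunks yourself. `split_sentences` and `make_batches` are hypothetical helpers, not part of the vieneu package; only the limit of 32 comes from the feature list.

```python
import re

MAX_BATCH = 32  # upper bound stated in the feature list above


def split_sentences(text):
    """Split on sentence-ending punctuation, keeping the punctuation."""
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    return [p for p in parts if p]


def make_batches(chunks, batch_size=MAX_BATCH):
    """Group chunks into consecutive lists of at most batch_size items."""
    if not 1 <= batch_size <= MAX_BATCH:
        raise ValueError(f"batch_size must be between 1 and {MAX_BATCH}")
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]


chunks = split_sentences("Xin chào! Hôm nay trời đẹp quá. Mình đi dạo nhé?")
# Repeat the three sentences to simulate a long script of 60 chunks.
batches = make_batches(chunks * 20, batch_size=32)
print([len(b) for b in batches])
```

Each batch can then be handed to whatever batched entry point the SDK exposes; that call is deliberately left out here because its name is not given in this README.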

🏗️ Architecture & Credits

The VieNeu-TTS v3 Turbo architecture is an original design by the author, Phạm Nguyễn Ngọc Bảo, and is trained from scratch on ~10,000 hours of English–Vietnamese speech — it is not a fine-tune, distillation, or adaptation of any existing TTS model.

  • Model architecture & training: designed and trained from scratch by Phạm Nguyễn Ngọc Bảo (https://github.com/pnnbao97)
  • Audio codec: MOSS-Audio-Tokenizer-Nano (OpenMOSS-Team) — 48 kHz neural audio codec.
  • Phonemizer: sea-g2p — fast Vietnamese/English grapheme-to-phoneme, also by the author.

Author: Phạm Nguyễn Ngọc Bảo

☕ Support This Project

Training high-quality TTS models requires significant GPU resources. If you find this model useful, please consider supporting the development:

Buy Me a Coffee


🔥 Quick Start (Web UI)

git clone https://github.com/pnnbao97/VieNeu-TTS.git
cd VieNeu-TTS
  • Option 1: CPU (minimal, torch-free) — runs v3 Turbo via ONNX
uv sync
  • Option 2: GPU, for v3 Turbo (PyTorch) + VieNeu-TTS v2 (GPU)
uv sync --group gpu

Start the Web UI:

uv run vieneu-web

In the Web UI, pick "VieNeu-TTS-v3-Turbo (Thử nghiệm)" as the backbone ("Thử nghiệm" means "experimental"). You get a Default voice tab, a Voice Cloning tab, and a Conversation tab (batched multi-speaker podcasts).
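
The Conversation tab batches a whole script regardless of speaker. One way to picture that is sketched below, assuming a script is a list of (speaker, text) turns. `plan_batches` is a hypothetical helper and says nothing about how the Web UI is actually implemented.

```python
from itertools import islice


def plan_batches(script, batch_size=32):
    """Yield batches of (turn_index, speaker, text), ignoring speaker boundaries."""
    indexed = ((i, speaker, text) for i, (speaker, text) in enumerate(script))
    while True:
        batch = list(islice(indexed, batch_size))
        if not batch:
            return
        yield batch


script = [
    ("Ngọc Linh", "Chào mừng các bạn đến với podcast."),
    ("Xuân Vĩnh", "Hôm nay chúng ta nói về TTS."),
    ("Ngọc Linh", "Bắt đầu thôi!"),
]
batches = list(plan_batches(script, batch_size=2))
# Turn indices let the audio be reassembled in script order after synthesis.
order = [i for batch in batches for (i, _, _) in batch]
print(order)
```

Carrying the turn index alongside each line is the design choice that matters: speakers can be mixed freely inside a batch because the original order is recoverable afterwards.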


📦 Using Python SDK (vieneu)

pip install vieneu

Full Features Guide

from time import perf_counter

from vieneu import Vieneu

# Default = v3 Turbo. CPU → ONNX (torch-free); GPU → PyTorch (auto-detected).
tts = Vieneu()

text = """[cười] Trời ơi, cái giọng nó tự nhiên mà nó mượt mà dã man, nghe không khác gì người thật luôn. Giờ thì tha hồ mà quẩy content với cả kho giọng nói đa dạng, đủ mọi sắc thái biểu cảm. Mọi người bật loa lên rồi cùng trải nghiệm thử với mình nhé!"""

# 1. Default voice (used when voice= is omitted): 48 kHz, no reference needed
start_time = perf_counter()
audio = tts.infer(text)
tts.save(audio, "output.wav")
print(f"Time taken: {perf_counter() - start_time:.2f} seconds")

# 2. Built-in voices by name
for label, voice_id in tts.list_preset_voices():
    print(label, voice_id)
audio = tts.infer("Mình là Xuân Vĩnh nè!", voice="Xuân Vĩnh")
tts.save(audio, "output_Xuân Vĩnh.wav")

# 3. Emotion / non-verbal cues (EXPERIMENTAL): [cười] [thở dài] [hắng giọng]
# Uncomment to try:
# audio = tts.infer("Nghe hay quá đi [cười]. Để mình nói tiếp [hắng giọng].", voice="Ngọc Linh")

# 4. Instant voice cloning from a 3–5s reference clip
# Uncomment to try:
# audio = tts.infer("Đây là giọng được nhân bản tức thì.", ref_audio="my_voice.wav")
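
Because the emotion cues are experimental, it can help to check the bracketed tags in a text before synthesis. The sketch below is plain Python and does not call vieneu. `KNOWN_CUES` holds only the three tags named in this README (the real set may be larger), and `find_cues`, `unknown_cues`, and `strip_cues` are hypothetical helper names.

```python
import re
import unicodedata

# Cue tags listed in this README; treat the set as illustrative, not exhaustive.
KNOWN_CUES = {unicodedata.normalize("NFC", c) for c in ("cười", "thở dài", "hắng giọng")}
CUE_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_cues(text):
    """Return every bracketed tag in text, NFC-normalized, in order of appearance."""
    text = unicodedata.normalize("NFC", text)
    return [m.group(1).strip() for m in CUE_PATTERN.finditer(text)]


def unknown_cues(text):
    """Return bracketed tags that are not in KNOWN_CUES."""
    return [cue for cue in find_cues(text) if cue not in KNOWN_CUES]


def strip_cues(text):
    """Remove bracketed tags (and the space before them), then tidy whitespace."""
    without_tags = re.sub(r"\s*\[[^\[\]]+\]", "", text)
    return re.sub(r"\s+", " ", without_tags).strip()


line = "Nghe hay quá đi [cười]. Để mình nói tiếp [hắng giọng]."
print(find_cues(line))
print(unknown_cues("Chào [khóc]"))
print(strip_cues(line))
```

`strip_cues` is useful as a fallback when the same text is sent to a model variant that does not list emotion cues among its features.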

A temperature around 0.8 gives the most stable result for v3 Turbo. Higher values add expressiveness but can be less stable.
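
For intuition about why lower temperatures are more stable: in sampling-based generators in general, scores are divided by the temperature before being turned into probabilities, so a lower value concentrates probability on the top candidates. The toy example below is a generic illustration with made-up logits, not VieNeu-TTS's actual sampler, and it does not show how to pass a temperature to the SDK.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.8, 1.0, 1.2):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

The printed rows show the top candidate's share shrinking as the temperature rises, which is the trade-off between stability and expressiveness described above.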


🎭 Default Voices

Built-in voices — call them by name via voice="<name>", no reference audio required.

| Voice | Gender | Style |
|---|---|---|
| Ngọc Lan (default) | Female | Soft / gentle |
| Ngọc Linh | Female | Bright |
| Trúc Ly | Female | Youthful |
| Mỹ Duyên | Female | Smooth |
| Xuân Vĩnh | Male | Upbeat |
| Thái Sơn | Male | Firm |
| Gia Bảo | Male | Smooth |
| Đức Trí | Male | Clear |
| Trọng Hữu | Male | Knowledgeable |
| Bình An | Male | Even / calm |

For any other voice, use Voice Cloning with a short reference clip (ref_audio="...").
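
The choice between a preset name and a reference clip can be made explicit in a small wrapper. This is a sketch under assumptions: `PRESET_VOICES` is copied by hand from the table above (at runtime, `tts.list_preset_voices()` is the better source), `infer_kwargs` is a hypothetical helper, and rejecting both arguments at once is this helper's own policy, not documented SDK behavior.

```python
PRESET_VOICES = {
    "Ngọc Lan", "Ngọc Linh", "Trúc Ly", "Mỹ Duyên", "Xuân Vĩnh",
    "Thái Sơn", "Gia Bảo", "Đức Trí", "Trọng Hữu", "Bình An",
}


def infer_kwargs(voice=None, ref_audio=None):
    """Build keyword arguments for tts.infer from a preset name or a reference clip."""
    if voice is not None and ref_audio is not None:
        raise ValueError("pass either voice or ref_audio, not both")
    if ref_audio is not None:
        return {"ref_audio": ref_audio}  # voice cloning path
    if voice is None:
        return {}  # fall back to the default voice
    if voice not in PRESET_VOICES:
        raise ValueError(f"unknown preset {voice!r}; use ref_audio for cloning")
    return {"voice": voice}


print(infer_kwargs(voice="Thái Sơn"))
print(infer_kwargs(ref_audio="my_voice.wav"))
```

With a `tts` object created as in the SDK example, the result would be used as `tts.infer(text, **infer_kwargs(voice="Thái Sơn"))`.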


🔬 Model Variants

| Model | Format | Device | Sample Rate | Quality | Features |
|---|---|---|---|---|---|
| VieNeu-TTS-v3-Turbo | PyTorch | GPU/CPU | 48 kHz | ⭐⭐⭐⭐⭐ | Default voices, Cloning, Emotion cues |
| VieNeu-TTS-v2 | PyTorch | GPU/CPU | 24 kHz | ⭐⭐⭐⭐⭐ | Podcast, En-Vi code-switching |
| VieNeu-TTS-v2 (GGUF) | GGUF Q4 | CPU | 24 kHz | ⭐⭐⭐⭐ | Fastest on CPU, Podcast |
| VieNeu-TTS-v1 | PyTorch | GPU | 24 kHz | ⭐⭐⭐⭐ | Stable (Vi only) |

📑 Citation

@misc{vieneutts2026,
  title        = {VieNeu-TTS v3 Turbo: 48kHz Vietnamese Text-to-Speech with Instant Voice Cloning and Emotion Control},
  author       = {Pham Nguyen Ngoc Bao},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/pnnbao-ump/VieNeu-TTS-v3-Turbo}}
}

Made with ❤️ for the Vietnamese TTS community

Model size: 0.1B params (Safetensors, tensor type BF16)