Confucius4-TTS: a Multilingual and Cross-Lingual Zero-Shot TTS Engine
One voice. Any language.
Confucius4-TTS is an advanced LLM-based text-to-speech (TTS) system designed for multilingual and cross-lingual speech synthesis. Built on a speech encoder + large language model (LLM) architecture, Confucius4-TTS enables high-quality speech generation while preserving speaker identity across languages. You can try our online demo at https://confucius4-tts.youdao.com/gradio.
✨ Key Features
- 14 Languages Supported: Chinese, English, Japanese, Korean, German, French, Spanish, Indonesian, Italian, Thai, Portuguese, Russian, Malay and Vietnamese (more coming soon)
- Unconstrained Voice Cloning: No reference transcript required
- Cross-Lingual Voice Transfer: Unaccented speech synthesis across 14 languages
- Zero-Shot Voice Transfer: Clone voices without additional training
- Seamless Emotion Transfer: Clone the feeling, not just the voice
- Robust Generalization: Stable performance in real-world multilingual scenarios
With strong cross-lingual generalization, Confucius4-TTS allows users to seamlessly switch languages while keeping the same voice, delivering fluent, natural, and expressive speech.
Contents
🛠 Installation
Requirements
- Python 3.10
- CUDA 12.6
Setup
- Clone the repository:
git clone https://github.com/netease-youdao/Confucius4-TTS.git
cd Confucius4-TTS
- Create and activate a conda environment:
conda create -n confuciustts python=3.10 -y
conda activate confuciustts
- Install dependencies:
pip install -r requirements.txt
🚀 Inference
Use the provided example.py script for zero-shot TTS synthesis:
python example.py \
--prompt_wav path/to/reference.wav \
--text "Your text to synthesize" \
--lang en \
--out output.wav \
--config config/inference_config.yaml
You can also use the Python API directly:
import torch
import torchaudio
from confuciustts.cli.inference import ConfuciusTTS
model = ConfuciusTTS(
config_path="config/inference_config.yaml",
device="cuda" if torch.cuda.is_available() else "cpu",
)
audio = model.generate(
text="Hello, welcome to Confucius4-TTS.",
lang="en",
prompt_wav="path/to/reference.wav",
verbose=True,
)
torchaudio.save("output.wav", audio.cpu(), model.sample_rate)
🚀 Fine-Tuning
Confucius4-TTS follows a "speech encoder + LLM" architecture. The training pipeline covers two modules:
- Text2Semantic (T2S): generates semantic token sequences from text and speaker conditioning.
- Semantic2Acoustic (S2A): a flow-matching model that converts semantic tokens into mel spectrograms.
1. Prepare Pretrained Models
Download the two external models:
# Wav2Vec2-BERT (speaker conditioning & semantic feature extraction)
huggingface-cli download facebook/w2v-bert-2.0 \
--local-dir pretrained/w2v-bert-2.0
# Amphion MaskGCT (semantic codec implementation)
git clone https://github.com/open-mmlab/Amphion.git external/Amphion
After downloading, your directory should look like:
checkpoints/
├── t2s_model.safetensors # pretrained T2S weights
├── s2a_model.pt # pretrained S2A weights
├── wav2vec2bert_stats.pt # semantic feature normalization statistics
├── special_tokens_map.json # tokenizer files
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json
pretrained/
├── w2v-bert-2.0/ # Wav2Vec2-BERT model
└── campplus/
└── campplus_cn_common.bin # CAMPPlus speaker encoder checkpoint
external/
└── Amphion/ # MaskGCT semantic codec implementation
2. Prepare Training Data
Training data is provided as TSV files (tab-separated, no header) with the following 5 columns:
| Column | Description |
|---|---|
lang |
Language code (e.g. zh, en, ja) |
wav_path |
Path to the target audio |
norm_text |
Normalized text |
semantic_ids_path |
Pre-extracted semantic tokens (.npy file path) |
ref_audio_paths |
Reference audio path(s), comma-separated for multiple |
Configure the train/validation paths in config/train_t2s.yaml:
data:
train_data_path:
- data/train.tsv
val_data_path:
- data/val.tsv
3. Launch T2S Training
Set the pretrained T2S checkpoint path in config/train_t2s.yaml:
paths:
t2s_checkpoint: checkpoints/t2s_model.safetensors
Single-node training:
python -m confuciustts.cli.train_t2s -c config/train_t2s.yaml
4. Launch S2A Training
Set the checkpoint paths in config/train_s2a.yaml. t2s_checkpoint points to the frozen T2S backbone; s2a_checkpoint is optional and can be used to resume from a pretrained S2A model:
paths:
t2s_checkpoint: checkpoints/t2s_model.safetensors
s2a_checkpoint: checkpoints/s2a_model.pt # optional: resume from pretrained S2A
Single-node training:
python -m confuciustts.cli.train_s2a -c config/train_s2a.yaml
During S2A training, the T2S model, speaker encoder (Wav2Vec2-BERT), and style encoder (CAMPPlus) are all frozen. Only the flow-matching S2A model is trained.
📊 Performance
Confucius4-TTS achieves competitive results on multilingual and cross-lingual zero-shot TTS benchmarks, with strong intelligibility and speaker similarity across multiple languages.
Lower is better for WER/CER (↓), and higher is better for SIM (↑).
CV3-eval Cross-lingual
CV3-eval Cross-lingual Results (click to expand)
| Direction | Metric | Confucius4-TTS | F5-TTS† | Spark-TTS | CosyVoice2† | CosyVoice3-0.5B† | CosyVoice3-0.5B + DiffRO† | CosyVoice3-1.5B† | CosyVoice3-1.5B + DiffRO† |
|---|---|---|---|---|---|---|---|---|---|
| en→zh | WER↓ | 6.71 | 11.60 | 12.40 | 13.50 | 8.48 | 5.16 | 8.01 | 5.09 |
| ja→zh | WER↓ | 4.93 | – | – | 48.10 | 6.86 | 3.22 | 6.78 | 3.05 |
| ko→zh | WER↓ | 1.46 | – | – | 7.70 | 5.24 | 1.03 | 3.30 | 1.06 |
| zh→en | WER↓ | 3.19 | 5.57 | 7.36 | 17.10 | 6.83 | 4.41 | 5.39 | 4.20 |
| ja→en | WER↓ | 3.44 | – | – | 11.20 | 5.86 | 4.78 | 5.94 | 4.19 |
| ko→en | WER↓ | 3.42 | – | – | 13.10 | 18.30 | 7.91 | 13.70 | 7.08 |
† Requires reference text.
X-Voice Benchmark
X-Voice Cross-lingual Results (click to expand)
| Direction | Metric | Confucius4-TTS | X-Voice | OmniVoice† | IndexTTS2 |
|---|---|---|---|---|---|
| de→zh | WER↓ | 2.86 | 3.07 | 13.10 | 3.46 |
| SIM↑ | 0.569 | 0.516 | 0.691 | 0.544 | |
| en→zh | WER↓ | 3.27 | 3.06 | 4.03 | 3.78 |
| SIM↑ | 0.504 | 0.443 | 0.544 | 0.485 | |
| fr→zh | WER↓ | 2.74 | 3.01 | 18.10 | 3.53 |
| SIM↑ | 0.550 | 0.518 | 0.686 | 0.543 | |
| ja→zh | WER↓ | 3.50 | 3.39 | 79.10 | 4.11 |
| SIM↑ | 0.637 | 0.629 | 0.709 | 0.650 | |
| ko→zh | WER↓ | 2.86 | 3.13 | 11.88 | 2.90 |
| SIM↑ | 0.649 | 0.655 | 0.718 | 0.650 | |
| th→zh | WER↓ | 2.87 | 2.79 | 3.30 | 3.08 |
| SIM↑ | 0.623 | 0.614 | 0.661 | 0.622 | |
| vi→zh | WER↓ | 2.75 | 2.78 | 10.51 | 2.98 |
| SIM↑ | 0.640 | 0.641 | 0.701 | 0.641 |
† Requires reference text.
Seed-TTS-eval
Seed-TTS-eval English & Chinese Zero-shot Results (click to expand)
| Language | Metric | Confucius4-TTS | Qwen3-TTS | FishAudio S2† | OmniVoice† | VoxCPM2† | X-Voice |
|---|---|---|---|---|---|---|---|
| English | WER↓ | 1.49 | 1.24 | 0.99 | 1.60 | 1.84 | 1.91 |
| SIM↑ | 0.70 | 0.714 | – | 0.741 | 0.753 | 0.627 | |
| Chinese | CER↓ | 0.94 | 0.77 | 0.54 | 0.84 | 0.97 | 1.47 |
| SIM↑ | 0.765 | 0.770 | – | 0.777 | 0.795 | 0.746 |
† Requires reference text.
MiniMax-Multilingual-Test
MiniMax-Multilingual-Test Results (click to expand)
| Language | Metric | Confucius4-TTS | ElevenLab | Qwen3-TTS | FishAudio S2† | OmniVoice† | VoxCPM2† | X-Voice |
|---|---|---|---|---|---|---|---|---|
| German | WER↓ | 0.47 | 0.57 | 1.24 | 0.55 | 0.96 | 0.68 | 2.00 |
| SIM↑ | 0.775 | 0.614 | 0.768 | 0.767 | 0.812 | 0.803 | 0.763 | |
| French | WER↓ | 3.66 | 5.22 | 2.86 | 3.05 | 3.35 | 4.53 | 4.73 |
| SIM↑ | 0.723 | 0.535 | 0.716 | 0.698 | 0.801 | 0.735 | 0.746 | |
| Indonesian | WER↓ | 1.12 | 1.06 | – | 1.46 | 1.97 | 1.08 | 1.47 |
| SIM↑ | 0.765 | 0.660 | – | 0.763 | 0.805 | 0.800 | 0.725 | |
| Korean | WER↓ | 1.84 | 1.87 | 1.76 | 1.18 | 2.65 | 1.96 | 2.27 |
| SIM↑ | 0.812 | 0.700 | 0.790 | 0.817 | 0.828 | 0.833 | 0.788 | |
| Thai | WER↓ | 1.56 | 73.94 | – | 4.23 | 3.98 | 2.96 | 4.71 |
| SIM↑ | 0.773 | 0.588 | – | 0.786 | 0.841 | 0.840 | 0.791 | |
| Japanese | WER↓ | 4.14 | 10.65 | 3.82 | 2.76 | 4.03 | 4.63 | 7.13 |
| SIM↑ | 0.788 | 0.738 | 0.771 | 0.796 | 0.828 | 0.828 | 0.765 | |
| Vietnamese | WER↓ | 1.61 | 73.42 | – | 7.41 | 1.37 | 3.31 | 1.40 |
| SIM↑ | 0.751 | 0.369 | – | 0.740 | 0.805 | 0.806 | 0.672 | |
| Italian | WER↓ | 1.30 | 1.74 | 0.95 | 1.27 | 2.07 | 1.56 | 2.27 |
| SIM↑ | 0.787 | 0.579 | 0.752 | 0.747 | 0.812 | 0.780 | 0.780 | |
| Portuguese | WER↓ | 2.48 | 1.33 | 1.53 | 1.14 | 2.51 | 1.94 | 2.61 |
| SIM↑ | 0.796 | 0.711 | 0.805 | 0.781 | 0.859 | 0.837 | 0.794 | |
| Spanish | WER↓ | 1.02 | 1.08 | 1.13 | 0.91 | 1.03 | 1.44 | 2.91 |
| SIM↑ | 0.778 | 0.615 | 0.814 | 0.776 | 0.804 | 0.831 | 0.747 | |
| Russian | WER↓ | 4.64 | 3.88 | 3.21 | 2.40 | 2.23 | 3.63 | 6.49 |
| SIM↑ | 0.787 | 0.675 | 0.784 | 0.790 | 0.783 | 0.811 | 0.799 |
† Requires reference text.
Acknowledgements
Confucius4-TTS builds on the following open-source projects:
- Qwen3-TTS — Speaker encoder (ECAPA-TDNN) and text embedding projector architectures
- CosyVoice — Text normalization pipeline
- Amphion / MaskGCT — Semantic codec implementation
- w2v-BERT 2.0 — Semantic feature extraction and speaker conditioning
- Seed-VC — Flow matching architecture reference
- BigVGAN — High-fidelity neural vocoder for mel-spectrogram to waveform synthesis
Citation
If you find Confucius4-TTS useful in your research or project, please consider citing:
@misc{confucius4tts_2026,
title = {Confucius4-TTS: A Multilingual and Cross-Lingual Zero-Shot TTS Engine},
author = {{NetEase Youdao}},
year = {2026},
howpublished = {\url{https://github.com/netease-youdao/Confucius4-TTS}},
note = {GitHub repository}
}