PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis
English | 中文
Paper | HuggingFace | ModelScope | Demos
News
- [2025.05] Release Pilot-TTS base and instruct model weights
Highlight
PilotTTS is an LLM-based text-to-speech (TTS) system that pairs an intentionally simple architecture, built entirely from open-source components, with rigorous data engineering to reach competitive performance.
Key Features
- A fully open-source data processing pipeline: We design a multi-stage pipeline covering quality assessment and enhancement, annotation, and quality filtering, with every operator implemented using publicly available tools. The pipeline converts large-scale Internet audio into clean, richly annotated training data at substantially lower cost.
- Content Consistency and Speaker Similarity Control: On the Seed-TTS test set, our model achieves state-of-the-art speaker similarity (0.862) and highly competitive content accuracy (CER 0.87%).
- Emotion and Paralinguistic Control: Supports controllable synthesis for 11 emotion categories (Happy, Sad, Fear, Angry, Contempt, Serious, Surprise, Blue, Concern, Disgust, Psychology) and 4 paralinguistic categories (LAUGH, BREATH, CRY, COUGH).
- Dialect Control: Supports 14 Chinese dialects and enables cross-dialect synthesis, with particular strength in synthesizing from Mandarin Chinese to the target dialect.
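The CER figure quoted above is a character error rate. As a rough illustration only (this is not the project's evaluation script, and the ASR model and text normalization used for the reported number are not described here), CER can be computed as the character-level edit distance between the reference text and a transcript of the synthesized audio, divided by the reference length. The function names below are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character in a six-character reference.
    print(f"{cer('今天天气真好', '今天天汽真好'):.3f}")
```

Real evaluations usually strip punctuation and normalize numerals before scoring, so numbers from this sketch are not directly comparable to the reported CER.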
Installation
Clone and install
git clone https://github.com/xxx/pilot-tts.git
cd pilot-tts
Environment setup
conda create -n pilot-tts python=3.10 -y
conda activate pilot-tts
pip install -r requirements.txt
Model download
1. Pilot-TTS models (our weights)
# ModelScope
from modelscope import snapshot_download
snapshot_download('xxx/Pilot-TTS', local_dir='pretrained_models/')
# HuggingFace
from huggingface_hub import snapshot_download
snapshot_download('xxx/Pilot-TTS', local_dir='pretrained_models/')
This includes: pilot_tts.pt, pilot_tts_instruct.pt, and tokenizer/.
2. Third-party open-source models
Download the following dependencies from their respective open-source projects:
from modelscope import snapshot_download
# Qwen3-0.6B (LLM backbone)
snapshot_download('Qwen/Qwen3-0.6B', local_dir='pretrained_models/Qwen3-0.6B')
# CosyVoice3 (flow-matching vocoder, includes campplus.onnx)
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/CosyVoice3-0.5B')
from huggingface_hub import snapshot_download
# w2v-bert-2.0 (audio feature extractor)
snapshot_download('facebook/w2v-bert-2.0', local_dir='pretrained_models/w2v-bert-2.0')
Note: wav2vec2bert_stats.pt (from MaskGCT) is included in the Pilot-TTS model package.
Final directory structure
pretrained_models/
├── pilot_tts.pt              # Base model (zero-shot voice cloning)
├── pilot_tts_instruct.pt     # Instruct model (emotion, paralanguage, dialect)
├── Qwen3-0.6B/               # LLM backbone (from Qwen)
├── w2v-bert-2.0/             # Audio feature extractor (from Meta)
├── wav2vec2bert_stats.pt     # Feature normalization stats (from MaskGCT)
└── CosyVoice3-0.5B/          # Flow-matching vocoder (from FunAudioLLM)
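Before running inference, it can help to confirm that the downloads landed where the layout above expects them. The following is a minimal sketch, not part of the repository; `missing_entries` is a hypothetical helper, and the file and directory names are copied from the listing above.

```python
from pathlib import Path

# Expected entries, taken from the directory listing above.
EXPECTED_FILES = ["pilot_tts.pt", "pilot_tts_instruct.pt", "wav2vec2bert_stats.pt"]
EXPECTED_DIRS = ["Qwen3-0.6B", "w2v-bert-2.0", "CosyVoice3-0.5B"]


def missing_entries(root: str = "pretrained_models") -> list[str]:
    """Return the expected files and directories that are absent under root."""
    base = Path(root)
    missing = [name for name in EXPECTED_FILES if not (base / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (base / name).is_dir()]
    return missing


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Missing under pretrained_models/:", ", ".join(missing))
    else:
        print("All expected model files are present.")
```

The check only looks at names, not at file contents or sizes, so a truncated download would still pass.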
Quick Start
Run all inference demos with a single command:
python demo.py
Inference
Python API
from demo import load_engine, synthesize

# Zero-shot voice cloning (base model)
engine = load_engine(
    config_path="configs/infer_pilot_tts.yaml",
    checkpoint="pretrained_models/pilot_tts.pt",
)
# Text: "Hello, world!"
synthesize(engine, text="你好，世界！",
           prompt_wav="assert/prompt.wav",
           output_path="output/clone.wav")

# Load instruct model (emotion, paralanguage, dialect)
engine_instruct = load_engine(
    config_path="configs/infer_pilot_tts_instruct.yaml",
    checkpoint="pretrained_models/pilot_tts_instruct.pt",
)

# Emotion synthesis. Text: "The weather is so nice today!"
synthesize(engine_instruct, text="今天天气真好啊！",
           prompt_wav="assert/prompt.wav",
           emotion="happy", output_path="output/happy.wav")

# Paralanguage. Text: "This is so funny <|LAUGH|> I can't stop"
synthesize(engine_instruct, text="这太好笑了<|LAUGH|>停不下来",
           prompt_wav="assert/prompt.wav",
           output_path="output/laugh.wav")

# Dialect (Henan). Text: "How about it? Let's go have some hulatang together"
synthesize(engine_instruct, text="中不中啊，咱俩一块儿去喝胡辣汤吧",
           prompt_wav="assert/prompt.wav",
           language="zh-henan", output_path="output/henan.wav")
Command Line
# Zero-shot voice cloning (base model). Text: "target text to synthesize"
python inference.py \
  --checkpoint pretrained_models/pilot_tts.pt \
  --prompt-wav assert/prompt.wav \
  --text "需要合成的目标文本" \
  --output output/zeroshot.wav

# Emotion synthesis (instruct model). Text: "The weather is so nice today, let's go play in the park!"
python inference.py \
  --config configs/infer_pilot_tts_instruct.yaml \
  --checkpoint pretrained_models/pilot_tts_instruct.pt \
  --prompt-wav assert/prompt.wav \
  --text "今天天气真好啊，我们去公园玩吧！" \
  --emotion happy \
  --output output/emotion.wav

# Paralanguage (instruct model). Text: "This joke is so funny <|LAUGH|> I really can't help it"
python inference.py \
  --config configs/infer_pilot_tts_instruct.yaml \
  --checkpoint pretrained_models/pilot_tts_instruct.pt \
  --prompt-wav assert/prompt.wav \
  --text "这个笑话太好笑了<|LAUGH|>我真的忍不住" \
  --output output/paralang.wav

# Dialect synthesis (instruct model). Text: "How about it? Let's go have some hulatang together"
python inference.py \
  --config configs/infer_pilot_tts_instruct.yaml \
  --checkpoint pretrained_models/pilot_tts_instruct.pt \
  --prompt-wav assert/prompt.wav \
  --text "中不中啊，咱俩一块儿去喝胡辣汤吧" \
  --language zh-henan \
  --output output/dialect.wav
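When several utterances need to be synthesized, the commands above can be generated from Python instead of typed by hand. The sketch below is not part of the repository: `build_command` and `run_batch` are hypothetical helpers that only use the flags shown above, and whether `--emotion` and `--language` may be combined in one call is an assumption this README does not confirm.

```python
import shlex
import subprocess
import sys

# Paths copied from the instruct-model commands above.
INSTRUCT_CONFIG = "configs/infer_pilot_tts_instruct.yaml"
INSTRUCT_CHECKPOINT = "pretrained_models/pilot_tts_instruct.pt"


def build_command(text, prompt_wav, output, emotion=None, language=None):
    """Build an argument list for one inference.py call (hypothetical helper)."""
    cmd = [
        sys.executable, "inference.py",
        "--config", INSTRUCT_CONFIG,
        "--checkpoint", INSTRUCT_CHECKPOINT,
        "--prompt-wav", prompt_wav,
        "--text", text,
    ]
    if emotion is not None:
        cmd += ["--emotion", emotion]
    if language is not None:
        cmd += ["--language", language]
    cmd += ["--output", output]
    return cmd


def run_batch(jobs, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for job in jobs:
        cmd = build_command(**job)
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    jobs = [
        {"text": "今天天气真好啊！", "prompt_wav": "assert/prompt.wav",
         "output": "output/happy.wav", "emotion": "happy"},
        {"text": "中不中啊，咱俩一块儿去喝胡辣汤吧", "prompt_wav": "assert/prompt.wav",
         "output": "output/henan.wav", "language": "zh-henan"},
    ]
    run_batch(jobs)  # dry run: prints the commands without executing them
```

Passing the arguments as a list avoids shell quoting problems with tags such as `<|LAUGH|>`, whose `<`, `|`, and `>` characters are otherwise interpreted by the shell.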
Supported Controls
| Feature | Usage | Model |
|---|---|---|
| Voice Cloning | Provide prompt audio | Both |
| Emotions | `--emotion <tag>` | Instruct |
| Paralanguage | Insert tags in text | Instruct |
| Dialects | `--language <dialect>` | Instruct |
Emotions:
| Tag | Emotion | Tag | Emotion |
|---|---|---|---|
| `happy` | Happy | `sad` | Sad |
| `angry` | Angry | `surprise` | Surprise |
| `fear` | Fear | `disgust` | Disgust |
| `serious` | Serious | `concern` | Concern |
| `blue` | Melancholy | `disdain` | Contempt |
| `neutral` | Neutral / calm | `psychology` | Inner monologue |
| `unknown` | No emotion specified | | |
Paralanguage tags:
| Tag | Description |
|---|---|
| `<\|LAUGH\|>` | Laughter |
| `<\|BREATH\|>` | Breathing |
| `<\|COUGH\|>` | Cough |
| `<\|CRY\|>` | Crying |
| `<\|LAUGH_SPAN\|>...<\|/LAUGH_SPAN\|>` | Wraps text to be spoken with laughter |
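Because paralanguage tags are typed directly into the input text, a misspelled or unclosed tag is easy to miss. The following sketch checks a string against the tags in the table above before it is sent to the model. It is a hypothetical helper, not repository code, and it assumes the tags are case-sensitive and that span tags do not need to nest.

```python
import re

# Tags listed in the table above.
POINT_TAGS = {"LAUGH", "BREATH", "COUGH", "CRY"}
SPAN_TAGS = {"LAUGH_SPAN"}

# Matches <|NAME|> and <|/NAME|>, capturing the optional slash and the name.
TAG_PATTERN = re.compile(r"<\|(/?)([^|<>/]+)\|>")


def check_paralanguage_tags(text: str) -> list[str]:
    """Return a list of problems found in the tags of text (empty if none)."""
    problems = []
    open_spans = []
    for match in TAG_PATTERN.finditer(text):
        closing = match.group(1) == "/"
        name = match.group(2)
        if name in POINT_TAGS and not closing:
            continue
        if name in SPAN_TAGS:
            if not closing:
                open_spans.append(name)
            elif open_spans and open_spans[-1] == name:
                open_spans.pop()
            else:
                problems.append(f"unmatched closing tag {match.group(0)}")
            continue
        problems.append(f"unknown tag {match.group(0)}")
    problems += [f"unclosed span <|{name}|>" for name in open_spans]
    return problems


if __name__ == "__main__":
    for sample in ["这太好笑了<|LAUGH|>停不下来", "你好<|SNEEZE|>"]:
        print(sample, check_paralanguage_tags(sample))
```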
Dialects:
| Tag | Dialect | Tag | Dialect |
|---|---|---|---|
| `zh-dongbei` | Northeastern | `zh-shandong` | Shandong |
| `zh-henan` | Henan | `zh-shan1xi` | Shanxi |
| `zh-minnan` | Minnan (Southern Min) | `zh-gansu` | Gansu |
| `zh-ningxia` | Ningxia | `zh-shanghai` | Shanghainese |
| `zh-chongqing` | Chongqing | `zh-hubei` | Hubei |
| `zh-hunan` | Hunan | `zh-jiangxi` | Jiangxi |
| `zh-guizhou` | Guizhou | `zh-yunnan` | Yunnan |
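Some tags are easy to mistype (for example `zh-shan1xi`), so it may be worth validating the `emotion` and `language` values before calling the model. The sketch below is a hypothetical helper built only from the tag names in the tables above; how the repository itself reacts to an unsupported tag is not documented here.

```python
import difflib

# Tag names copied from the Emotions and Dialects tables above.
EMOTION_TAGS = {
    "happy", "sad", "angry", "surprise", "fear", "disgust", "serious",
    "concern", "blue", "disdain", "neutral", "psychology", "unknown",
}
DIALECT_TAGS = {
    "zh-dongbei", "zh-shandong", "zh-henan", "zh-shan1xi", "zh-minnan",
    "zh-gansu", "zh-ningxia", "zh-shanghai", "zh-chongqing", "zh-hubei",
    "zh-hunan", "zh-jiangxi", "zh-guizhou", "zh-yunnan",
}


def check_tag(value: str, allowed: set, kind: str) -> str:
    """Return value if it is a supported tag, otherwise raise ValueError."""
    if value in allowed:
        return value
    # Suggest the closest supported tag, if any is reasonably similar.
    hint = difflib.get_close_matches(value, sorted(allowed), n=1)
    suggestion = f" Did you mean '{hint[0]}'?" if hint else ""
    raise ValueError(f"Unsupported {kind} tag '{value}'.{suggestion}")


if __name__ == "__main__":
    print(check_tag("happy", EMOTION_TAGS, "emotion"))
    print(check_tag("zh-henan", DIALECT_TAGS, "dialect"))
    try:
        check_tag("zh-shanxi", DIALECT_TAGS, "dialect")
    except ValueError as err:
        print(err)
```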
WebUI
Launch a Gradio-based interactive interface:
python webui.py --port 9000
Project Structure
pilot-tts/
├── configs/                 # Inference configurations (per checkpoint)
├── demo.py                  # Complete demo (all inference modes)
├── inference.py             # CLI inference entry
├── webui.py                 # Gradio WebUI
├── asset/                   # Example prompt audio
├── pilot_voice/             # Core model code
│   ├── engine.py            # InferenceEngine pipeline
│   ├── model.py             # AR model (Qwen3 backbone + audio tokens)
│   ├── sampling.py          # RAS sampling (from VALL-E 2)
│   ├── utils.py             # Utilities
│   ├── modules/             # Conformer + Perceiver modules
│   └── tools/               # Audio & text processing
├── third_party/
│   ├── cosyvoice/           # Flow-matching vocoder
│   └── Matcha-TTS/          # Flow matching dependency
├── tokenizer/               # Custom tokenizer with special tokens
├── pretrained_models/       # Model weights (not in git)
└── requirements.txt
Acknowledgements
- CosyVoice: flow matching and vocoder
- Qwen3: LLM backbone
- Matcha-TTS: flow matching framework
- MaskGCT: wav2vec2bert feature statistics
Citation
@article{pilottts2025,
title={PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis},
author={},
year={2025},
journal={arXiv preprint arXiv:xxxx.xxxxx}
}
License
Apache-2.0