PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

English | 中文

📑 Paper | 🤗 HuggingFace | 🤖 ModelScope | 🎧 Demos

News 📝

[2025.05] Released the Pilot-TTS base and instruct model weights.

Highlight 🔥

PilotTTS is an LLM-based text-to-speech (TTS) system that pairs an intentionally simple architecture, built entirely from open-source components, with rigorous data engineering to reach competitive performance.

Key Features

A fully open-source data processing pipeline: We design a multi-stage pipeline covering quality assessment and enhancement, annotation, and quality filtering, in which every operator is implemented with publicly available tools. The pipeline converts large-scale Internet audio into clean, richly annotated training data while substantially reducing data-preparation costs.
Content Consistency and Speaker Similarity Control: On the Seed-TTS test set, our model achieves state-of-the-art speaker similarity (0.862) and highly competitive content accuracy (CER 0.87%).
Emotion and Paralinguistic Control: Supports controllable synthesis for 11 emotion categories (Happy, Sad, Fear, Angry, Contempt, Serious, Surprise, Blue, Concern, Disgust, Psychology) and 4 paralinguistic categories (LAUGH, BREATH, CRY, COUGH).
Dialect Control: Supports 14 Chinese dialects and enables cross-dialect synthesis, with particular strength in synthesizing from Mandarin Chinese to the target dialect.
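
As a rough illustration of what the two headline metrics above measure, the sketch below computes a character error rate (CER) as Levenshtein edit distance divided by reference length, and a speaker similarity as the cosine similarity between two embedding vectors. The function names are hypothetical, and the ASR model and speaker-embedding model used in the actual evaluation are not part of this sketch, so it will not reproduce the reported numbers on its own.

```python
import math


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)
```

In a real evaluation, `hypothesis` would be an ASR transcript of the synthesized audio, and the two vectors would be speaker embeddings of the prompt and the synthesized audio.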

Installation ⚙️

Clone the repository

git clone https://github.com/xxx/pilot-tts.git
cd pilot-tts

Environment setup

conda create -n pilot-tts python=3.10 -y
conda activate pilot-tts
pip install -r requirements.txt

Model download

1. Pilot-TTS models (our weights)

# ModelScope
from modelscope import snapshot_download
snapshot_download('xxx/Pilot-TTS', local_dir='pretrained_models/')

# HuggingFace
from huggingface_hub import snapshot_download
snapshot_download('xxx/Pilot-TTS', local_dir='pretrained_models/')

This package includes pilot_tts.pt, pilot_tts_instruct.pt, tokenizer/, and wav2vec2bert_stats.pt.

2. Third-party open-source models

Download the following dependencies from their respective open-source projects:

from modelscope import snapshot_download

# Qwen3-0.6B (LLM backbone)
snapshot_download('Qwen/Qwen3-0.6B', local_dir='pretrained_models/Qwen3-0.6B')

# CosyVoice3 (flow-matching vocoder, includes campplus.onnx)
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/CosyVoice3-0.5B')

from huggingface_hub import snapshot_download

# w2v-bert-2.0 (audio feature extractor)
snapshot_download('facebook/w2v-bert-2.0', local_dir='pretrained_models/w2v-bert-2.0')

Note: wav2vec2bert_stats.pt (from MaskGCT) is included in the Pilot-TTS model package.

Final directory structure

pretrained_models/
├── pilot_tts.pt              # Base model (zero-shot voice cloning)
├── pilot_tts_instruct.pt     # Instruct model (emotion, paralanguage, dialect)
├── Qwen3-0.6B/              # LLM backbone (from Qwen)
├── w2v-bert-2.0/            # Audio feature extractor (from Meta)
├── wav2vec2bert_stats.pt    # Feature normalization stats (from MaskGCT)
└── CosyVoice3-0.5B/        # Flow-matching vocoder (from FunAudioLLM)
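
Before running inference, it can help to confirm that the layout above is complete. The following is a minimal sketch that only checks whether each listed entry exists; `missing_entries` and `EXPECTED_ENTRIES` are hypothetical names, not part of the repository, and the check does not validate file contents.

```python
from pathlib import Path

# Entries taken from the directory listing above.
EXPECTED_ENTRIES = [
    "pilot_tts.pt",
    "pilot_tts_instruct.pt",
    "Qwen3-0.6B",
    "w2v-bert-2.0",
    "wav2vec2bert_stats.pt",
    "CosyVoice3-0.5B",
]


def missing_entries(root="pretrained_models"):
    """Return the expected files or directories that are absent under root."""
    root_path = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root_path / name).exists()]


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected entries are present.")
```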

Quick Start 📖

Run all inference demos with a single command:

python demo.py

Inference

Python API

from demo import load_engine, synthesize

# Zero-shot voice cloning (base model).
# The example prompt audio lives under asset/ (see Project Structure).
engine = load_engine(
    config_path="configs/infer_pilot_tts.yaml",
    checkpoint="pretrained_models/pilot_tts.pt",
)
synthesize(engine, text="你好，世界！",  # "Hello, world!"
           prompt_wav="asset/prompt.wav",
           output_path="output/clone.wav")

# Load instruct model (emotion, paralanguage, dialect)
engine_instruct = load_engine(
    config_path="configs/infer_pilot_tts_instruct.yaml",
    checkpoint="pretrained_models/pilot_tts_instruct.pt",
)

# Emotion synthesis
synthesize(engine_instruct, text="今天天气真好啊！",  # "The weather is so nice today!"
           prompt_wav="asset/prompt.wav",
           emotion="happy", output_path="output/happy.wav")

# Paralanguage: the tag is inserted directly into the text
# "This is so funny <laugh> I can't stop"
synthesize(engine_instruct, text="这太好笑了<|LAUGH|>停不下来",
           prompt_wav="asset/prompt.wav",
           output_path="output/laugh.wav")

# Dialect (Henan)
# "How about it, shall we go have some hulatang together?"
synthesize(engine_instruct, text="中不中啊，咱俩一块儿去吃胡辣汤吧",
           prompt_wav="asset/prompt.wav",
           language="zh-henan", output_path="output/henan.wav")

Command Line

# Zero-shot voice cloning (base model)
# Text: "Target text to synthesize"
python inference.py \
    --checkpoint pretrained_models/pilot_tts.pt \
    --prompt-wav asset/prompt.wav \
    --text "需要合成的目标文本" \
    --output output/zeroshot.wav

# Emotion synthesis (instruct model)
# Text: "The weather is so nice today, let's go play in the park!"
python inference.py \
    --config configs/infer_pilot_tts_instruct.yaml \
    --checkpoint pretrained_models/pilot_tts_instruct.pt \
    --prompt-wav asset/prompt.wav \
    --text "今天天气真好啊，我们去公园玩吧！" \
    --emotion happy \
    --output output/emotion.wav

# Paralanguage (instruct model)
# Text: "This joke is so funny <laugh> I really can't help it"
python inference.py \
    --config configs/infer_pilot_tts_instruct.yaml \
    --checkpoint pretrained_models/pilot_tts_instruct.pt \
    --prompt-wav asset/prompt.wav \
    --text "这个笑话太好笑了<|LAUGH|>我真的忍不住" \
    --output output/paralang.wav

# Dialect synthesis (instruct model)
# Text: "How about it, shall we go have some hulatang together?"
python inference.py \
    --config configs/infer_pilot_tts_instruct.yaml \
    --checkpoint pretrained_models/pilot_tts_instruct.pt \
    --prompt-wav asset/prompt.wav \
    --text "中不中啊，咱俩一块儿去吃胡辣汤吧" \
    --language zh-henan \
    --output output/dialect.wav
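
For batch jobs, the commands above can be assembled programmatically. The sketch below builds an argument list from the flags shown in this section (`--config`, `--checkpoint`, `--prompt-wav`, `--text`, `--emotion`, `--language`, `--output`); `build_inference_command` is a hypothetical helper, not part of the repository, and the prompt path in the usage example is a placeholder.

```python
import shlex
import sys


def build_inference_command(checkpoint, prompt_wav, text, output,
                            config=None, emotion=None, language=None):
    """Assemble an argv list for inference.py using only the flags shown above."""
    cmd = [sys.executable, "inference.py"]
    if config:
        cmd += ["--config", config]
    cmd += ["--checkpoint", checkpoint,
            "--prompt-wav", prompt_wav,
            "--text", text]
    if emotion:
        cmd += ["--emotion", emotion]
    if language:
        cmd += ["--language", language]
    cmd += ["--output", output]
    return cmd


if __name__ == "__main__":
    cmd = build_inference_command(
        checkpoint="pretrained_models/pilot_tts_instruct.pt",
        prompt_wav="path/to/prompt.wav",  # placeholder path
        text="今天天气真好啊！",
        output="output/happy.wav",
        config="configs/infer_pilot_tts_instruct.yaml",
        emotion="happy",
    )
    # Print a shell-safe rendering; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` from the repository root avoids shell quoting issues with tags such as `<|LAUGH|>`.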

Supported Controls

| Feature | Usage | Model |
| --- | --- | --- |
| Voice Cloning | Provide prompt audio | Both |
| Emotions | `--emotion <tag>` | Instruct |
| Paralanguage | Insert tags in text | Instruct |
| Dialects | `--language <dialect>` | Instruct |

Emotions:

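| Tag | Emotion | Tag | Emotion |
| --- | --- | --- | --- |
| `happy` | Happy | `sad` | Sad |
| `angry` | Angry | `surprise` | Surprise |
| `fear` | Fear | `disgust` | Disgust |
| `serious` | Serious | `concern` | Concern |
| `blue` | Blue (melancholy) | `disdain` | Contempt |
| `neutral` | Neutral / calm | `psychology` | Psychology (inner thoughts) |
| `unknown` | No emotion specified | | |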

Paralanguage tags:

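| Tag | Description |
| --- | --- |
| `<\|LAUGH\|>` | Laughter |
| `<\|BREATH\|>` | Breathing sound |
| `<\|COUGH\|>` | Cough |
| `<\|CRY\|>` | Crying |
| `<\|LAUGH_SPAN\|>...<\|/LAUGH_SPAN\|>` | Wraps text that is spoken while laughing |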
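
Because paralanguage tags are typed directly into the input text, a typo can go unnoticed. The sketch below checks a string against the tags listed above before it is passed to synthesis. `check_paralanguage_tags` is a hypothetical helper; it assumes that tags are case-sensitive and that every `<|LAUGH_SPAN|>` needs a matching `<|/LAUGH_SPAN|>`, neither of which is stated explicitly here.

```python
import re

POINT_TAGS = {"LAUGH", "BREATH", "COUGH", "CRY"}
SPAN_TAG = "LAUGH_SPAN"
# Matches <|NAME|> and <|/NAME|>; group 1 is "/" for a closing tag.
TAG_PATTERN = re.compile(r"<\|(/?)([A-Za-z_]+)\|>")


def check_paralanguage_tags(text):
    """Return a list of problems found in the paralanguage tags of text."""
    problems = []
    open_spans = 0
    for match in TAG_PATTERN.finditer(text):
        closing = match.group(1) == "/"
        name = match.group(2)
        if name == SPAN_TAG:
            if not closing:
                open_spans += 1
            elif open_spans == 0:
                problems.append("closing span tag without an opening tag")
            else:
                open_spans -= 1
        elif name in POINT_TAGS and not closing:
            continue
        else:
            problems.append(f"unknown tag: {match.group(0)}")
    if open_spans:
        problems.append("unclosed <|LAUGH_SPAN|>")
    return problems
```

An empty list means every tag in the text is one of the documented ones and all spans are closed.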

Dialects:

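| Tag | Dialect | Tag | Dialect |
| --- | --- | --- | --- |
| `zh-dongbei` | Northeastern (Dongbei) | `zh-shandong` | Shandong |
| `zh-henan` | Henan | `zh-shan1xi` | Shanxi |
| `zh-minnan` | Minnan (Southern Min) | `zh-gansu` | Gansu |
| `zh-ningxia` | Ningxia | `zh-shanghai` | Shanghainese |
| `zh-chongqing` | Chongqing | `zh-hubei` | Hubei |
| `zh-hunan` | Hunan | `zh-jiangxi` | Jiangxi |
| `zh-guizhou` | Guizhou | `zh-yunnan` | Yunnan |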
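
The tag tables above can be turned into a small pre-flight check so that a misspelled emotion or dialect fails early instead of partway through synthesis. This is a sketch only: `validate_controls` is a hypothetical helper, and the sets contain just the tags listed in this section, so extend them if the repository accepts further values.

```python
EMOTION_TAGS = frozenset({
    "happy", "sad", "angry", "surprise", "fear", "disgust", "serious",
    "concern", "blue", "disdain", "neutral", "psychology", "unknown",
})

DIALECT_TAGS = frozenset({
    "zh-dongbei", "zh-shandong", "zh-henan", "zh-shan1xi", "zh-minnan",
    "zh-gansu", "zh-ningxia", "zh-shanghai", "zh-chongqing", "zh-hubei",
    "zh-hunan", "zh-jiangxi", "zh-guizhou", "zh-yunnan",
})


def validate_controls(emotion=None, language=None):
    """Return keyword arguments for synthesis, rejecting tags not listed above."""
    kwargs = {}
    if emotion is not None:
        if emotion not in EMOTION_TAGS:
            raise ValueError(f"unsupported emotion tag: {emotion!r}")
        kwargs["emotion"] = emotion
    if language is not None:
        if language not in DIALECT_TAGS:
            raise ValueError(f"unsupported dialect tag: {language!r}")
        kwargs["language"] = language
    return kwargs
```

The returned dictionary can be unpacked into the Python API call, for example `synthesize(engine_instruct, text=..., prompt_wav=..., output_path=..., **validate_controls(emotion="happy"))`.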

WebUI

Launch a Gradio-based interactive interface:

python webui.py --port 9000

Project Structure

pilot-tts/
├── configs/                     # Inference configurations (per checkpoint)
├── demo.py                      # Complete demo (all inference modes)
├── inference.py                 # CLI inference entry
├── webui.py                     # Gradio WebUI
├── asset/                       # Example prompt audio
├── pilot_voice/                 # Core model code
│   ├── engine.py                # InferenceEngine pipeline
│   ├── model.py                 # AR model (Qwen3 backbone + audio tokens)
│   ├── sampling.py              # RAS sampling (from VALL-E 2)
│   ├── utils.py                 # Utilities
│   ├── modules/                 # Conformer + Perceiver modules
│   └── tools/                   # Audio & text processing
├── third_party/
│   ├── cosyvoice/               # Flow-matching vocoder
│   └── Matcha-TTS/              # Flow matching dependency
├── tokenizer/                   # Custom tokenizer with special tokens
├── pretrained_models/           # Model weights (not in git)
└── requirements.txt
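
The tree lists sampling.py as RAS (repetition-aware sampling, introduced in VALL-E 2). As a rough illustration of the idea, and not the repository's implementation, the sketch below draws a token with nucleus (top-p) sampling and, if that token already dominates the recent history, redraws it from the full distribution to break out of repetition loops. All names and default values (`top_p`, `window`, `threshold`) are illustrative assumptions.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample from the smallest set of top tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, rng,
                            top_p=0.8, window=10, threshold=0.1):
    """Nucleus-sample a token; fall back to full sampling if it repeats too often."""
    if window <= 0:
        raise ValueError("window must be positive")
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        repetition_ratio = recent.count(token) / len(recent)
        if repetition_ratio > threshold:
            # Redraw from the unrestricted distribution to escape the loop.
            token = rng.choices(list(range(len(probs))), weights=probs, k=1)[0]
    return token
```

In an autoregressive decoding loop, `probs` would be the softmax output for the next audio token and `history` the tokens generated so far.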

Acknowledgements

CosyVoice — Flow-matching & Vocoder
Qwen3 — LLM backbone
Matcha-TTS — Flow matching framework
MaskGCT — wav2vec2bert feature statistics

Citation

@article{pilottts2025,
      title={PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis},
      author={},
      year={2025},
      journal={arXiv preprint arXiv:xxxx.xxxxx}
}

License

Apache-2.0
