PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

English  |  中文

📑 Paper  |  🤗 HuggingFace  |  🤖 ModelScope  |  🎧 Demos

News 📝

  • [2025.05] Released the Pilot-TTS base and instruct model weights.

Highlight 🔥

PilotTTS is an LLM-based text-to-speech (TTS) system with an intentionally simple architecture built entirely from open-source components. It reaches competitive performance through rigorous data engineering.

Key Features

  • A fully open-source data processing pipeline: We design a multi-stage pipeline covering quality assessment and enhancement, annotation, and quality filtering, with every operator implemented using publicly available tools. The pipeline converts large-scale Internet audio into clean, richly annotated training data, yielding high-quality data at substantially lower cost.
  • Content Consistency and Speaker Similarity Control: On the Seed-TTS test set, our model achieves state-of-the-art speaker similarity (0.862) and highly competitive content accuracy (CER 0.87%).
  • Emotion and Paralinguistic Control: Supports controllable synthesis for 11 emotion categories (Happy, Sad, Fear, Angry, Contempt, Serious, Surprise, Blue, Concern, Disgust, Psychology) and 4 paralinguistic categories (LAUGH, BREATH, CRY, COUGH).
  • Dialect Control: Supports 14 Chinese dialects and enables cross-dialect synthesis, and is particularly strong at synthesizing a target dialect from Mandarin Chinese.

Installation ⚙️

Clone and install

git clone https://github.com/xxx/pilot-tts.git
cd pilot-tts

Environment setup

conda create -n pilot-tts python=3.10 -y
conda activate pilot-tts
pip install -r requirements.txt

Model download

1. Pilot-TTS models (our weights)

# ModelScope
from modelscope import snapshot_download
snapshot_download('xxx/Pilot-TTS', local_dir='pretrained_models/')

# HuggingFace
from huggingface_hub import snapshot_download
snapshot_download('xxx/Pilot-TTS', local_dir='pretrained_models/')

This includes: pilot_tts.pt, pilot_tts_instruct.pt, and tokenizer/.

2. Third-party open-source models

Download the following dependencies from their respective open-source projects:

from modelscope import snapshot_download

# Qwen3-0.6B (LLM backbone)
snapshot_download('Qwen/Qwen3-0.6B', local_dir='pretrained_models/Qwen3-0.6B')

# CosyVoice3 (flow-matching vocoder, includes campplus.onnx)
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/CosyVoice3-0.5B')

# The remaining dependency is downloaded from HuggingFace
from huggingface_hub import snapshot_download

# w2v-bert-2.0 (audio feature extractor)
snapshot_download('facebook/w2v-bert-2.0', local_dir='pretrained_models/w2v-bert-2.0')

Note: wav2vec2bert_stats.pt (from MaskGCT) is included in the Pilot-TTS model package.

Final directory structure

pretrained_models/
├── pilot_tts.pt              # Base model (zero-shot voice cloning)
├── pilot_tts_instruct.pt     # Instruct model (emotion, paralanguage, dialect)
├── Qwen3-0.6B/               # LLM backbone (from Qwen)
├── w2v-bert-2.0/             # Audio feature extractor (from Meta)
├── wav2vec2bert_stats.pt     # Feature normalization stats (from MaskGCT)
└── CosyVoice3-0.5B/          # Flow-matching vocoder (from FunAudioLLM)
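Before running inference, it can help to confirm that every download landed where the listing above expects it. The following is a minimal sketch, not part of the repository: the helper name `missing_entries` is hypothetical, and the entry names are copied from the directory listing.

```python
from pathlib import Path

# Entries copied from the "Final directory structure" listing.
EXPECTED_ENTRIES = [
    "pilot_tts.pt",
    "pilot_tts_instruct.pt",
    "Qwen3-0.6B",
    "w2v-bert-2.0",
    "wav2vec2bert_stats.pt",
    "CosyVoice3-0.5B",
]


def missing_entries(root="pretrained_models"):
    """Return the expected files or directories that are absent under `root`."""
    root_path = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root_path / name).exists()]


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Missing from pretrained_models/:", ", ".join(missing))
    else:
        print("All expected model files are in place.")
```

The check only looks at names; it does not verify that a checkpoint is complete or uncorrupted.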

Quick Start 📖

Run all inference demos with a single command:

python demo.py

Inference

Python API

from demo import load_engine, synthesize

# Zero-shot voice cloning (base model)
engine = load_engine(
    config_path="configs/infer_pilot_tts.yaml",
    checkpoint="pretrained_models/pilot_tts.pt",
)
synthesize(engine, text="你好,世界!",  # "Hello, world!"
           prompt_wav="assert/prompt.wav",
           output_path="output/clone.wav")

# Load instruct model (emotion, paralanguage, dialect)
engine_instruct = load_engine(
    config_path="configs/infer_pilot_tts_instruct.yaml",
    checkpoint="pretrained_models/pilot_tts_instruct.pt",
)

# Emotion synthesis
synthesize(engine_instruct, text="今天天气真好啊!",  # "The weather is so nice today!"
           prompt_wav="assert/prompt.wav",
           emotion="happy", output_path="output/happy.wav")

# Paralanguage
synthesize(engine_instruct, text="这太好笑了<|LAUGH|>停不下来",  # "This is so funny [laugh] I can't stop"
           prompt_wav="assert/prompt.wav",
           output_path="output/laugh.wav")

# Dialect (Henan)
synthesize(engine_instruct, text="中不中啊,咱俩一块儿去吃胡辣汤吧",  # "How about it, shall we go have some hulatang together?"
           prompt_wav="assert/prompt.wav",
           language="zh-henan", output_path="output/henan.wav")

Command Line

# Zero-shot voice cloning (base model)
# The --text value is a placeholder meaning "the target text to synthesize"
python inference.py \
    --checkpoint pretrained_models/pilot_tts.pt \
    --prompt-wav assert/prompt.wav \
    --text "需要合成的目标文本" \
    --output output/zeroshot.wav

# Emotion synthesis (instruct model)
# Text: "The weather is so nice today, let's go play in the park!"
python inference.py \
    --config configs/infer_pilot_tts_instruct.yaml \
    --checkpoint pretrained_models/pilot_tts_instruct.pt \
    --prompt-wav assert/prompt.wav \
    --text "今天天气真好啊,我们去公园玩吧!" \
    --emotion happy \
    --output output/emotion.wav

# Paralanguage (instruct model)
# Text: "This joke is so funny [laugh] I really can't help it"
python inference.py \
    --config configs/infer_pilot_tts_instruct.yaml \
    --checkpoint pretrained_models/pilot_tts_instruct.pt \
    --prompt-wav assert/prompt.wav \
    --text "这个笑话太好笑了<|LAUGH|>我真的忍不住" \
    --output output/paralang.wav

# Dialect synthesis (instruct model)
# Text: "How about it, shall we go have some hulatang together?"
python inference.py \
    --config configs/infer_pilot_tts_instruct.yaml \
    --checkpoint pretrained_models/pilot_tts_instruct.pt \
    --prompt-wav assert/prompt.wav \
    --text "中不中啊,咱俩一块儿去吃胡辣汤吧" \
    --language zh-henan \
    --output output/dialect.wav
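For batch jobs, the same commands can be assembled programmatically. This is a minimal sketch that assumes only the flags shown above (`--config`, `--checkpoint`, `--prompt-wav`, `--text`, `--emotion`, `--language`, `--output`); `build_inference_command` is a hypothetical helper, not part of the repository, and the prompt path is copied from the commands above.

```python
import shlex
import sys


def build_inference_command(text, output, prompt_wav,
                            checkpoint="pretrained_models/pilot_tts_instruct.pt",
                            config="configs/infer_pilot_tts_instruct.yaml",
                            emotion=None, language=None):
    """Assemble an argv list for inference.py using the flags shown above."""
    cmd = [sys.executable, "inference.py",
           "--config", config,
           "--checkpoint", checkpoint,
           "--prompt-wav", prompt_wav,
           "--text", text,
           "--output", output]
    # Optional controls are appended only when requested.
    if emotion is not None:
        cmd += ["--emotion", emotion]
    if language is not None:
        cmd += ["--language", language]
    return cmd


if __name__ == "__main__":
    jobs = [
        {"text": "今天天气真好啊!", "output": "output/happy.wav", "emotion": "happy"},
        {"text": "中不中啊,咱俩一块儿去吃胡辣汤吧", "output": "output/henan.wav",
         "language": "zh-henan"},
    ]
    for job in jobs:
        cmd = build_inference_command(prompt_wav="assert/prompt.wav", **job)
        # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```

Building the command as a list, rather than a single string, avoids shell-quoting problems with the angle brackets and pipes in paralanguage tags.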

Supported Controls

Feature         Usage                   Model
Voice Cloning   Provide prompt audio    Both
Emotions        --emotion <tag>         Instruct
Paralanguage    Insert tags in text     Instruct
Dialects        --language <dialect>    Instruct

Emotions:

Tag       Emotion                 Tag          Emotion
happy     Happy                   sad          Sad
angry     Angry                   surprise     Surprise
fear      Fear                    disgust      Disgust
serious   Serious                 concern      Concern
blue      Blue (melancholy)       disdain      Contempt
neutral   Neutral / calm          psychology   Inner thoughts
unknown   No emotion specified

Paralanguage tags:

Tag                                Description
<|LAUGH|>                          Laughter
<|BREATH|>                         Breathing sound
<|COUGH|>                          Cough
<|CRY|>                            Crying sound
<|LAUGH_SPAN|>...<|/LAUGH_SPAN|>   Wraps text spoken with laughter
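A misspelled tag is easy to miss in a long input string. The sketch below checks a text against the tags in the table above; it is an illustration only. `check_paralanguage_tags` is a hypothetical helper, and how the model itself treats unknown or unbalanced tags is not documented here.

```python
import re

# Tags listed in the paralanguage table.
POINT_TAGS = {"LAUGH", "BREATH", "COUGH", "CRY"}
SPAN_TAGS = {"LAUGH_SPAN"}

# Matches <|NAME|> and <|/NAME|>.
TAG_PATTERN = re.compile(r"<\|(/?)([A-Za-z0-9_]+)\|>")


def check_paralanguage_tags(text):
    """Return a list of problems found in the paralanguage tags of `text`."""
    problems = []
    open_spans = []
    for match in TAG_PATTERN.finditer(text):
        closing = match.group(1) == "/"
        name = match.group(2)
        if name in POINT_TAGS and not closing:
            continue  # a valid standalone tag
        if name in SPAN_TAGS:
            if not closing:
                open_spans.append(name)
            elif open_spans and open_spans[-1] == name:
                open_spans.pop()
            else:
                problems.append(f"unmatched closing tag: {match.group(0)}")
            continue
        problems.append(f"unknown tag: {match.group(0)}")
    # Any span still open at the end of the text was never closed.
    problems.extend(f"unclosed span: <|{name}|>" for name in open_spans)
    return problems


if __name__ == "__main__":
    print(check_paralanguage_tags("这太好笑了<|LAUGH|>停不下来"))
    print(check_paralanguage_tags("太好笑了<|LAUHG|>"))
```

An empty list means every tag is spelled as in the table and every span tag is paired.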

Dialects:

Tag            Dialect             Tag           Dialect
zh-dongbei     Northeastern        zh-shandong   Shandong
zh-henan       Henan               zh-shan1xi    Shanxi
zh-minnan      Minnan (Hokkien)    zh-gansu      Gansu
zh-ningxia     Ningxia             zh-shanghai   Shanghainese
zh-chongqing   Chongqing           zh-hubei      Hubei
zh-hunan       Hunan               zh-jiangxi    Jiangxi
zh-guizhou     Guizhou             zh-yunnan     Yunnan
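Some tags are easy to mistype, for example `zh-shan1xi`. The following sketch validates the emotion and dialect tags from the tables above before they are passed on. `validate_controls` is a hypothetical helper, not part of the repository; it only assumes that `synthesize` and the CLI accept the `emotion` and `language` values shown in the examples.

```python
from difflib import get_close_matches

# Tags copied from the emotion and dialect tables.
EMOTION_TAGS = {
    "happy", "sad", "angry", "surprise", "fear", "disgust", "serious",
    "concern", "blue", "disdain", "neutral", "psychology", "unknown",
}
DIALECT_TAGS = {
    "zh-dongbei", "zh-shandong", "zh-henan", "zh-shan1xi", "zh-minnan",
    "zh-gansu", "zh-ningxia", "zh-shanghai", "zh-chongqing", "zh-hubei",
    "zh-hunan", "zh-jiangxi", "zh-guizhou", "zh-yunnan",
}


def _check(value, allowed, kind):
    """Raise ValueError with a spelling hint if `value` is not an allowed tag."""
    if value in allowed:
        return
    hint = get_close_matches(value, sorted(allowed), n=1)
    suggestion = f" Did you mean '{hint[0]}'?" if hint else ""
    raise ValueError(f"Unsupported {kind} tag '{value}'.{suggestion}")


def validate_controls(emotion=None, language=None):
    """Return keyword arguments containing only validated, non-empty controls."""
    kwargs = {}
    if emotion is not None:
        _check(emotion, EMOTION_TAGS, "emotion")
        kwargs["emotion"] = emotion
    if language is not None:
        _check(language, DIALECT_TAGS, "dialect")
        kwargs["language"] = language
    return kwargs


if __name__ == "__main__":
    print(validate_controls(emotion="happy", language="zh-henan"))
```

The returned dictionary can be unpacked into a call such as `synthesize(engine_instruct, text=..., prompt_wav=..., output_path=..., **validate_controls(emotion="happy"))`.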

WebUI

Launch a Gradio-based interactive interface:

python webui.py --port 9000

Project Structure

pilot-tts/
├── configs/                     # Inference configurations (per checkpoint)
├── demo.py                      # Complete demo (all inference modes)
├── inference.py                 # CLI inference entry
├── webui.py                     # Gradio WebUI
├── asset/                       # Example prompt audio
├── pilot_voice/                 # Core model code
│   ├── engine.py                # InferenceEngine pipeline
│   ├── model.py                 # AR model (Qwen3 backbone + audio tokens)
│   ├── sampling.py              # RAS sampling (from VALL-E 2)
│   ├── utils.py                 # Utilities
│   ├── modules/                 # Conformer + Perceiver modules
│   └── tools/                   # Audio & text processing
├── third_party/
│   ├── cosyvoice/               # Flow-matching vocoder
│   └── Matcha-TTS/              # Flow matching dependency
├── tokenizer/                   # Custom tokenizer with special tokens
├── pretrained_models/           # Model weights (not in git)
└── requirements.txt

Acknowledgements

  • CosyVoice: flow matching and vocoder
  • Qwen3: LLM backbone
  • Matcha-TTS: flow-matching framework
  • MaskGCT: wav2vec2bert feature statistics

Citation

@article{pilottts2025,
      title={PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis},
      author={},
      year={2025},
      journal={arXiv preprint arXiv:xxxx.xxxxx}
}

License

Apache-2.0
