paion-tts-v1

Single-voice emotion-conditioned StyleTTS 2 model trained on Jaiden's recordings for the Paion AI companion project.

Architecture

Base: StyleTTS 2 (LibriTTS pretrained), fine-tuned on 537 voice clips
Decoder: iSTFTNet (mobile-optimized)
Speaker conditioning: 15 "speakers" (12 parent feel tags plus 3 delivery modes)
Output: 24 kHz mono speech; model quantized to 8-bit for mobile

Parent Feel Tags → Speaker IDs

The model conditions on an integer speaker ID. IDs 0-11 each represent one parent feel tag from the Paion Feel Tag Taxonomy; IDs 12-14 select a delivery mode:

Speaker ID	Parent tag
0	tenderness
1	joy
2	sadness
3	fear
4	calm
5	curiosity
6	anger
7	surprise
8	pride
9	relief
10	desire
11	discomfort
12	whisper (delivery mode)
13	soft (delivery mode)
14	urgent (delivery mode)
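
As a minimal sketch, the table above can be held in a plain dictionary for lookup at runtime. The IDs are copied from the table; the names `SPEAKER_IDS`, `DELIVERY_MODES`, and `speaker_id_for` are illustrative and are not taken from `speaker_map.json`, whose exact layout is not shown here.

```python
# Tag -> speaker ID, copied from the table above.
SPEAKER_IDS = {
    "tenderness": 0, "joy": 1, "sadness": 2, "fear": 3,
    "calm": 4, "curiosity": 5, "anger": 6, "surprise": 7,
    "pride": 8, "relief": 9, "desire": 10, "discomfort": 11,
    "whisper": 12, "soft": 13, "urgent": 14,
}

# IDs 12-14 are delivery modes rather than feel tags.
DELIVERY_MODES = {"whisper", "soft", "urgent"}


def speaker_id_for(tag: str) -> int:
    """Return the speaker ID for a tag, ignoring case and outer whitespace."""
    key = tag.strip().lower()
    if key not in SPEAKER_IDS:
        raise KeyError(f"unknown tag: {tag!r}")
    return SPEAKER_IDS[key]
```

For example, `speaker_id_for("fear")` gives 3, matching the table.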

Runtime Pipeline

LLM produces  <feel>worried, curious</feel> Look I am worried about you...
  ↓
Watcher  →  parent tag = "fear"  →  speaker_id = 3  →  response text only (feel tag stripped)
  ↓
StyleTTS 2 ONNX inference (this model)
  ↓
Pitch shift +4.5 semitones + formant shift 1.15× → female-low voice
  ↓
PCM out
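
The watcher step and the pitch-shift figure in the pipeline above can be sketched as follows. This is an illustration under stated assumptions, not the code in `feel_tag_map.py`: the two feel-word entries, the rule that the first recognized feel word wins, and the `calm` fallback are all hypothetical. The semitone-to-ratio conversion is the standard equal-temperament formula.

```python
import re

# Matches a leading <feel>...</feel> tag and any whitespace after it.
FEEL_RE = re.compile(r"^\s*<feel>(.*?)</feel>\s*", re.DOTALL)

# Hypothetical feel-word -> parent-tag entries (the real map is in feel_tag_map.py).
FEEL_TO_PARENT = {"worried": "fear", "curious": "curiosity"}
# Subset of the speaker ID table.
PARENT_TO_SPEAKER = {"fear": 3, "calm": 4, "curiosity": 5}
DEFAULT_PARENT = "calm"  # assumed fallback when no feel word is recognized


def watch(llm_output: str) -> tuple[int, str]:
    """Split LLM output into (speaker_id, response text without the feel tag)."""
    match = FEEL_RE.match(llm_output)
    if match is None:
        return PARENT_TO_SPEAKER[DEFAULT_PARENT], llm_output.strip()
    words = [w.strip().lower() for w in match.group(1).split(",") if w.strip()]
    # Assumption: the first feel word with a known parent tag decides.
    parent = next(
        (FEEL_TO_PARENT[w] for w in words if w in FEEL_TO_PARENT),
        DEFAULT_PARENT,
    )
    return PARENT_TO_SPEAKER[parent], llm_output[match.end():].strip()


def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)
```

With these assumptions, the example line from the diagram resolves to speaker 3 with the tag removed, and `semitones_to_ratio(4.5)` is roughly 1.30, so the +4.5 semitone shift raises the fundamental by about 30%.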

Files

paion_styletts2.onnx — quantized ONNX, ship to mobile
epoch_2nd_*.pth — original PyTorch checkpoint
config_paion.yml — training config
speaker_map.json — parent tag → speaker ID
feel_tag_map.py — feel-word → parent-tag mapper (runtime watcher source)

Training

Hardware: RTX PRO 6000 (Blackwell, 96GB)
Wall time: ~4 hours
Dataset: 537 clips × ~5 sec avg ≈ 45 min total audio
Stages: StyleTTS 2 standard 2-stage (acoustic + SLM adversarial)

Credits

Voice: Jaiden (Jeff's son)
Project: Paion (AI companion expanding human cognition)
Architecture: StyleTTS 2 by Yinghao Aaron Li et al.
