paion-tts-v1
Single-voice emotion-conditioned StyleTTS 2 model trained on Jaiden's recordings for the Paion AI companion project.
Architecture
- Base: StyleTTS 2 (LibriTTS pretrained), fine-tuned on 537 voice clips
- Decoder: iSTFTNet (mobile-optimized)
- Speaker conditioning: 15 emotional "speakers" mapping to Paion's parent feel tags
- Output: 24 kHz mono speech; model quantized to 8-bit for mobile
Parent Feel Tags → Speaker IDs
The model conditions on integer speaker IDs, each representing one parent feel tag from the Paion Feel Tag Taxonomy:
| Speaker ID | Parent tag |
|---|---|
| 0 | tenderness |
| 1 | joy |
| 2 | sadness |
| 3 | fear |
| 4 | calm |
| 5 | curiosity |
| 6 | anger |
| 7 | surprise |
| 8 | pride |
| 9 | relief |
| 10 | desire |
| 11 | discomfort |
| 12 | whisper (delivery mode) |
| 13 | soft (delivery mode) |
| 14 | urgent (delivery mode) |
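The table above can be mirrored as a plain lookup. This is a minimal sketch only: the actual schema of speaker_map.json is not shown here, and the names `SPEAKER_IDS`, `DELIVERY_MODES`, and `speaker_id_for` are hypothetical.

```python
# Hypothetical in-code mirror of the table above.
# The shipped mapping lives in speaker_map.json, whose exact layout may differ.
SPEAKER_IDS = {
    "tenderness": 0,
    "joy": 1,
    "sadness": 2,
    "fear": 3,
    "calm": 4,
    "curiosity": 5,
    "anger": 6,
    "surprise": 7,
    "pride": 8,
    "relief": 9,
    "desire": 10,
    "discomfort": 11,
    "whisper": 12,
    "soft": 13,
    "urgent": 14,
}

# IDs 12-14 are delivery modes rather than emotions.
DELIVERY_MODES = {"whisper", "soft", "urgent"}


def speaker_id_for(tag: str) -> int:
    """Return the integer speaker ID for a parent tag or delivery mode."""
    key = tag.strip().lower()
    if key not in SPEAKER_IDS:
        raise KeyError(f"unknown parent tag: {tag!r}")
    return SPEAKER_IDS[key]
```

Raising on an unknown tag, rather than silently defaulting, is one possible design choice; a runtime might prefer a neutral fallback instead.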
Runtime Pipeline
LLM produces <feel>worried, curious</feel> Look I am worried about you...
↓
Watcher → parent tag = "fear" → speaker_id = 3 → response text only
↓
StyleTTS 2 ONNX inference (this model)
↓
Pitch shift +4.5 semitones + formant shift 1.15× → female-low voice
↓
PCM out
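The watcher step and the pitch-shift arithmetic in the pipeline above might look roughly like the sketch below. It is not the contents of feel_tag_map.py: the names `FEEL_TO_PARENT`, `watch`, and `semitones_to_ratio` are hypothetical, only the "worried" to "fear" mapping comes from the example, and both the first-match rule and the "calm" fallback are assumptions.

```python
import re

# Hypothetical feel-word -> parent-tag entries. Only "worried" -> "fear" is taken
# from the pipeline example; the real table lives in feel_tag_map.py.
FEEL_TO_PARENT = {"worried": "fear", "curious": "curiosity"}

# Subset of the speaker ID table, enough for this sketch.
SPEAKER_IDS = {"fear": 3, "calm": 4, "curiosity": 5}

# Matches the <feel>...</feel> prefix plus any whitespace that follows it.
FEEL_RE = re.compile(r"<feel>(.*?)</feel>\s*", re.DOTALL)


def watch(llm_output: str, default_tag: str = "calm"):
    """Split raw LLM output into (parent_tag, speaker_id, response_text)."""
    parent = default_tag  # assumption: neutral fallback when nothing matches
    match = FEEL_RE.search(llm_output)
    if match:
        words = [w.strip().lower() for w in match.group(1).split(",")]
        # Assumption: the first recognized feel word decides the parent tag.
        for word in words:
            if word in FEEL_TO_PARENT:
                parent = FEEL_TO_PARENT[word]
                break
    # Only the response text, without the feel tag, goes on to synthesis.
    text = FEEL_RE.sub("", llm_output).strip()
    return parent, SPEAKER_IDS[parent], text


def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


parent, speaker_id, text = watch(
    "<feel>worried, curious</feel> Look I am worried about you..."
)
print(parent, speaker_id, text)
```

Under the equal-temperament formula, the +4.5 semitone shift corresponds to a frequency ratio of about 1.297.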
Files
- paion_styletts2.onnx: quantized ONNX, ship to mobile
- epoch_2nd_*.pth: original PyTorch checkpoint
- config_paion.yml: training config
- speaker_map.json: parent tag → speaker ID
- feel_tag_map.py: feel-word → parent-tag mapper (runtime watcher source)
Training
- Hardware: RTX PRO 6000 (Blackwell, 96GB)
- Wall time: ~4 hours
- Dataset: 537 clips × ~5 sec avg ≈ 45 min total audio
- Stages: StyleTTS 2 standard 2-stage (acoustic + SLM adversarial)
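As a quick check of the dataset figure in the list above, the total duration follows from the clip count and the approximate average length. The variable names are illustrative only.

```python
clips = 537
avg_seconds = 5.0  # approximate average clip length from the dataset line

# Total audio in minutes: clip count times average length, converted from seconds.
total_minutes = clips * avg_seconds / 60
print(f"{total_minutes:.2f} min")
```

The result, a little under 45 minutes, agrees with the rounded total quoted for the dataset.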
Credits
- Voice: Jaiden (Jeff's son)
- Project: Paion (AI companion expanding human cognition)
- Architecture: StyleTTS 2 by Yinghao Aaron Li et al.