IaraTTS SFT v4 — 15 epochs + 3x tag upsampling

Eval results (60-prompt pt-BR holdout, Whisper-base round-trip)

Model                 WER      Δ vs v3
SFT v3 (10 epochs)    0.1928   baseline
SFT v4 (this)         0.1646   −14.6% relative
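
For readers who want to reproduce the metric, the following is a minimal sketch of a round-trip WER computation: synthesized audio is transcribed by an ASR model and the transcript is compared with the prompt text. The card does not say how text was normalized or whether WER was pooled over the 60 prompts or averaged per prompt, so the lowercase-and-strip-punctuation normalization and the pooled (corpus-level) aggregation below are assumptions, and all function names are illustrative.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split into words.

    This normalization is an assumption; the card does not describe one.
    """
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """Pooled WER: total word errors divided by total reference words."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return errors / words


def relative_change(new: float, baseline: float) -> float:
    """Signed relative change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline


# The "Δ vs v3" column follows from the two WER values in the table.
print(f"{relative_change(0.1646, 0.1928):+.1%}")
```

The last line reproduces the relative figure in the table from the two reported WER values: (0.1646 − 0.1928) / 0.1928.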

Training

  • 7449 samples (3579 original + 3870 upsampled tagged)
  • 3x upsampling of samples tagged with <laugh>, <sigh>, <yawn>, <gasp>, <groan>, <chuckle>, <cough>, <sniffle>
  • Vocab 16384 → 16392 (8 tag tokens added, each mapped to a single id)
  • 15 epochs, 3495 steps; lr 5e-5, cosine schedule with 5% warmup; bf16 with SDPA attention
  • ~70 min on an RTX 4090, ~$0.30 in GPU cost
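
The learning-rate schedule listed above can be sketched as a linear warmup followed by cosine decay. This is an illustration rather than the training script: the step count, peak rate and warmup fraction come from the card, while the decay-to-zero floor and the rounding of the warmup length are assumptions. As a side note, 3495 steps over 15 epochs is 233 steps per epoch, which would be consistent with an effective batch size of 32 on 7449 samples (ceil(7449 / 32) = 233), though the card does not state the batch size.

```python
import math

TOTAL_STEPS = 3495       # from the card
EPOCHS = 15              # from the card
PEAK_LR = 5e-5           # from the card
WARMUP_FRACTION = 0.05   # from the card

STEPS_PER_EPOCH = TOTAL_STEPS // EPOCHS  # 233


def lr_at(step: int,
          total_steps: int = TOTAL_STEPS,
          peak_lr: float = PEAK_LR,
          warmup_fraction: float = WARMUP_FRACTION) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay.

    Assumptions: warmup length is rounded down, and the rate decays
    all the way to zero at the final step.
    """
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the rate at the start, the end of warmup, the midpoint and the end.
warmup_end = int(TOTAL_STEPS * WARMUP_FRACTION)
for s in (0, warmup_end, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(s, f"{lr_at(s):.2e}")
```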

Inference

from transformers import AutoModel, AutoTokenizer

repo = "marcosremar2/iaratts-sft-v4"
# trust_remote_code=True lets transformers run the custom code shipped in the repo.
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
# Each tag (e.g. <laugh>) encodes to a single token id:
ids = tokenizer.encode("<laugh> Que dia! Estou exausto.", add_special_tokens=False)
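
The single-token claim can be checked with a small helper like the one below. This is a sketch: `tag_token_counts` and `multi_token_tags` are hypothetical names, not part of the model repo, and they accept any callable that maps a string to a list of token ids, so they do not depend on a particular tokenizer class.

```python
from typing import Callable, Dict, List

# The eight tags listed in the Training section.
TAGS = ["<laugh>", "<sigh>", "<yawn>", "<gasp>",
        "<groan>", "<chuckle>", "<cough>", "<sniffle>"]


def tag_token_counts(encode: Callable[[str], List[int]],
                     tags: List[str] = TAGS) -> Dict[str, int]:
    """Map each tag to the number of token ids `encode` produces for it."""
    return {tag: len(encode(tag)) for tag in tags}


def multi_token_tags(encode: Callable[[str], List[int]],
                     tags: List[str] = TAGS) -> List[str]:
    """Return the tags that do NOT encode to exactly one token id."""
    return [tag for tag, n in tag_token_counts(encode, tags).items() if n != 1]
```

To run it against the real model, pass a wrapper around the `encode` method of the tokenizer loaded above, called with `add_special_tokens=False`. An empty result from `multi_token_tags` would be consistent with the card's description.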

CLI via MOSS infer.py:

python infer.py --checkpoint marcosremar2/iaratts-sft-v4 \
  --audio-tokenizer-pretrained-name-or-path OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano \
  --text "<sigh> Encontrei um erro no código." \
  --output-audio-path out.wav --mode continuation --seed 42
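
To synthesize several prompts in one go, the command above can be assembled programmatically. The sketch below only builds and prints the command lines, using the flags shown in the card; `build_infer_command` and the output file names are illustrative, and it assumes `infer.py` from the upstream repository is in the current directory.

```python
import shlex

CHECKPOINT = "marcosremar2/iaratts-sft-v4"
AUDIO_TOKENIZER = "OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano"


def build_infer_command(text: str, output_path: str, seed: int = 42) -> list[str]:
    """Build the argument list for one infer.py call (flags as in the card)."""
    return [
        "python", "infer.py",
        "--checkpoint", CHECKPOINT,
        "--audio-tokenizer-pretrained-name-or-path", AUDIO_TOKENIZER,
        "--text", text,
        "--output-audio-path", output_path,
        "--mode", "continuation",
        "--seed", str(seed),
    ]


# Example prompts taken from the card; the output names are hypothetical.
prompts = {
    "laugh": "<laugh> Que dia! Estou exausto.",
    "sigh": "<sigh> Encontrei um erro no código.",
}

for name, text in prompts.items():
    # shlex.join quotes the text so the tag brackets survive the shell.
    print(shlex.join(build_infer_command(text, f"{name}.wav")))
```

Each list can also be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting altogether.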

Roadmap position

Phase 4 wave complete. Next: Phase 4.4 (IndexTTS2 instruction LM) or Phase 5.5 (CosyVoice-2 distill to 150M streaming).

License: MIT (same as upstream MOSS-TTS-Nano).
