IaraTTS SFT v4 — 15 epochs + 3x tag upsampling

Eval results (60-prompt pt-BR holdout, Whisper-base round-trip)

Model	WER	Δ vs v3
SFT v3 (10 epochs)	0.1928	baseline
SFT v4 (this)	0.1646	−14.6% relative
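
For readers who want to reproduce the numbers above, here is a minimal sketch of a word-level WER and of the relative delta in the table. The normalization (lowercasing, whitespace split) and the function names are assumptions, not necessarily what the actual evaluation script does; the card also does not say whether WER is averaged per prompt or pooled over the holdout.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length.

    Normalization here (lowercase + whitespace split) is an assumption.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_change(baseline: float, new: float) -> float:
    """Signed relative change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline


# Table check: v3 -> v4 on the 60-prompt holdout.
delta_pct = round(relative_change(0.1928, 0.1646) * 100, 1)
print(delta_pct)  # -14.6
```

In a round-trip setup, `reference` would be the prompt text and `hypothesis` the Whisper transcript of the synthesized audio.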

Training

7449 samples (3579 original + 3870 upsampled tagged)
3x upsampling on <laugh>, <sigh>, <yawn>, <gasp>, <groan>, <chuckle>, <cough>, <sniffle>
Vocab 16384 → 16392 (the 8 tags above added as special tokens, one id each)
15 epochs, 3495 steps, lr 5e-5 with cosine decay and 5% warmup, bf16, SDPA attention
~70 min on RTX 4090, ~$0.30 GPU cost
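
The step count and schedule above can be sanity-checked with a short sketch. The effective batch size of 32 is not stated in this card; it is inferred because it is the value that reproduces 3495 steps. The schedule shape (linear warmup, cosine decay to zero) is also an assumption about what "cosine + warmup" means here.

```python
import math

TOTAL_SAMPLES = 3579 + 3870   # original + upsampled tagged samples
EPOCHS = 15
EFFECTIVE_BATCH = 32          # assumption: inferred, not stated in the card
PEAK_LR = 5e-5
WARMUP_FRACTION = 0.05

steps_per_epoch = math.ceil(TOTAL_SAMPLES / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(total_steps * WARMUP_FRACTION)


def lr_at(step: int) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(TOTAL_SAMPLES, steps_per_epoch, total_steps, warmup_steps)
```

With these assumptions the sketch gives 7449 samples, 233 steps per epoch, 3495 total steps, and about 175 warmup steps.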

Inference

from transformers import AutoModel, AutoTokenizer
m = AutoModel.from_pretrained("marcosremar2/iaratts-sft-v4", trust_remote_code=True)
t = AutoTokenizer.from_pretrained("marcosremar2/iaratts-sft-v4", trust_remote_code=True)
# Each tag is encoded as a single token id:
ids = t.encode("<laugh> Que dia! Estou exausto.", add_special_tokens=False)
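
To check the single-id claim for all eight tags, a small helper like the following could be used. `multi_token_tags` is a hypothetical name, and the helper takes any `encode` callable so it does not depend on a particular tokenizer API.

```python
TAGS = ["<laugh>", "<sigh>", "<yawn>", "<gasp>",
        "<groan>", "<chuckle>", "<cough>", "<sniffle>"]


def multi_token_tags(encode, tags=TAGS):
    """Return the tags that `encode` does NOT map to exactly one token id."""
    return [tag for tag in tags if len(encode(tag)) != 1]
```

With the tokenizer loaded as `t` above, `multi_token_tags(lambda s: t.encode(s, add_special_tokens=False))` should return an empty list if every tag is a single token.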

CLI via MOSS infer.py:

python infer.py --checkpoint marcosremar2/iaratts-sft-v4 \
  --audio-tokenizer-pretrained-name-or-path OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano \
  --text "<sigh> Encontrei um erro no código." \
  --output-audio-path out.wav --mode continuation --seed 42
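
To render one clip per tag, the command above can be wrapped in a small script. This is only a sketch: it reuses exactly the flags shown above, assumes `infer.py` is in the current directory, and the function names and output file naming are hypothetical.

```python
import subprocess

TAGS = ["<laugh>", "<sigh>", "<yawn>", "<gasp>",
        "<groan>", "<chuckle>", "<cough>", "<sniffle>"]


def build_infer_command(text, output_path, seed=42):
    """Build the argv list for the infer.py call shown above."""
    return [
        "python", "infer.py",
        "--checkpoint", "marcosremar2/iaratts-sft-v4",
        "--audio-tokenizer-pretrained-name-or-path",
        "OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano",
        "--text", text,
        "--output-audio-path", output_path,
        "--mode", "continuation",
        "--seed", str(seed),
    ]


def synthesize_all_tags(sentence="Encontrei um erro no código."):
    """Run infer.py once per tag, writing e.g. laugh.wav, sigh.wav, ..."""
    for tag in TAGS:
        name = tag.strip("<>")
        subprocess.run(build_infer_command(f"{tag} {sentence}", f"{name}.wav"),
                       check=True)
```

Passing the arguments as a list avoids shell quoting problems with the angle brackets in the tags.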

Roadmap position

Phase 4 wave complete. Next: Phase 4.4 (IndexTTS2 instruction LM) or Phase 5.5 (CosyVoice-2 distill to 150M streaming).

License: MIT (same as upstream MOSS-TTS-Nano).

Base model: OpenMOSS-Team/MOSS-TTS-Nano-100M