IaraTTS — SFT v3 (PoC MVP, paralinguistic tags)

Brazilian Portuguese TTS with paralinguistic tag rendering — full SFT of MOSS-TTS-Nano-100M with vocab extended for nonverbal tokens.

What's new vs v1/v2

Special tokens added: <sigh>, <laugh>, <yawn>, <gasp>, <groan>, <chuckle>, <cough>, <sniffle> registered as single token IDs (16384–16391). Tokenizer no longer fragments them into BPE pieces.
Trained for 10 epochs at lr=5e-5 (vs. 3 epochs at lr=5e-5 in v1 and 5 epochs at lr=3e-5 in v2): 1530 steps total, ~30 min on an RTX 4090.
Loss trajectory: 5.02 → 4.51 (lower than v2's 4.84).
Coverage: 4869 train pairs, 51% with instruction/tag conditioning.

Qualitative test (9 demo prompts)

User feedback: in v3 the nonverbal sounds (sigh, laugh, yawn, gasp, chuckle, groan) render audibly. Quality is partial, but the tag→sound mapping works. v1 and v2 produced speech only, with no sound for the tag.

Training

| Hyperparam | Value |
|---|---|
| Base model | `OpenMOSS-Team/MOSS-TTS-Nano-100M` (vocab extended to 16392) |
| Codec | `OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano` |
| Dataset | `marcosremar2/gemini-dataset-erinome` (4869 pairs after intersection) |
| per_device_batch_size | 8 |
| gradient_accumulation_steps | 4 |
| global_batch_size | 32 |
| epochs | 10 (1530 steps) |
| learning_rate | 5e-5 cosine, warmup 5% |
| mixed_precision | bf16 |
| attn_implementation | sdpa |
| GPU | RTX 4090 (Vast.ai, ~$0.07/hr) |
| Wall time | ~30 min |
| Loss | 5.02 → 4.51 |
| Total cost | ~$0.04 |
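
The step count and cost above follow from the other values in the table. A minimal sketch of the arithmetic, assuming a single GPU and that the final partial batch of each epoch is kept rather than dropped (neither is stated explicitly above):

```python
import math

# Values taken from the table above.
train_pairs = 4869
per_device_batch = 8
grad_accum = 4
epochs = 10
hourly_rate_usd = 0.07   # Vast.ai price quoted above
wall_time_hours = 0.5    # ~30 min

num_gpus = 1  # assumption: single RTX 4090

global_batch = per_device_batch * grad_accum * num_gpus
# Assumption: the last, smaller batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(train_pairs / global_batch)
total_steps = steps_per_epoch * epochs
cost_usd = hourly_rate_usd * wall_time_hours

print(global_batch, steps_per_epoch, total_steps)  # 32 153 1530
print(f"${cost_usd:.3f}")                          # $0.035
```

The result, $0.035, rounds to the ~$0.04 reported in the table.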

Special token integration

The MOSS-TTS-Nano text_lm_head layer was resized from (16384, 768) to (16392, 768). The new embeddings were initialized by sampling from a multivariate normal distribution fitted to the existing embeddings (their mean and covariance), following the Hewitt (2021) vocabulary-expansion technique. Both vocab_size and the nested gpt2_config.vocab_size were updated in config.json.
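
The initialization described above can be sketched in NumPy. This is an illustration rather than the training code: the function name `expand_embeddings` is hypothetical, and the demo uses a toy (64, 8) matrix instead of the real (16384, 768) weights so it runs quickly.

```python
import numpy as np

def expand_embeddings(old: np.ndarray, num_new: int, seed: int = 0) -> np.ndarray:
    """Append num_new rows sampled from a normal fitted to the existing rows."""
    rng = np.random.default_rng(seed)
    mean = old.mean(axis=0)            # (dim,)
    cov = np.cov(old, rowvar=False)    # (dim, dim), rows are observations
    new_rows = rng.multivariate_normal(mean, cov, size=num_new)
    # Existing rows are kept untouched; only the new rows are sampled.
    return np.vstack([old, new_rows.astype(old.dtype)])

# Toy stand-in for the (16384, 768) embedding matrix.
old = np.random.default_rng(1).normal(size=(64, 8)).astype(np.float32)
expanded = expand_embeddings(old, num_new=8)
print(expanded.shape)  # (72, 8)
```

Sampling around the mean of the existing embeddings keeps the new tokens inside the region the model already uses, so they do not start as outliers that dominate or vanish from the output distribution.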

Open issues / next iterations

Tag rendering quality: tags render at a rate of roughly 50–60%, not 100%. Likely needs more epochs (15–20) or upsampling of tagged examples in the training set.
WER on the holdout set has not yet been measured for v3 (eval rounds of 30–60 prompts are pending).
Phase 3.6 (Meta Quest viseme stream) and Phase 3.7 (style continuity hybrid) still to implement.
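
One of the options listed above, upsampling tagged examples, could look roughly like the sketch below. It has not been tried on this model. The function name `upsample_tagged`, the `factor` parameter, and the assumption that training texts are plain strings containing the literal tags are all illustrative.

```python
import re

# The eight tags registered as special tokens in v3.
TAG_PATTERN = re.compile(r"<(sigh|laugh|yawn|gasp|groan|chuckle|cough|sniffle)>")

def upsample_tagged(texts: list[str], factor: int = 2) -> list[str]:
    """Repeat every text containing a paralinguistic tag `factor` times."""
    out = []
    for text in texts:
        copies = factor if TAG_PATTERN.search(text) else 1
        out.extend([text] * copies)
    return out

sample = ["<sigh> Que cansaço.", "Bom dia.", "Encontrei um erro <laugh> no código."]
upsampled = upsample_tagged(sample, factor=3)
print(len(upsampled))  # 7
```

In a real pipeline the same duplication would apply to the (text, audio) pairs rather than to the text alone.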

Inference

from transformers import AutoModel, AutoTokenizer
m = AutoModel.from_pretrained("marcosremar2/iaratts-sft-v3", trust_remote_code=True)
t = AutoTokenizer.from_pretrained("marcosremar2/iaratts-sft-v3", trust_remote_code=True)

# tags work as single tokens now:
ids = t.encode("<sigh> Que cansaço.", add_special_tokens=False)
# -> [16384, ...]  (<sigh> is the single ID 16384; the remaining IDs encode " Que cansaço.")
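
To confirm that no tag is split into BPE pieces, each tag can be encoded on its own and checked against the reserved ID range. The helper below is a sketch: `fragmented_tags` is a hypothetical name, and `fake_encode` is a stand-in encoder so the example runs without downloading the model.

```python
from typing import Callable, Iterable

TAGS = ["<sigh>", "<laugh>", "<yawn>", "<gasp>",
        "<groan>", "<chuckle>", "<cough>", "<sniffle>"]
TAG_ID_RANGE = range(16384, 16392)  # IDs 16384–16391, as stated above

def fragmented_tags(encode: Callable[[str], list[int]],
                    tags: Iterable[str] = TAGS) -> list[str]:
    """Return the tags that do not encode to exactly one ID in the reserved range."""
    bad = []
    for tag in tags:
        ids = encode(tag)
        if len(ids) != 1 or ids[0] not in TAG_ID_RANGE:
            bad.append(tag)
    return bad

# Stand-in encoder (hypothetical): known tags get one ID, anything else is
# split per character, mimicking fragmentation.
def fake_encode(text: str) -> list[int]:
    return [16384 + TAGS.index(text)] if text in TAGS else [ord(c) for c in text]

print(fragmented_tags(fake_encode))             # []
print(fragmented_tags(fake_encode, ["<hum>"]))  # ['<hum>']
```

With the real tokenizer `t` loaded as shown earlier, the call would be `fragmented_tags(lambda s: t.encode(s, add_special_tokens=False))`. An empty list means every tag is a single token.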

CLI via MOSS infer.py:

python infer.py --checkpoint marcosremar2/iaratts-sft-v3 \
  --audio-tokenizer-pretrained-name-or-path OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano \
  --text "<sigh> Encontrei um erro no código." \
  --output-audio-path out.wav --mode continuation --seed 42

v1: marcosremar2/iaratts-sft-v1 — Phase 2 baseline (3 epochs, no tags)
Roadmap: marcosremar2/iaratts-roadmap

License

MIT — same as upstream MOSS-TTS-Nano.
