IaraTTS — SFT v3 (PoC MVP, paralinguistic tags)

Brazilian Portuguese TTS with paralinguistic tag rendering: a full SFT of MOSS-TTS-Nano-100M with the vocabulary extended for nonverbal tokens.

What's new vs v1/v2

  • Special tokens added: <sigh>, <laugh>, <yawn>, <gasp>, <groan>, <chuckle>, <cough>, <sniffle> are registered as single token IDs (16384–16391). The tokenizer no longer fragments them into BPE pieces.
  • Trained for 10 epochs at lr=5e-5 (vs 3 epochs at lr=5e-5 in v1 and 5 epochs at lr=3e-5 in v2). 1530 steps total, ~30 min on an RTX 4090.
  • Loss trajectory: 5.02 → 4.51 (lower than v2's 4.84).
  • Coverage: 4869 train pairs, 51% of them with instruction/tag conditioning.
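The single-token claim above can be checked mechanically. The sketch below assumes the tags were assigned IDs in the order they are listed (only <sigh> → 16384 is confirmed by the inference example further down); `fragmented_tags` and its `encode` argument are hypothetical helpers, not part of the released code.

```python
# Tags in the order listed above. ASSUMPTION: IDs were assigned in this order.
TAGS = ["<sigh>", "<laugh>", "<yawn>", "<gasp>",
        "<groan>", "<chuckle>", "<cough>", "<sniffle>"]
BASE_VOCAB_SIZE = 16384  # vocabulary size before the extension

# Expected ID for each tag: 16384 .. 16391
TAG_TO_ID = {tag: BASE_VOCAB_SIZE + i for i, tag in enumerate(TAGS)}


def fragmented_tags(encode):
    """Return the tags that `encode` does not map to exactly one expected ID.

    `encode` is any callable taking a string and returning a list of token IDs.
    """
    return [tag for tag in TAGS if list(encode(tag)) != [TAG_TO_ID[tag]]]
```

With the tokenizer `t` from the Inference section, `fragmented_tags(lambda s: t.encode(s, add_special_tokens=False))` would be expected to return an empty list if the registration worked and the ordering assumption holds.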

Qualitative test (9 demo prompts)

User feedback: in v3 the nonverbal sounds (sigh, laugh, yawn, gasp, chuckle, groan) render audibly. Quality is partial, not perfect, but the tag→sound mapping is working. v1 and v2 produced only speech, without the tag sound.

Training

| Hyperparam | Value |
| --- | --- |
| Base model | OpenMOSS-Team/MOSS-TTS-Nano-100M (vocab extended to 16392) |
| Codec | OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano |
| Dataset | marcosremar2/gemini-dataset-erinome (4869 pairs after intersection) |
| per_device_batch_size | 8 |
| gradient_accumulation_steps | 4 |
| global_batch_size | 32 |
| epochs | 10 (1530 steps) |
| learning_rate | 5e-5 cosine, warmup 5% |
| mixed_precision | bf16 |
| attn_implementation | sdpa |
| GPU | RTX 4090 (Vast.ai, ~$0.07/hr) |
| Wall time | ~30 min |
| Loss | 5.02 → 4.51 |
| Total cost | ~$0.04 |
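
The step count and cost follow from the other values in the table. A quick check, assuming a single GPU and that the last partial batch of each epoch counts as a step (neither is stated explicitly above):

```python
import math

train_pairs = 4869
per_device_batch_size = 8
gradient_accumulation_steps = 4
epochs = 10

# ASSUMPTION: one GPU, so global batch = per-device batch * accumulation steps
global_batch_size = per_device_batch_size * gradient_accumulation_steps

# ASSUMPTION: the final partial batch is kept, hence ceil
steps_per_epoch = math.ceil(train_pairs / global_batch_size)
total_steps = steps_per_epoch * epochs

hourly_rate_usd = 0.07   # Vast.ai rate from the table
wall_time_hours = 0.5    # ~30 min
cost_usd = hourly_rate_usd * wall_time_hours

print(global_batch_size, steps_per_epoch, total_steps, round(cost_usd, 3))
```

This gives a global batch of 32, 153 steps per epoch, 1530 steps in total, and a cost of about $0.035, which rounds to the ~$0.04 reported.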

Special token integration

The MOSS-TTS-Nano text_lm_head layer was resized from (16384, 768) to (16392, 768). New embeddings were initialized by sampling from a multivariate normal fitted to the existing embeddings (their mean and covariance), following the Hewitt 2021 vocab expansion technique. Both vocab_size and the nested gpt2_config.vocab_size were updated in config.json.
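
A minimal NumPy sketch of this kind of initialization, on a toy matrix rather than the real (16384, 768) weights. The function name `extend_embeddings` is hypothetical, and the actual training script may differ (for example, some implementations scale the covariance down before sampling; the description above does not say).

```python
import numpy as np


def extend_embeddings(emb, n_new, seed=0):
    """Append n_new rows sampled from N(mean, cov) of the existing rows."""
    rng = np.random.default_rng(seed)
    mean = emb.mean(axis=0)              # (dim,)
    cov = np.cov(emb, rowvar=False)      # (dim, dim), columns are variables
    new_rows = rng.multivariate_normal(mean, cov, size=n_new)  # (n_new, dim)
    # Existing rows are left untouched; new rows are appended at the end.
    return np.concatenate([emb, new_rows.astype(emb.dtype)], axis=0)


# Toy stand-in for the embedding matrix (200 tokens, 16 dims).
old = np.random.default_rng(1).normal(size=(200, 16)).astype(np.float32)
extended = extend_embeddings(old, n_new=8)
print(old.shape, "->", extended.shape)
```

Sampling around the mean of the existing rows keeps the new tokens' initial logits close to those of an average token, which is the motivation usually given for this approach over random initialization.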

Open issues / next iterations

  • Tag rendering quality: tags render audibly only ~50–60% of the time, not 100%. Likely needs more epochs (15–20) or upsampling of tagged examples in the training set.
  • WER on the holdout set has not yet been measured for v3 (evaluation rounds of 30–60 prompts are pending).
  • Phase 3.6 (Meta Quest viseme stream) and Phase 3.7 (style continuity hybrid) still to implement.
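
The upsampling idea from the first item could look roughly like the sketch below. The `(text, audio_path)` pair format, the `upsample_tagged` helper, and the repeat factor are all assumptions for illustration; the dataset's real schema is not described here.

```python
import random
import re

# Matches any of the eight paralinguistic tags listed in this card.
TAG_PATTERN = re.compile(
    r"<(sigh|laugh|yawn|gasp|groan|chuckle|cough|sniffle)>"
)


def upsample_tagged(pairs, factor=2, seed=0):
    """Repeat each tagged example `factor` times, keep untagged ones once."""
    out = []
    for text, audio_path in pairs:
        repeats = factor if TAG_PATTERN.search(text) else 1
        out.extend([(text, audio_path)] * repeats)
    random.Random(seed).shuffle(out)  # avoid runs of identical examples
    return out


# Hypothetical examples: two tagged, one untagged.
pairs = [
    ("<sigh> Que cansaço.", "a.wav"),
    ("Bom dia.", "b.wav"),
    ("<laugh> Que engraçado!", "c.wav"),
]
print(len(upsample_tagged(pairs, factor=3)))  # 3 + 1 + 3 = 7
```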

Inference

from transformers import AutoModel, AutoTokenizer

# Load the fine-tuned model and its extended tokenizer.
m = AutoModel.from_pretrained("marcosremar2/iaratts-sft-v3", trust_remote_code=True)
t = AutoTokenizer.from_pretrained("marcosremar2/iaratts-sft-v3", trust_remote_code=True)

# Tags now encode as single tokens:
ids = t.encode("<sigh> Que cansaço.", add_special_tokens=False)
# -> [16384, ...]  (<sigh> is the single ID 16384; the remaining IDs encode " Que cansaço.")

CLI via MOSS infer.py:

python infer.py --checkpoint marcosremar2/iaratts-sft-v3 \
  --audio-tokenizer-pretrained-name-or-path OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano \
  --text "<sigh> Encontrei um erro no código." \
  --output-audio-path out.wav --mode continuation --seed 42

License

MIT — same as upstream MOSS-TTS-Nano.
