IaraTTS — SFT v3 (PoC MVP, paralinguistic tags)
Brazilian Portuguese TTS with paralinguistic tag rendering — full SFT of MOSS-TTS-Nano-100M with vocab extended for nonverbal tokens.
What's new vs v1/v2
- Special tokens added: <sigh>, <laugh>, <yawn>, <gasp>, <groan>, <chuckle>, <cough>, <sniffle> are registered as single token IDs (16384–16391), so the tokenizer no longer fragments them into BPE pieces.
- Trained for 10 epochs at lr=5e-5 (vs 3 epochs at lr=5e-5 in v1 and 5 epochs at lr=3e-5 in v2): 1530 steps total, ~30 min on an RTX 4090.
- Loss trajectory: 5.02 → 4.51 (better than v2's 4.84).
- Coverage: 4869 train pairs, 51% with instruction/tag conditioning.
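To make the single-token behaviour concrete, here is a minimal sketch in plain Python of how the eight tags could map onto the ID range 16384–16391 and how a prompt can be split so each tag stays whole. It does not use the real tokenizer. The assumption that IDs follow the order in which the tags are listed is mine; only `<sigh>` = 16384 is shown in the inference example further down. `split_on_tags` is a hypothetical helper, not part of MOSS-TTS-Nano.

```python
import re

# The eight paralinguistic tags named in this model card.
TAGS = ["<sigh>", "<laugh>", "<yawn>", "<gasp>",
        "<groan>", "<chuckle>", "<cough>", "<sniffle>"]

FIRST_TAG_ID = 16384  # first ID after the original 16384-entry vocabulary

# ASSUMPTION: IDs are assigned in listing order. Check the real tokenizer
# (e.g. tokenizer.convert_tokens_to_ids) before relying on this mapping.
TAG_TO_ID = {tag: FIRST_TAG_ID + i for i, tag in enumerate(TAGS)}

# A capturing group makes re.split keep the tags in its output.
TAG_PATTERN = re.compile("(" + "|".join(re.escape(t) for t in TAGS) + ")")


def split_on_tags(text):
    """Split text into tag pieces and plain-text pieces, dropping empty strings."""
    return [piece for piece in TAG_PATTERN.split(text) if piece]


pieces = split_on_tags("<sigh> Que cansaço.")
print(pieces)                               # ['<sigh>', ' Que cansaço.']
print([TAG_TO_ID.get(p) for p in pieces])   # [16384, None]
```

In the real tokenizer the plain-text piece would still go through BPE; the sketch only shows that each tag is looked up as a unit instead of being broken apart.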
Qualitative test (9 demo prompts)
User feedback: nonverbal sounds (sigh/laugh/yawn/gasp/chuckle/groan) render audibly in v3 — partial quality, not perfect, but the tag→sound mapping is working. v1/v2 produced only speech without the tag sound.
Training
| Hyperparam | Value |
|---|---|
| Base model | OpenMOSS-Team/MOSS-TTS-Nano-100M (vocab extended to 16392) |
| Codec | OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano |
| Dataset | marcosremar2/gemini-dataset-erinome (4869 pairs after intersection) |
| per_device_batch_size | 8 |
| gradient_accumulation_steps | 4 |
| global_batch_size | 32 |
| epochs | 10 (1530 steps) |
| learning_rate | 5e-5 cosine, warmup 5% |
| mixed_precision | bf16 |
| attn_implementation | sdpa |
| GPU | RTX 4090 (Vast.ai, ~$0.07/hr) |
| Wall time | ~30 min |
| Loss | 5.02 → 4.51 |
| Total cost | ~$0.04 |
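The step count and cost in the table can be reproduced with a few lines of arithmetic. The sketch below makes three assumptions that are not stated in the table: the last partial batch of each epoch is kept, warmup is linear, and the cosine decay goes all the way to zero. `lr_at` is a hypothetical helper, not the trainer's actual scheduler.

```python
import math

train_pairs = 4869
global_batch = 8 * 4          # per_device_batch_size * gradient_accumulation_steps
epochs = 10
peak_lr = 5e-5

# ASSUMPTION: the final partial batch of each epoch is kept (no drop_last).
steps_per_epoch = math.ceil(train_pairs / global_batch)   # 153
total_steps = steps_per_epoch * epochs                    # 1530
warmup_steps = int(0.05 * total_steps)                    # 76


def lr_at(step, peak=peak_lr):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


# ~30 min at ~$0.07/hr, which the table rounds to ~$0.04.
cost_usd = 0.5 * 0.07

print(steps_per_epoch, total_steps, warmup_steps)
print(lr_at(0), lr_at(warmup_steps), lr_at(total_steps))
```

With these assumptions, 4869 pairs at a global batch of 32 give 153 steps per epoch, which matches the 1530 total steps reported for 10 epochs.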
Special token integration
The MOSS-TTS-Nano text_lm_head layer was resized from (16384, 768) to (16392, 768). The new embedding rows were initialized by sampling from a multivariate normal distribution fitted to the mean and covariance of the existing embeddings (the vocabulary-expansion technique described by Hewitt, 2021). Both vocab_size and the nested gpt2_config.vocab_size were updated in config.json.
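The initialization described above can be sketched with NumPy on toy shapes. This is an illustration of the general idea (fit a Gaussian to the existing rows, sample new rows from it), not the code used for this checkpoint. `extend_embeddings` is a hypothetical helper, and whether the covariance was scaled down before sampling is not stated here, so the sketch uses it unscaled.

```python
import numpy as np


def extend_embeddings(old: np.ndarray, num_new: int, seed: int = 0) -> np.ndarray:
    """Append num_new rows sampled from a Gaussian fitted to the existing rows."""
    rng = np.random.default_rng(seed)
    mean = old.mean(axis=0)              # shape: (hidden,)
    cov = np.cov(old, rowvar=False)      # shape: (hidden, hidden)
    # ASSUMPTION: unscaled covariance; some implementations shrink it first.
    new_rows = rng.multivariate_normal(mean, cov, size=num_new)
    return np.vstack([old, new_rows.astype(old.dtype)])


# Toy sizes so the sketch runs quickly; the real resize is
# (16384, 768) to (16392, 768).
old = np.random.default_rng(1).normal(size=(256, 16)).astype(np.float32)
extended = extend_embeddings(old, num_new=8)
print(extended.shape)  # (264, 16)
```

The existing rows are left untouched, so the pretrained behaviour for the original vocabulary is preserved and only the eight new rows start from sampled values.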
Open issues / next iterations
- Tag rendering quality: tags render at a rate of roughly 50–60%, not 100%. This likely needs more epochs (15–20) or upsampling of tagged examples in the training set.
- WER on the holdout set has not yet been measured for v3 (evaluation rounds of 30–60 prompts are pending).
- Phase 3.6 (Meta Quest viseme stream) and Phase 3.7 (style continuity hybrid) still to implement.
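One of the options listed above, upsampling tagged examples, could look like the following sketch. It is an illustration only: the `text` field name, the `factor` value, and the `upsample_tagged` helper are all hypothetical, and the actual dataset schema may differ.

```python
import random

TAGS = ("<sigh>", "<laugh>", "<yawn>", "<gasp>",
        "<groan>", "<chuckle>", "<cough>", "<sniffle>")


def has_tag(text):
    """True if the prompt contains at least one paralinguistic tag."""
    return any(tag in text for tag in TAGS)


def upsample_tagged(pairs, factor=2, seed=0):
    """Return a shuffled copy of pairs in which tagged examples appear `factor` times."""
    out = []
    for pair in pairs:
        copies = factor if has_tag(pair["text"]) else 1
        out.extend([pair] * copies)
    random.Random(seed).shuffle(out)  # fixed seed keeps the order reproducible
    return out


# Toy data; "text" is an assumed field name.
pairs = [
    {"text": "<sigh> Que cansaço."},
    {"text": "Bom dia."},
    {"text": "<laugh> Isso foi ótimo."},
    {"text": "Até logo."},
    {"text": "Tudo bem?"},
]
upsampled = upsample_tagged(pairs, factor=3)
print(len(upsampled))  # 9: two tagged examples x3, plus three untagged
```

Duplicating examples raises the share of tag-conditioned steps per epoch without changing the data itself; a weighted sampler would be an alternative with the same effect.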
Inference
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True lets Transformers run the custom code referenced by the checkpoint
model = AutoModel.from_pretrained("marcosremar2/iaratts-sft-v3", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("marcosremar2/iaratts-sft-v3", trust_remote_code=True)
# Tags are now encoded as single tokens:
ids = tokenizer.encode("<sigh> Que cansaço.", add_special_tokens=False)
# -> [16384, ..., space, ..., Que, ..., cansaço, ...]  (<sigh> is the single ID 16384)
CLI via MOSS infer.py:
python infer.py --checkpoint marcosremar2/iaratts-sft-v3 \
--audio-tokenizer-pretrained-name-or-path OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano \
--text "<sigh> Encontrei um erro no código." \
--output-audio-path out.wav --mode continuation --seed 42
Related
- v1: marcosremar2/iaratts-sft-v1 — Phase 2 baseline (3 epochs, no tags)
- Roadmap: marcosremar2/iaratts-roadmap
License
MIT — same as upstream MOSS-TTS-Nano.