IaraTTS Instruction LM v1 — Phase 4.4

A 45M-parameter, Qwen3-style decoder-only transformer trained with an InfoNCE contrastive loss on pairs of (instruction, 256-d audio features).

Architecture

4 layers, 8 heads, hidden size 576, FFN width ~2.3K (45M parameters total)
Vocab: toy hash tokenizer (TODO: replace with real gpt2-pt tokenizer)
Output: 256-d emotion conditioning vector
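
The card does not show the toy hash tokenizer, so the following is only a minimal sketch of what such a tokenizer typically looks like. The function name `hash_tokenize`, the whitespace splitting, the lowercasing, and the vocabulary size of 32,768 are all assumptions made for illustration, not the values used in this model.

```python
import hashlib

VOCAB_SIZE = 32_768  # hypothetical; the card does not state the vocabulary size


def hash_tokenize(text: str, vocab_size: int = VOCAB_SIZE) -> list[int]:
    """Map each whitespace-separated word to a stable bucket id in [0, vocab_size)."""
    ids = []
    for word in text.lower().split():
        # hashlib gives the same digest in every process, unlike built-in hash()
        digest = hashlib.md5(word.encode("utf-8")).digest()
        ids.append(int.from_bytes(digest[:8], "big") % vocab_size)
    return ids


token_ids = hash_tokenize("fale triste e devagar")  # one id per word
```

With this kind of scheme, unrelated words can collide in the same bucket and related words ("triste", "tristeza") share nothing, which is one reason a real subword tokenizer is the usual replacement.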

Training

Dataset: 4197 (instruction, audio) pairs from Erinome; audio is encoded with WavLM-base (768-d), then reduced to 256-d with PCA
Loss: symmetric InfoNCE, temperature=0.07
10 epochs, 1320 steps, AdamW lr=3e-4, batch_size=32
About 5 minutes of training on CPU (the model is small enough for a laptop)
Final loss: 1.67 InfoNCE (chance-level baseline for a batch of 32 = ln(32) ≈ 3.47)
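
To make the loss and its baseline concrete, here is a dependency-free sketch of a symmetric InfoNCE loss. It is not the training code: the function names are hypothetical, and it assumes L2-normalized embeddings with cosine similarity, which the card does not confirm. The final lines check the ln(32) baseline and the step count, assuming the last partial batch of each epoch is kept.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def symmetric_info_nce(text_emb, audio_emb, temperature=0.07):
    """Mean of text->audio and audio->text cross-entropy; pair i matches pair i."""
    t = [normalize(v) for v in text_emb]
    a = [normalize(v) for v in audio_emb]
    n = len(t)
    logits = [[sum(x * y for x, y in zip(t[i], a[j])) / temperature
               for j in range(n)] for i in range(n)]
    # text -> audio: softmax over each row; audio -> text: over each column
    t2a = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    cols = [list(col) for col in zip(*logits)]
    a2t = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (t2a + a2t)


# Uninformative embeddings: every pair is equally similar, so loss = ln(batch).
batch = 32
same = [[1.0, 0.0]] * batch
baseline = symmetric_info_nce(same, same)

# 132 batches per epoch for 10 epochs, assuming the last partial batch is kept.
steps = math.ceil(4197 / batch) * 10
```

Under these assumptions the baseline evaluates to ln(32) and the step count to 1320, both consistent with the figures reported above.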

Files

model.pt — final weights (175MB)
step-500.pt, step-1000.pt — mid-train snapshots
config.json — InstructionLMConfig
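
The schema of config.json is not documented here. As a rough illustration only, a config object carrying the numbers from the Architecture section might look like the sketch below; the field names (`n_layers`, `n_heads`, `hidden_size`, `out_dim`) and the `load_config` helper are guesses, so check the actual file before relying on them.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class InstructionLMConfig:
    # Values come from the Architecture section; the field names are guesses.
    n_layers: int = 4
    n_heads: int = 8
    hidden_size: int = 576
    out_dim: int = 256  # size of the emotion conditioning vector


def load_config(path: str) -> InstructionLMConfig:
    """Read a JSON file and keep only the keys this sketch knows about."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    fields = InstructionLMConfig.__dataclass_fields__
    known = {k: v for k, v in raw.items() if k in fields}
    return InstructionLMConfig(**known)


config = InstructionLMConfig()
head_dim = config.hidden_size // config.n_heads  # 576 / 8 = 72 per head
```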

Use case (intended)

Run once per request to map a natural-language instruction (e.g. "fale triste e devagar", Portuguese for "speak sadly and slowly") into a 256-d emotion conditioning vector, which is injected as a cross-attention key into the AR backbone (Phase 5.5+).

Limitations

Toy hash tokenizer needs replacement before serious deployment
45M may be undersized — increase to 60-100M in v2 if needed
Pre-aligned WavLM features work but lose temporal dynamics; consider mel-spectrogram input with temporal pooling
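
The last limitation can be illustrated with a toy example. The card does not say how frame-level features are currently reduced to one vector per utterance, so the sketch below simply assumes a plain time average and contrasts it with mean-plus-standard-deviation pooling, one common way to keep some information about variation over time. Both function names are hypothetical.

```python
import statistics


def mean_pool(frames):
    """Average each feature dimension over time: T x D -> D."""
    return [statistics.fmean(col) for col in zip(*frames)]


def mean_std_pool(frames):
    """Concatenate per-dimension mean and standard deviation: T x D -> 2D."""
    cols = list(zip(*frames))
    means = [statistics.fmean(col) for col in cols]
    stds = [statistics.pstdev(col) for col in cols]
    return means + stds


# Two toy 1-d "utterances" with the same average but different dynamics.
flat = [[0.5], [0.5], [0.5], [0.5]]
swing = [[0.0], [1.0], [0.0], [1.0]]

same_mean = mean_pool(flat) == mean_pool(swing)              # averaging cannot tell them apart
differ_with_std = mean_std_pool(flat) != mean_std_pool(swing)  # the std term can
```

Mean-plus-std pooling doubles the feature width, so the PCA step or the model's output dimension would need to be adjusted if this route were taken.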

License: MIT
