IaraTTS Instruction LM v1 (Phase 4.4)
45M-param Qwen3-style decoder-only transformer trained via InfoNCE contrastive loss on (instruction, audio_features 256d) pairs.
Architecture
- 4 layers, 8 heads, hidden 576, FFN ~2.3K (45M total)
- Vocab: toy hash tokenizer (TODO: replace with real gpt2-pt tokenizer)
- Output: 256-d emotion conditioning vector
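To make the "toy hash tokenizer" item concrete, here is a minimal sketch of what such a tokenizer could look like. It is not the tokenizer shipped with this model: the function name `hash_tokenize`, the MD5 hash, the vocabulary size of 32000, the maximum length of 32, and the padding id 0 are all assumptions chosen for illustration.

```python
import hashlib


def hash_tokenize(text: str, vocab_size: int = 32000,
                  max_len: int = 32, pad_id: int = 0) -> list[int]:
    """Map whitespace-separated words to bucket ids with a stable hash.

    Hypothetical sketch: vocab_size, max_len and pad_id are illustrative
    values, not the ones used to train this model.
    """
    ids = []
    for word in text.lower().split()[:max_len]:
        digest = hashlib.md5(word.encode("utf-8")).digest()
        # Reserve pad_id (0) for padding; words land in 1..vocab_size-1.
        ids.append(1 + int.from_bytes(digest[:8], "big") % (vocab_size - 1))
    # Right-pad to a fixed length so batches have a uniform shape.
    ids += [pad_id] * (max_len - len(ids))
    return ids


ids = hash_tokenize("fale triste e devagar")
print(len(ids), ids[:4])
```

A tokenizer of this kind has no notion of subwords and lets unrelated words collide in the same bucket, which is one reason to replace it with a trained tokenizer.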
Training
- Dataset: 4197 (instruction, audio) pairs from Erinome via WavLM-base 768d → PCA 256d
- Loss: symmetric InfoNCE, temperature=0.07
- 10 epochs, 1320 steps, AdamW lr=3e-4, batch_size=32
- ~5min CPU training (model small enough for laptop)
- Final loss: 1.67 InfoNCE (random baseline = ln(32) ≈ 3.47)
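The loss and its random baseline can be illustrated with a small dependency-free sketch. It assumes cosine similarity on L2-normalized embeddings and in-batch negatives, which is the usual InfoNCE setup but is not confirmed by this card; the function names are hypothetical, and the actual training code presumably uses a tensor library.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax(row):
    m = max(row)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(text_emb, audio_emb, temperature=0.07):
    """Average of text->audio and audio->text cross-entropy over a batch.

    Row i of text_emb and row i of audio_emb form the positive pair;
    every other row in the batch serves as a negative.
    """
    t = [l2_normalize(v) for v in text_emb]
    a = [l2_normalize(v) for v in audio_emb]
    n = len(t)
    logits = [[sum(x * y for x, y in zip(ti, aj)) / temperature for aj in a]
              for ti in t]
    loss_t2a = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    logits_t = [list(col) for col in zip(*logits)]  # transpose for audio->text
    loss_a2t = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_t2a + loss_a2t)


# Uninformative embeddings: every pair is equally similar, so the loss
# equals ln(batch_size), i.e. ln(32) ≈ 3.47 for a batch of 32.
flat = [[1.0, 0.0]] * 32
print(symmetric_info_nce(flat, flat), math.log(32))
```

On the step count: ceil(4197 / 32) = 132 steps per epoch, and 132 × 10 = 1320, which matches the figure above and suggests the final partial batch of each epoch is kept rather than dropped.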
Files
- model.pt: final weights (175MB)
- step-500.pt, step-1000.pt: mid-train snapshots
- config.json: InstructionLMConfig
Use case (intended)
Run once per request to map a natural-language instruction (e.g. "fale triste e devagar", Portuguese for "speak sadly and slowly") into a 256-d emotion conditioning vector that is injected as a cross-attention key into the AR backbone (Phase 5.5+).
Limitations
- Toy hash tokenizer needs replacement before serious deployment
- 45M may be undersized; increase to 60-100M in v2 if needed
- Pre-aligned WavLM features work but lose dynamics; consider mel + temporal pooling
License: MIT