IaraTTS Instruction LM v1 (Phase 4.4)

A 45M-parameter, Qwen3-style decoder-only transformer trained with a symmetric InfoNCE contrastive loss on (instruction, 256-d audio feature vector) pairs.

Architecture

  • 4 layers, 8 heads, hidden size 576, FFN width ~2.3K (45M parameters total)
  • Vocab: toy hash tokenizer (TODO: replace with a real gpt2-pt tokenizer)
  • Output: 256-d emotion conditioning vector
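
The card does not describe how the toy hash tokenizer works, so the following is only a minimal sketch of what such a tokenizer typically looks like. The function name `hash_tokenize`, the whitespace splitting, the lowercasing, and `VOCAB_SIZE` are all assumptions made for illustration, not the author's implementation.

```python
import hashlib
from typing import List

# Hypothetical value: the model card does not state the vocabulary size.
VOCAB_SIZE = 32768


def hash_tokenize(text: str, vocab_size: int = VOCAB_SIZE) -> List[int]:
    """Map each whitespace-separated word to a stable bucket id in [0, vocab_size)."""
    ids = []
    for word in text.lower().split():
        # hashlib is used instead of the built-in hash(), which is salted
        # per process for strings and would change ids between runs.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        ids.append(int.from_bytes(digest[:8], "big") % vocab_size)
    return ids


ids = hash_tokenize("fale triste e devagar")
print(ids)  # four bucket ids, one per word
```

A scheme like this needs no training and never produces an unknown token, but unrelated words can collide in the same bucket and related word forms share nothing, which is one reason a learned subword tokenizer is the usual replacement.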

Training

  • Dataset: 4197 (instruction, audio) pairs from Erinome via WavLM-base 768d → PCA 256d
  • Loss: symmetric InfoNCE, temperature=0.07
  • 10 epochs, 1320 steps, AdamW lr=3e-4, batch_size=32
  • ~5 min of CPU training (the model is small enough for a laptop)
  • Final loss: 1.67 InfoNCE (random baseline = ln(32) ≈ 3.47)
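
To make the loss and the quoted numbers concrete, here is a dependency-free sketch of a symmetric InfoNCE loss over in-batch negatives. It assumes cosine similarity (L2-normalized embeddings) and an equal-weight average of the two directions; the card does not confirm either choice, and the function names are hypothetical. The step count additionally assumes the last partial batch of each epoch is kept.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax(row: Sequence[float]) -> List[float]:
    m = max(row)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(
    text_emb: Sequence[Sequence[float]],
    audio_emb: Sequence[Sequence[float]],
    temperature: float = 0.07,
) -> float:
    """Average of instruction->audio and audio->instruction cross-entropy."""
    t = [l2_normalize(v) for v in text_emb]
    a = [l2_normalize(v) for v in audio_emb]
    n = len(t)
    # logits[i][j]: scaled cosine similarity of instruction i and audio j.
    logits = [
        [sum(x * y for x, y in zip(t[i], a[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # The matching pair for row i (or column j) sits on the diagonal.
    t2a = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    a2t = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (t2a + a2t)


batch_size = 32
# Uninformative embeddings: every pair scores the same, so the loss is ln(32).
same = [[1.0, 0.0]] * batch_size
chance_loss = symmetric_info_nce(same, same)

# 4197 pairs at batch size 32 for 10 epochs, keeping the last partial batch.
steps = math.ceil(4197 / batch_size) * 10

print(round(chance_loss, 2), steps)
```

Under these assumptions the chance-level loss and the step count agree with the figures listed above, so a final loss of 1.67 sits well below the uninformative baseline for a batch of 32.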

Files

  • model.pt: final weights (175 MB)
  • step-500.pt, step-1000.pt: mid-training snapshots
  • config.json: InstructionLMConfig

Use case (intended)

Run once per request to map a natural-language instruction (e.g. "fale triste e devagar", Portuguese for "speak sadly and slowly") to a 256-d emotion conditioning vector, which is then injected as a cross-attention key into the AR backbone (Phase 5.5+).
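
The "once per request" pattern can be sketched as follows. This is only an illustration of the caching idea: `fake_instruction_encoder` is a hypothetical stand-in for the model's forward pass (it returns deterministic pseudo-values, not real embeddings), and `RequestConditioning` is an invented helper, not part of this repository.

```python
import hashlib
from typing import Callable, List, Optional

EMBED_DIM = 256  # from the card: 256-d emotion conditioning vector


def fake_instruction_encoder(instruction: str) -> List[float]:
    """Hypothetical stand-in for the Instruction LM; returns 256 pseudo-values."""
    out: List[float] = []
    counter = 0
    while len(out) < EMBED_DIM:
        digest = hashlib.sha256(f"{counter}:{instruction}".encode("utf-8")).digest()
        out.extend(b / 255.0 - 0.5 for b in digest)
        counter += 1
    return out[:EMBED_DIM]


class RequestConditioning:
    """Computes the conditioning vector lazily and reuses it for the whole request."""

    def __init__(self, instruction: str, encoder: Callable[[str], List[float]]):
        self.instruction = instruction
        self._encoder = encoder
        self._vector: Optional[List[float]] = None
        self.encoder_calls = 0

    def vector(self) -> List[float]:
        if self._vector is None:
            self._vector = self._encoder(self.instruction)
            self.encoder_calls += 1
        return self._vector


cond = RequestConditioning("fale triste e devagar", fake_instruction_encoder)
for _step in range(100):   # stand-in for the autoregressive decoding loop
    key = cond.vector()    # the same vector would feed cross-attention each step

print(len(key), cond.encoder_calls)
```

Because the instruction does not change during a request, the encoder cost is paid a single time regardless of how many decoding steps the backbone runs.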

Limitations

  • Toy hash tokenizer needs replacement before serious deployment
  • 45M parameters may be undersized; increase to 60-100M in v2 if needed
  • Pre-aligned WavLM features work but lose temporal dynamics; consider mel features with temporal pooling

License: MIT
