LFM2.5-CT-230M
LFM2.5-CT-230M is a custom audio-text-to-text model that fuses two architectures, a speech encoder and a text decoder, through a learned audio projection:
| Component | Source | Role |
|---|---|---|
| FastConformer-TDT Encoder | nvidia/parakeet-tdt-0.6b-v3 | Audio → frame embeddings (512-dim) |
| LFM2.5 Decoder | LiquidAI/LFM2.5-230M | Frame embeddings + text → text generation |
| Audio Projection | Custom (2-layer MLP + LayerNorm) | 512 → 1024 projection |
Architecture Details
The soft latent interface (C2A §3.2) projects FastConformer encoder hidden states into the LFM2.5 hidden space via:
Ẑ = AudioProjection(ASR_encoder(x)) ∈ R^(B × T_audio × 1024)
inputs_embeds = concat([Ẑ, TokenEmbeds(prompt)], dim=1)
output = LFM2.5_decoder(inputs_embeds)
Because every step in this path is differentiable, gradients from the LM reward signal can flow end to end back through the audio encoder, which is the key requirement of C2A joint optimisation.
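The interface above can be sketched in PyTorch as follows. This is a minimal illustration and not the released implementation: the activation (GELU), the placement of the LayerNorm, the helper name `build_inputs_embeds`, and all toy sizes are assumptions, and a random tensor stands in for the FastConformer encoder output. A dummy regression loss replaces the LM loss only to show that gradients reach the encoder side.

```python
import torch
import torch.nn as nn


class AudioProjection(nn.Module):
    """Hypothetical 2-layer MLP + LayerNorm mapping 512-dim frames to 1024-dim."""

    def __init__(self, in_dim: int = 512, out_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),               # assumption: the card does not name the activation
            nn.Linear(out_dim, out_dim),
            nn.LayerNorm(out_dim),   # assumption: LayerNorm is applied last
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.net(frames)


def build_inputs_embeds(frames, prompt_ids, projection, token_embeds):
    """Concatenate projected audio frames and prompt embeddings along the time axis."""
    z_hat = projection(frames)               # (B, T_audio, 1024)
    text = token_embeds(prompt_ids)          # (B, T_text, 1024)
    return torch.cat([z_hat, text], dim=1)   # (B, T_audio + T_text, 1024)


torch.manual_seed(0)
B, T_AUDIO, T_TEXT, VOCAB = 2, 50, 8, 100    # toy sizes, not the real model's
projection = AudioProjection()
token_embeds = nn.Embedding(VOCAB, 1024)     # stand-in for the decoder's embedding table

# Stand-in for ASR_encoder(x); requires_grad lets us inspect the gradient it receives.
frames = torch.randn(B, T_AUDIO, 512, requires_grad=True)
prompt_ids = torch.randint(0, VOCAB, (B, T_TEXT))

inputs_embeds = build_inputs_embeds(frames, prompt_ids, projection, token_embeds)

# Dummy loss in place of the LM objective, used only to trigger a backward pass.
target = torch.randn_like(inputs_embeds)
loss = (inputs_embeds - target).pow(2).mean()
loss.backward()

print(inputs_embeds.shape)                   # torch.Size([2, 58, 1024])
print(frames.grad.abs().sum().item() > 0)    # gradient reaches the encoder output
```

In a full model, `inputs_embeds` would be handed to the decoder in place of token IDs, assuming the decoder accepts pre-computed embeddings as the pseudocode above implies.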
Hardware
Trained on an NVIDIA GTX 1080 (8 GB VRAM, sm_61) using fp16, gradient checkpointing, and gradient accumulation over micro-batches.
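The three memory-saving techniques named above can be combined roughly as in the sketch below. It is an illustration under stated assumptions, not the actual training script: the toy `block` and `head` modules stand in for the real model, and `ACCUM_STEPS`, `MICRO_BATCH`, the optimizer, and the loss are hypothetical. fp16 autocast and loss scaling are enabled only when a CUDA device is present, so the sketch also runs on CPU.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_fp16 = device.type == "cuda"   # fp16 autocast is only enabled on the GPU
amp_dtype = torch.float16 if use_fp16 else torch.bfloat16

torch.manual_seed(0)
# Toy stand-in for the model; the real encoder and decoder are not reproduced here.
block = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)).to(device)
head = nn.Linear(64, 1).to(device)
params = list(block.parameters()) + list(head.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-3)
# Loss scaling guards against fp16 gradient underflow; it is a no-op when disabled.
scaler = torch.cuda.amp.GradScaler(enabled=use_fp16)

ACCUM_STEPS = 4    # hypothetical: micro-batches per optimizer step
MICRO_BATCH = 2    # hypothetical: samples per micro-batch


def train(num_micro_batches: int) -> int:
    """Run the given number of micro-batches; return optimizer steps taken."""
    optimizer_steps = 0
    optimizer.zero_grad(set_to_none=True)
    for i in range(num_micro_batches):
        x = torch.randn(MICRO_BATCH, 64, device=device)
        y = torch.randn(MICRO_BATCH, 1, device=device)
        with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_fp16):
            # Recompute block activations during backward instead of storing them.
            hidden = checkpoint(block, x, use_reentrant=False)
            loss = nn.functional.mse_loss(head(hidden), y)
        # Divide so the accumulated gradient matches one large-batch average.
        scaler.scale(loss / ACCUM_STEPS).backward()
        if (i + 1) % ACCUM_STEPS == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
            optimizer_steps += 1
    return optimizer_steps


steps = train(num_micro_batches=8)
print(steps)  # 2
```

With these hypothetical settings the effective batch size is `MICRO_BATCH * ACCUM_STEPS = 8`, while peak activation memory is bounded by a single micro-batch.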
Intended Use
- Speech-to-text with LLM-style reasoning
- Voice-driven task completion (C2A domains: food ordering, scheduling, navigation)
- Research into clicker-conditioned RLHF for ASR+LLM systems
Limitations
- The ASR encoder covers only the 25 European languages supported by parakeet-tdt-0.6b-v3.
- The LFM2.5 decoder is text-only, so the model cannot generate audio.
- Not recommended for safety-critical applications without human oversight.
All AI-generated.