LFM2.5-CT-230M

LFM2.5-CT-230M is a custom audio-text-to-text model that joins two architectures through a custom projection layer:

Component                 | Source                           | Role
--------------------------|----------------------------------|------------------------------------------
FastConformer-TDT Encoder | nvidia/parakeet-tdt-0.6b-v3      | Audio → frame embeddings (512-dim)
LFM2.5 Decoder            | LiquidAI/LFM2.5-230M             | Frame embeddings + text → text generation
Audio Projection          | Custom (2-layer MLP + LayerNorm) | 512 → 1024 projection

Architecture Details

The soft latent interface (C2A §3.2) projects FastConformer encoder hidden states into the LFM2.5 hidden space via:

Ẑ = AudioProjection(ASR_encoder(x)) ∈ R^(B × T_audio × 1024)
inputs_embeds = concat([Ẑ, TokenEmbeds(prompt)], dim=1)
output = LFM2.5_decoder(inputs_embeds)

This allows end-to-end gradient flow from the LM reward signal back through the audio encoder, which is the key requirement of C2A joint optimisation.
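The interface above can be sketched in plain Python to make the shapes concrete. This is a minimal illustration, not the model's actual implementation: the class name `AudioProjection` comes from the formula, but the GELU activation, the placement of LayerNorm after the second linear layer, the random weights, and the helper names (`linear`, `init_linear`, `build_inputs_embeds`) are all assumptions. It handles a single sequence, so the batch dimension B is omitted.

```python
import math
import random

ENC_DIM = 512   # FastConformer frame-embedding width listed in the card
LM_DIM = 1024   # projection output width listed in the card


def linear(x, weight, bias):
    """Compute W x + b for a single frame vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(x):
    """Exact GELU; the activation is an assumption, the card does not name one."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]


def layer_norm(x, eps=1e-5):
    """LayerNorm without learnable scale/shift, to keep the sketch short."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def init_linear(rng, in_dim, out_dim):
    """Random placeholder weights; a real model would load trained ones."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class AudioProjection:
    """Hypothetical 2-layer MLP followed by LayerNorm (ordering assumed)."""

    def __init__(self, in_dim=ENC_DIM, out_dim=LM_DIM, seed=0):
        rng = random.Random(seed)
        self.fc1 = init_linear(rng, in_dim, out_dim)
        self.fc2 = init_linear(rng, out_dim, out_dim)

    def __call__(self, frames):
        # frames: T_audio vectors of length in_dim -> T_audio vectors of length out_dim
        return [layer_norm(linear(gelu(linear(f, *self.fc1)), *self.fc2)) for f in frames]


def build_inputs_embeds(audio_frames, prompt_embeds, projection):
    """Project the audio frames, then prepend them to the prompt embeddings."""
    z_hat = projection(audio_frames)
    return z_hat + prompt_embeds  # list concatenation = concat along the time axis


# Stand-in data: 3 encoder frames and 2 prompt-token embeddings.
rng = random.Random(1)
audio_frames = [[rng.gauss(0.0, 1.0) for _ in range(ENC_DIM)] for _ in range(3)]
prompt_embeds = [[0.0] * LM_DIM for _ in range(2)]

inputs_embeds = build_inputs_embeds(audio_frames, prompt_embeds, AudioProjection())
print(len(inputs_embeds), len(inputs_embeds[0]))
```

The resulting sequence has T_audio + T_prompt positions, each 1024 wide, which is the form the decoder consumes as `inputs_embeds`. In a real training setup the same computation would be expressed in an autodiff framework so that gradients can reach the encoder.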

Hardware

Trained on an NVIDIA GTX 1080 (8 GB VRAM, sm_61) using fp16, gradient checkpointing, and micro-batch gradient accumulation.
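Micro-batch accumulation is what lets a large effective batch fit in 8 GB: gradients from several small batches are summed before one optimizer step. The toy example below, a one-parameter least-squares model with made-up data, is only meant to show why the result matches a full-batch gradient when micro-batches are equally sized. The function names and data are hypothetical and this is not the training loop used for this model.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ≈ w * x over (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Build the batch gradient from equally sized micro-batches."""
    if len(batch) % micro_batch_size != 0:
        raise ValueError("batch size must be a multiple of micro_batch_size")
    num_steps = len(batch) // micro_batch_size
    total = 0.0
    for start in range(0, len(batch), micro_batch_size):
        micro = batch[start:start + micro_batch_size]
        # Scaling each micro-batch mean by 1/num_steps keeps the running
        # sum equal to the mean over the whole batch.
        total += grad_mse(w, micro) / num_steps
    return total


batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full = grad_mse(0.5, batch)
accum = accumulated_grad(0.5, batch, micro_batch_size=1)
print(abs(full - accum) < 1e-9)
```

Only one micro-batch's activations need to be held in memory at a time, so peak memory depends on the micro-batch size, not on the effective batch size.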

Intended Use

  • Speech-to-text with LLM-style reasoning
  • Voice-driven task completion (C2A domains: food ordering, scheduling, navigation)
  • Research into clicker-conditioned RLHF for ASR+LLM systems

Limitations

  • The ASR encoder supports only the 25 European languages covered by parakeet-tdt-0.6b-v3.
  • The LFM2.5 decoder is text-only; the model does not generate audio.
  • Not recommended for safety-critical applications without human oversight.

All AI generated.
