LFM2.5-CT-230M

LFM2.5-CT-230M is a custom audio-text-to-text model that fuses two pretrained architectures through a custom projection layer:

Component	Source	Role
FastConformer-TDT Encoder	`nvidia/parakeet-tdt-0.6b-v3`	Audio → frame embeddings (512-dim)
LFM2.5 Decoder	`LiquidAI/LFM2.5-230M`	Frame embeddings + text → text generation
Audio Projection	Custom (2-layer MLP + LayerNorm)	512 → 1024 projection

Architecture Details

The soft latent interface (C2A §3.2) projects FastConformer encoder hidden states into the LFM2.5 hidden space via:

Ẑ = AudioProjection(ASR_encoder(x)) ∈ R^(B × T_audio × 1024)
inputs_embeds = concat([Ẑ, TokenEmbeds(prompt)], dim=1)
output = LFM2.5_decoder(inputs_embeds)

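Because the audio embeddings enter the decoder as continuous vectors rather than discrete tokens, gradients from the LM reward signal can flow end to end back through the audio encoder. This is the key requirement of C2A joint optimisation.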
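The interface above can be sketched in PyTorch as follows. This is a minimal illustration, not the model's actual code: the class name `AudioProjection` follows the card, but the GELU activation, the placement of the LayerNorm after the second linear layer, the helper `build_inputs_embeds`, and the random stand-in tensors are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class AudioProjection(nn.Module):
    """Hypothetical 2-layer MLP + LayerNorm mapping 512-dim frames to 1024-dim."""

    def __init__(self, enc_dim: int = 512, lm_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(enc_dim, lm_dim),
            nn.GELU(),  # assumption: the card does not name the activation
            nn.Linear(lm_dim, lm_dim),
            nn.LayerNorm(lm_dim),  # assumption: LayerNorm applied last
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T_audio, 512) -> (B, T_audio, 1024)
        return self.net(frames)


def build_inputs_embeds(audio_frames, prompt_embeds, projection):
    """Prepend projected audio frames to the prompt embeddings along time."""
    z_hat = projection(audio_frames)
    return torch.cat([z_hat, prompt_embeds], dim=1)


torch.manual_seed(0)
B, T_audio, T_text = 2, 50, 8
# Random stand-ins for ASR_encoder(x) and TokenEmbeds(prompt)
audio_frames = torch.randn(B, T_audio, 512, requires_grad=True)
prompt_embeds = torch.randn(B, T_text, 1024)

projection = AudioProjection()
inputs_embeds = build_inputs_embeds(audio_frames, prompt_embeds, projection)

# Backpropagate a dummy scalar to check that gradients reach the encoder side
dummy_loss = (inputs_embeds * torch.randn_like(inputs_embeds)).sum()
dummy_loss.backward()
```

With these stand-in sizes, `inputs_embeds` has shape (2, 58, 1024), and `audio_frames.grad` is populated after the backward pass, which is the property the joint optimisation relies on.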

Hardware

Trained on an NVIDIA GTX 1080 (8 GB VRAM, sm_61) using fp16, gradient checkpointing, and micro-batch gradient accumulation.
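To illustrate how two of these memory-saving techniques combine, here is a small sketch of micro-batch gradient accumulation with gradient checkpointing. It is not the training script used for this model: the toy model, tensor sizes, and the helper `accumulate_gradients` are hypothetical, and fp16 is left out so the example also runs on CPU.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
# Toy stand-in model and data; sizes are arbitrary
model = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 1))
data = torch.randn(8, 16)
target = torch.randn(8, 1)
loss_fn = nn.MSELoss()


def accumulate_gradients(model, data, target, micro_batch_size):
    """Accumulate gradients over micro-batches of equal size."""
    model.zero_grad()
    n_micro = data.shape[0] // micro_batch_size
    chunks = zip(data.split(micro_batch_size), target.split(micro_batch_size))
    for x, y in chunks:
        # Checkpointing recomputes activations in backward instead of storing them
        pred = checkpoint(model, x, use_reentrant=False)
        # Scale so the summed micro-batch gradients match the full-batch mean loss
        (loss_fn(pred, y) / n_micro).backward()
    # In a real loop, optimizer.step() would follow here
    return [p.grad.clone() for p in model.parameters()]


accumulated = accumulate_gradients(model, data, target, micro_batch_size=2)

# Reference: one backward pass over the full batch
model.zero_grad()
loss_fn(model(data), target).backward()
full_batch = [p.grad.clone() for p in model.parameters()]

max_diff = max((a - f).abs().max().item() for a, f in zip(accumulated, full_batch))
```

The accumulated gradients agree with the full-batch gradients up to floating-point error, while only one micro-batch of activations needs to be held in memory at a time.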

Intended Use

Speech-to-text with LLM-style reasoning
Voice-driven task completion (C2A domains: food ordering, scheduling, navigation)
Research into clicker-conditioned RLHF for ASR+LLM systems

Limitations

The ASR encoder supports only the 25 European languages covered by parakeet-tdt-0.6b-v3.
The LFM2.5 decoder is text-only; the model does not generate audio.
Not recommended for safety-critical applications without human oversight.

All AI-generated.

Model tree for ray0rf1re/lfm2.5-CT-230m

Base model: LiquidAI/LFM2.5-230M-Base
Finetuned: LiquidAI/LFM2.5-230M
Finetuned: this model