LFM2.5-Embedding-350M — CrispEmbed GGUF

CrispEmbed-native GGUF quantizations of LiquidAI/LFM2.5-Embedding-350M.

Format note: These GGUFs use CrispEmbed's internal tensor naming (`lfm.*` prefix, arch `lfm2`). They are not interchangeable with the official LiquidAI GGUFs, which target llama.cpp (arch `lfm2-bidir`, `blk.*` tensor naming). If you want to run the model with llama.cpp or llama-server, use the LiquidAI GGUFs instead.

Files

| File | Size | Description |
|---|---|---|
| `lfm2-embed-q8_0.gguf` | 359 MB | 8-bit quantization; near-lossless (CLS cosine 0.9999 vs float32), recommended |
| `lfm2-embed-q4_k.gguf` | 222 MB | 4-bit K-quant; about 3× smaller than f16, CLS cosine 0.982 vs float32 |
| `lfm2-embed-f16.gguf` | 678 MB | Unquantized fp16; reference precision |
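For intuition about why q8_0 stays so close to the reference, here is a minimal NumPy sketch of the standard ggml Q8_0 scheme: weights are split into blocks of 32, and each block stores one fp16 scale `d = max|x| / 127` plus 32 int8 codes. It illustrates the format only; it is not CrispEmbed's quantizer, and `quantize_q8_0` / `dequantize_q8_0` are hypothetical names.

```python
import numpy as np

BLOCK = 32  # ggml Q8_0 block size

def quantize_q8_0(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Quantize a 1-D float array (length divisible by 32) to per-block fp16 scales and int8 codes."""
    blocks = x.astype(np.float32).reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    d = (amax / 127.0).astype(np.float16)        # one fp16 scale per block
    d32 = d.astype(np.float32)
    safe = np.where(d32 == 0, 1.0, d32)          # avoid division by zero for all-zero blocks
    q = np.clip(np.rint(blocks / safe), -127, 127).astype(np.int8)
    return d, q

def dequantize_q8_0(d: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Reconstruct floats as code * block scale."""
    return (q.astype(np.float32) * d.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)  # stand-in for one weight row
d, q = quantize_q8_0(w)
w_hat = dequantize_q8_0(d, q)

cos = float(np.dot(w, w_hat) / (np.linalg.norm(w) * np.linalg.norm(w_hat)))
bits_per_weight = (d.nbytes + q.nbytes) * 8 / w.size  # 34 bytes per 32 weights
print(f"cosine={cos:.6f}, bits/weight={bits_per_weight}")
```

At 8.5 bits per weight versus 16 for f16, the ratio is about 0.53, roughly consistent with the 359 MB / 678 MB file sizes in the table.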

Parity (CrispEmbed vs Hugging Face float32 `Lfm2BidirectionalModel`, cosine similarity)

| Stage | Cosine | Notes |
|---|---|---|
| per-layer, q8_0 (all 16) | ≥ 0.9999 | measured on a 3-token input with `test-lfm2-diff` |
| CLS embedding, q8_0 | 0.9999 | 5 diverse test sentences |
| CLS embedding, q4_k | 0.982 | expected accuracy floor for q4_k quantization |
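The cosine figures above compare CrispEmbed outputs against the float32 reference. A minimal sketch of that kind of per-layer check, assuming you have already dumped matching hidden states from both implementations as NumPy arrays of shape (tokens, hidden). The dumping step is not shown, `per_layer_min_cosine` is a hypothetical helper, and this is not the code of `test-lfm2-diff` itself.

```python
import numpy as np

def per_layer_min_cosine(ref_layers, test_layers):
    """For each layer, return the worst per-token cosine between reference and test hidden states."""
    if len(ref_layers) != len(test_layers):
        raise ValueError("layer counts differ")
    results = []
    for ref, test in zip(ref_layers, test_layers):
        num = np.sum(ref * test, axis=1)                       # per-token dot products
        den = np.linalg.norm(ref, axis=1) * np.linalg.norm(test, axis=1)
        results.append(float(np.min(num / den)))
    return results

# Toy data: 16 layers, 3 tokens, hidden size 1024; small noise stands in for quantization error.
rng = np.random.default_rng(0)
ref = [rng.standard_normal((3, 1024)) for _ in range(16)]
test = [r + 1e-3 * rng.standard_normal(r.shape) for r in ref]
scores = per_layer_min_cosine(ref, test)
print(f"worst layer: {min(scores):.6f}")
```

Taking the minimum over tokens, rather than the mean, surfaces a single badly diverging position that an average would hide.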

Model

Architecture: 16-layer hybrid (10 ShortConv + 6 GQA attention), hidden=1024
Pooling: CLS token (position 0) of last hidden state, L2-normalized
Dimension: 1024
Languages: 11 (en, de, fr, es, it, pt, nl, pl, ru, ja, zh)
Parameters: 350M
Task prefixes: "query: " for queries, "document: " for passages
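The pooling and prefix rules above can be sketched in NumPy. This assumes `last_hidden` is the model's final hidden-state array of shape (tokens, 1024) for one input, with the CLS token at position 0. The helper names are hypothetical, and the sketch does not run the model itself.

```python
import numpy as np

PREFIXES = {"query": "query: ", "document": "document: "}

def with_prefix(text: str, kind: str) -> str:
    """Prepend the task prefix ("query" or "document") the model expects."""
    return PREFIXES[kind] + text

def cls_pool(last_hidden: np.ndarray) -> np.ndarray:
    """Take the CLS token (position 0) of the last hidden state and L2-normalize it."""
    cls = last_hidden[0].astype(np.float32)
    return cls / np.linalg.norm(cls)

rng = np.random.default_rng(0)
last_hidden = rng.standard_normal((7, 1024))  # stand-in for real model output
emb = cls_pool(last_hidden)
print(with_prefix("What is the capital of France?", "query"))
print(emb.shape, float(np.linalg.norm(emb)))
```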

Usage with CrispEmbed

CLI

```bash
# Download the model files (stored under ~/.cache/crispembed/)
./crispembed --download lfm2-embed

# Embed a query (the "query: " prefix is applied automatically)
./crispembed -m ~/.cache/crispembed/lfm2-embed-q8_0.gguf "What is the capital of France?"

# Embed a document (--prefix replaces the automatic query prefix)
./crispembed -m ~/.cache/crispembed/lfm2-embed-q8_0.gguf \
  --prefix "document: " "Paris is the capital of France."

# JSON output for downstream use
./crispembed -m ~/.cache/crispembed/lfm2-embed-q8_0.gguf --json "query: machine learning"
```
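To script the CLI from Python, a small wrapper can assemble the same arguments shown above. This is a sketch under assumptions: it uses only the `-m`, `--prefix`, and `--json` flags from the examples, assumes the binary is `./crispembed` and that queries rely on the automatic prefix, and returns the parsed JSON without assuming anything about its schema. `build_cmd` and `embed` are hypothetical names.

```python
import json
import os
import subprocess

MODEL = os.path.expanduser("~/.cache/crispembed/lfm2-embed-q8_0.gguf")

def build_cmd(text: str, kind: str = "query", model: str = MODEL,
              binary: str = "./crispembed") -> list[str]:
    """Assemble a crispembed invocation; documents get an explicit --prefix."""
    cmd = [binary, "-m", model, "--json"]
    if kind == "document":
        cmd += ["--prefix", "document: "]
    elif kind != "query":
        raise ValueError(f"unknown kind: {kind!r}")
    cmd.append(text)
    return cmd

def embed(text: str, kind: str = "query"):
    """Run the CLI and parse its JSON output (schema not assumed here)."""
    out = subprocess.run(build_cmd(text, kind), check=True, capture_output=True, text=True)
    return json.loads(out.stdout)

print(build_cmd("Paris is the capital of France.", kind="document"))
```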

Python (via crispembed Python bindings)

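```python
import os

import numpy as np
import crispembed

# Expand "~" explicitly; the loader may not do it for you.
model = crispembed.load(os.path.expanduser("~/.cache/crispembed/lfm2-embed-q8_0.gguf"))

# Prepend the task prefix to each input.
query_emb = model.encode("query: What is the capital of France?")
doc_emb   = model.encode("document: Paris is the capital of France.")

# Both embeddings are already L2-normalized, so the dot product is the cosine similarity.
score = np.dot(query_emb, doc_emb)
print(f"Similarity: {score:.4f}")
```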
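Because the embeddings are L2-normalized, ranking several passages against one query reduces to a matrix-vector product. A minimal sketch, assuming you have stacked `model.encode(...)` outputs into NumPy arrays. Random unit vectors stand in for real embeddings here, and `rank_documents` is a hypothetical helper.

```python
import numpy as np

def rank_documents(query_emb: np.ndarray, doc_embs: np.ndarray, top_k: int = 3):
    """Return (index, score) pairs for the top_k documents by cosine similarity.

    query_emb has shape (1024,); doc_embs has shape (n_docs, 1024); both L2-normalized.
    """
    scores = doc_embs @ query_emb          # dot product equals cosine for unit vectors
    order = np.argsort(-scores)[:top_k]    # highest score first
    return [(int(i), float(scores[i])) for i in order]

rng = np.random.default_rng(0)
docs = rng.standard_normal((5, 1024))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[2] + 0.01 * rng.standard_normal(1024)  # a query close to document 2
query /= np.linalg.norm(query)

print(rank_documents(query, docs))
```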

Rust

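```rust
use crispembed::CrispEmbed;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the quantized model from a local path.
    let model = CrispEmbed::load("lfm2-embed-q8_0.gguf")?;
    // Queries take the "query: " prefix; the result is an L2-normalized embedding.
    let emb = model.encode("query: hello world")?;
    Ok(())
}
```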

Comparison with official LiquidAI GGUFs

| | This repo | LiquidAI/LFM2.5-Embedding-350M-GGUF |
|---|---|---|
| Runtime | CrispEmbed | llama.cpp / llama-server |
| GGUF arch tag | `lfm2` | `lfm2-bidir` |
| Tensor naming | `lfm.*` prefix | `blk.*` (llama.cpp convention) |
| Quantizations | f16, q8_0, q4_k | BF16, F16, Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0 |
| q8_0 size | 359 MB | 379 MB |
| Metal GPU | Yes (Apple Silicon) | Yes |
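To tell the two GGUF flavors apart before loading a file, you can inspect its architecture tag and tensor names. A sketch using the `gguf` Python package from the llama.cpp project (`pip install gguf`). The classification rule simply encodes the table above, and `classify_gguf` / `inspect_gguf` are hypothetical helpers. It checks the tensor prefix first because an `lfm2` arch tag may also appear on llama.cpp-format LFM2 files.

```python
def classify_gguf(arch: str, tensor_names: list[str]) -> str:
    """Map a GGUF's arch tag and tensor names to the runtime it targets."""
    if any(name.startswith("lfm.") for name in tensor_names):
        return "crispembed"
    if arch == "lfm2-bidir" or any(name.startswith("blk.") for name in tensor_names):
        return "llama.cpp"
    return "unknown"

def inspect_gguf(path: str) -> str:
    """Read metadata with the gguf package (imported lazily so classify_gguf works without it)."""
    from gguf import GGUFReader

    reader = GGUFReader(path)
    field = reader.fields["general.architecture"]
    arch = bytes(field.parts[field.data[0]]).decode("utf-8")
    names = [t.name for t in reader.tensors]
    return classify_gguf(arch, names)

# Hypothetical tensor name, for illustration only.
print(classify_gguf("lfm2", ["lfm.embed.weight"]))
```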

Conversion

To convert the source model yourself:

```bash
git clone https://github.com/CrispStrobe/CrispEmbed
cd CrispEmbed

# Download the source checkpoint and convert it to an f16 GGUF
python models/convert-lfm2-embed-to-gguf.py \
    --model LiquidAI/LFM2.5-Embedding-350M \
    --output lfm2-embed-f16.gguf --dtype f16

# Quantize (requires a CrispEmbed build with binaries in ./build)
./build/crispembed-quantize lfm2-embed-f16.gguf lfm2-embed-q8_0.gguf q8_0
./build/crispembed-quantize lfm2-embed-f16.gguf lfm2-embed-q4_k.gguf q4_k
```
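If you build several variants, the two quantize commands generalize to a loop. A sketch that shells out to the same `./build/crispembed-quantize` binary with the same argument order (input, output, type). Only `q8_0` and `q4_k` are types this card confirms, and `quantize_cmds` / `run_all` are hypothetical helpers.

```python
import subprocess
from pathlib import Path

QUANTIZE = "./build/crispembed-quantize"

def quantize_cmds(f16_path: str, types: list[str]) -> list[list[str]]:
    """Build one quantize command per target type, deriving output names from the f16 file."""
    src = Path(f16_path)
    stem = src.stem.removesuffix("-f16")  # "lfm2-embed-f16" -> "lfm2-embed"
    return [[QUANTIZE, str(src), str(src.with_name(f"{stem}-{t}.gguf")), t] for t in types]

def run_all(f16_path: str, types: list[str]) -> None:
    """Run each command in turn, stopping on the first failure."""
    for cmd in quantize_cmds(f16_path, types):
        subprocess.run(cmd, check=True)

# Dry run: print the commands instead of executing them.
for cmd in quantize_cmds("lfm2-embed-f16.gguf", ["q8_0", "q4_k"]):
    print(" ".join(cmd))
```

From the repository root, `run_all("lfm2-embed-f16.gguf", ["q8_0", "q4_k"])` would execute the same two commands shown above.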

License

LFM1.0 — same as the base model.

Model tree for cstr/lfm2-embed-GGUF: base model LiquidAI/LFM2.5-350M-Base, finetuned as LiquidAI/LFM2.5-Embedding-350M, quantized as this repository.