LFM2.5-Embedding-350M-int8 (MLX, affine)

MLX affine (group_size=64, 8-bit) quantization of LiquidAI/LFM2.5-Embedding-350M, a bidirectional LFM2.5 dense bi-encoder (1024-d CLS vector) for multilingual retrieval. Runs on Apple Silicon via MLX.

  • Weights: 377 MB (affine, group_size=64, 8-bit)
  • Encode throughput: 13919 tokens/sec (M-series, this benchmark)
  • Pooling / scoring: CLS token, cosine similarity
  • Prompts: query: / document: (required — trained with them)
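
To make the "affine, group_size=64, 8-bit" label concrete, here is a minimal pure-Python sketch of group-wise affine quantization: each group of 64 weights gets its own scale and bias, and each weight is stored as an 8-bit integer code. The helper names (`quantize_group`, `dequantize_group`, `bits_per_weight`) are illustrative and are not MLX APIs. The 16-bit scale and bias assumed in the storage estimate is also an assumption; MLX's actual kernels and packing may differ in detail.

```python
GROUP_SIZE = 64
BITS = 8

def quantize_group(weights, bits=BITS):
    """Map one group of floats to integer codes in [0, 2**bits - 1] plus (scale, bias)."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo  # bias is the group minimum

def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats: w ~= code * scale + bias."""
    return [c * scale + bias for c in codes]

def bits_per_weight(bits=BITS, group_size=GROUP_SIZE, meta_bits=16):
    """Effective storage per weight, assuming one 16-bit scale and bias per group."""
    return bits + 2 * meta_bits / group_size

# Toy group of 64 weights (made-up values, not taken from the model).
group = [((i * 37) % 64 - 32) / 100.0 for i in range(GROUP_SIZE)]
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print(f"max abs reconstruction error: {max_err:.6f}")
print(f"effective bits per weight: {bits_per_weight()}")
```

The reconstruction error per weight is bounded by half a quantization step (`scale / 2`), which is why small groups with a narrow value range lose very little precision. Any tensors left unquantized add to the file size on top of this per-weight estimate.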

Retrieval quality — English NanoBEIR (4 datasets, 50 queries each)

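| Dataset      | NDCG@10 (fp16) | NDCG@10 (int8) | Recall@10 (fp16) | Recall@10 (int8) |
|--------------|----------------|----------------|------------------|------------------|
| NanoNQ       | 0.7135         | 0.7090         | 0.8000           | 0.8000           |
| NanoFiQA2018 | 0.5632         | 0.5655         | 0.6443           | 0.6543           |
| NanoSciFact  | 0.7461         | 0.7475         | 0.8500           | 0.8500           |
| NanoNFCorpus | 0.3645         | 0.3643         | 0.1583           | 0.1583           |
| Mean         | 0.5968         | 0.5966         | 0.6131           | 0.6156           |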

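Mean NDCG@10 retention vs fp16: 100.0% to one decimal place (fp16 0.5968 → int8 0.5966, a drop of 0.0002). fp16 baseline encode: 14496 tok/s, so the int8 model's 13919 tok/s is about 4% slower in this benchmark.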
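
The retention and throughput figures can be re-derived from the per-dataset numbers reported in this card. The sketch below only does that arithmetic; the variable names are illustrative.

```python
# (fp16, int8) NDCG@10 per dataset, copied from the results above.
ndcg10 = {
    "NanoNQ": (0.7135, 0.7090),
    "NanoFiQA2018": (0.5632, 0.5655),
    "NanoSciFact": (0.7461, 0.7475),
    "NanoNFCorpus": (0.3645, 0.3643),
}

fp16_mean = sum(fp16 for fp16, _ in ndcg10.values()) / len(ndcg10)
int8_mean = sum(int8 for _, int8 in ndcg10.values()) / len(ndcg10)
retention = 100 * int8_mean / fp16_mean

# Encode throughput in tokens/sec, also from this card.
int8_tok_s, fp16_tok_s = 13919, 14496
throughput_ratio = 100 * int8_tok_s / fp16_tok_s

print(f"mean NDCG@10 fp16={fp16_mean:.4f} int8={int8_mean:.4f}")
print(f"retention: {retention:.1f}%")
print(f"int8 throughput relative to fp16: {throughput_ratio:.1f}%")
```

Per-dataset deltas go in both directions (NanoNQ drops slightly, NanoFiQA2018 and NanoSciFact rise slightly), which is consistent with noise on 50 queries per dataset rather than a systematic loss.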

These numbers come from a fixed 4-dataset English NanoBEIR subset and measure only the quality delta between the int8 and fp16 weights. They do not cover the full multilingual suite; see the base model card for published numbers.
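
For readers unfamiliar with the two metrics, here is a minimal sketch of NDCG@10 and Recall@10 using binary relevance. This is the common textbook formulation, not necessarily the exact harness used for the table above (the evaluation code is not part of this card), and the document IDs in the example are made up.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k ranks (rank 0 has discount log2(2) = 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevant, k=10):
    """NDCG@k with binary relevance: DCG of the ranking divided by the ideal DCG."""
    gains = [1.0 if doc_id in relevant else 0.0 for doc_id in ranked_ids]
    ideal = [1.0] * min(len(relevant), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg else 0.0

def recall_at_k(ranked_ids, relevant, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

# Hypothetical ranking for one query: relevant docs land at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
ndcg = ndcg_at_k(ranked, relevant)
recall = recall_at_k(ranked, relevant)
print(f"NDCG@10={ndcg:.4f} Recall@10={recall:.4f}")
```

NDCG rewards placing relevant documents near the top, while Recall@10 only counts how many of the relevant documents made the top 10, so the two can move independently.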

Usage (MLX)

# pip install mlx mlx-lm transformers
# This repo bundles `mlx_lfm2_encoder.py` — the bidirectional LFM2 encoder
# (CLS pooling / ColBERT MaxSim) that the stock causal LFM2 loaders do NOT provide.
import mlx.core as mx
from transformers import AutoTokenizer
from mlx_lfm2_encoder import load_model

tok = AutoTokenizer.from_pretrained(".", trust_remote_code=True)
model, _ = load_model(".", head="embedding")     # head: "embedding" or "colbert"

# Asymmetric prompts (REQUIRED — the model was trained with them):
def encode(texts, prefix):
    enc = tok([prefix + t for t in texts], return_tensors="np", padding=True,
              truncation=True, max_length=512)
    out = model(mx.array(enc["input_ids"]), mx.array(enc["attention_mask"]))
    mx.eval(out)
    return out  # (B, 1024) CLS-pooled, L2-normalized

q = encode(["was the nightmare before christmas a disney film"], "query: ")
d = encode(["The Nightmare Before Christmas is a 1993 stop-motion film ..."], "document: ")
scores = q @ d.T  # (num_queries, num_docs); equals cosine similarity because outputs are L2-normalized
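
As a follow-up to the usage snippet, the sketch below shows how the two heads would typically be scored. It uses small hand-written vectors instead of model outputs so it runs without MLX. `rank_documents` covers the CLS embedding case (dot product of L2-normalized vectors), and `maxsim` is the usual ColBERT late-interaction score. Both function names are illustrative, and the assumption that the "colbert" head returns one vector per token is not confirmed by this card.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (no-op guard for the zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, score) pairs sorted by descending similarity."""
    scores = [dot(query_vec, vec) for vec in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]

def maxsim(query_tokens, doc_tokens):
    """ColBERT-style score: for each query token take its best doc token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy 3-d stand-ins for the 1024-d embeddings.
query = l2_normalize([1.0, 0.0, 0.0])
docs = [l2_normalize(v) for v in ([0.0, 1.0, 0.0], [1.0, 0.1, 0.0], [0.5, 0.5, 0.0])]
ranked = rank_documents(query, docs)
for idx, score in ranked:
    print(f"doc {idx}: {score:.4f}")
```

With real outputs, `q @ d.T` from the snippet above computes the same scores in one matrix product; the explicit loop here is only for readability.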

Why a bundled loader?

These are bidirectional encoders (non-causal attention + non-causal short-conv + CLS pooling). General-purpose causal LFM2 loaders produce wrong embeddings here, so this repo ships mlx_lfm2_encoder.py (validated to cosine ≥ 0.999 against the original transformers model). For the broader MLX embedding ecosystem see mlx-embeddings.
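
The attention part of this mismatch can be illustrated with plain boolean masks. The sketch below counts how many tokens a pooled position can see under a causal mask versus a bidirectional mask with padding. It assumes the CLS token sits at position 0, which this card does not state, and it ignores the short-conv layers, where the same causal-versus-non-causal distinction applies.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when token i may attend to token j (only j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def bidirectional_mask(attention_mask):
    """Every token may attend to every non-padding token, left or right."""
    n = len(attention_mask)
    return [[bool(attention_mask[j]) for j in range(n)] for _ in range(n)]

def visible_count(mask, position):
    """Number of tokens the given position can attend to."""
    return sum(mask[position])

attention_mask = [1, 1, 1, 1, 0]  # 4 real tokens followed by 1 padding token
cls_position = 0                  # assumption: CLS is the first token

causal_view = visible_count(causal_mask(len(attention_mask)), cls_position)
bidir_view = visible_count(bidirectional_mask(attention_mask), cls_position)
print(f"causal: CLS sees {causal_view} token(s)")
print(f"bidirectional: CLS sees {bidir_view} token(s)")
```

Under the causal mask a first-position token attends only to itself, so a CLS vector produced that way carries no information about the rest of the input, whereas the bidirectional mask lets it see every real token.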

License

Inherits the LFM Open License v1.0 (lfm1.0) from the base model.
