LFM2.5-ColBERT-350M-int8 (MLX, affine)

MLX affine (group_size=64, 8-bit) quantization of LiquidAI/LFM2.5-ColBERT-350M, a bidirectional LFM2.5 late-interaction model (128-d per-token vectors, MaxSim scoring) for multilingual retrieval. Runs on Apple Silicon via MLX.

Weights: 376 MB (affine, group_size=64, 8-bit)
Encode throughput: 12978 tokens/sec (M-series, this benchmark)
Pooling / scoring: per-token Dense(1024→128), MaxSim late interaction
Prompts: [Q] / [D] markers (required; the model was trained with them)
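For readers unfamiliar with the "affine (group_size=64, 8-bit)" label, the general idea can be sketched in NumPy as below. This is an illustration of group-wise affine quantization under stated assumptions, not MLX's actual kernel or storage layout: the helper names `affine_quantize` and `affine_dequantize` are hypothetical, and the rounding and clipping choices are assumed.

```python
import numpy as np

def affine_quantize(w, group_size=64, bits=8):
    """Quantize a 2-D float array row-wise, one (scale, bias) pair per group."""
    rows, cols = w.shape
    assert cols % group_size == 0, "columns must be divisible by group_size"
    g = w.reshape(rows, cols // group_size, group_size).astype(np.float32)
    w_min = g.min(axis=-1, keepdims=True)          # per-group bias (offset)
    w_max = g.max(axis=-1, keepdims=True)
    levels = 2 ** bits - 1                         # 255 steps for 8-bit
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)       # constant group: avoid divide-by-zero
    q = np.clip(np.round((g - w_min) / scale), 0, levels).astype(np.uint8)
    return q, scale, w_min

def affine_dequantize(q, scale, bias, shape):
    """Reconstruct approximate floats: w_hat = q * scale + bias."""
    return (q.astype(np.float32) * scale + bias).reshape(shape)

# Synthetic weights stand in for a real layer.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 128)).astype(np.float32)
q, scale, bias = affine_quantize(w)
w_hat = affine_dequantize(q, scale, bias, w.shape)
max_err = float(np.abs(w - w_hat).max())
print("max abs reconstruction error:", max_err)
```

The reconstruction error per weight is bounded by half the group's scale, which is why smaller groups generally track the original weights more closely. If each group's scale and bias are stored as 16-bit values (an assumption about the storage format), the cost works out to 8 + 32/64 = 8.5 bits per weight.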

Retrieval quality — English NanoBEIR (4 datasets, 50 queries each)

Dataset	NDCG@10 (fp16)	NDCG@10 (int8)	Recall@10 (fp16)	Recall@10 (int8)
NanoNQ	0.7861	0.7841	0.8500	0.8500
NanoFiQA2018	0.5968	0.6006	0.6664	0.6664
NanoSciFact	0.7948	0.8035	0.9100	0.9100
NanoNFCorpus	0.3998	0.4012	0.1608	0.1603
Mean	0.6444	0.6473	0.6468	0.6467

Mean NDCG@10 retention vs fp16: 100.5% (fp16 0.6444 → int8 0.6473). fp16 baseline encode: 13149 tok/s.

Benchmarked on a fixed 4-dataset English NanoBEIR subset to measure the quality change from quantization relative to fp16. This is not the full multilingual suite; see the base model card for published numbers.
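The mean and retention figures can be re-derived from the per-dataset NDCG@10 values in the table. The snippet below is only that arithmetic; the variable names are illustrative.

```python
# NDCG@10 per dataset, copied from the table above.
ndcg_fp16 = {"NanoNQ": 0.7861, "NanoFiQA2018": 0.5968,
             "NanoSciFact": 0.7948, "NanoNFCorpus": 0.3998}
ndcg_int8 = {"NanoNQ": 0.7841, "NanoFiQA2018": 0.6006,
             "NanoSciFact": 0.8035, "NanoNFCorpus": 0.4012}

mean_fp16 = sum(ndcg_fp16.values()) / len(ndcg_fp16)
mean_int8 = sum(ndcg_int8.values()) / len(ndcg_int8)

# Retention is the int8 mean expressed as a percentage of the fp16 mean.
retention_pct = 100.0 * mean_int8 / mean_fp16

print(f"fp16 mean: {mean_fp16:.4f}")
print(f"int8 mean: {mean_int8:.4f}")
print(f"retention: {retention_pct:.1f}%")
```

A retention slightly above 100% means the int8 mean came out marginally higher than the fp16 mean on this subset.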

Usage (MLX)

# pip install mlx mlx-lm transformers
# This repo bundles `mlx_lfm2_encoder.py` — the bidirectional LFM2 encoder
# (CLS pooling / ColBERT MaxSim) that the stock causal LFM2 loaders do NOT provide.
import mlx.core as mx
import numpy as np
from transformers import AutoTokenizer
from mlx_lfm2_encoder import load_model

tok = AutoTokenizer.from_pretrained(".", trust_remote_code=True)
model, _ = load_model(".", head="colbert")     # head: "embedding" or "colbert"

# ColBERT late interaction: per-token 128-d vectors + MaxSim
def encode(texts, prefix):
    # Prepend the [Q] / [D] marker, then pad the batch to a common length.
    enc = tok([prefix + t for t in texts], return_tensors="np", padding=True,
              truncation=True, max_length=512)
    out = model(mx.array(enc["input_ids"]), mx.array(enc["attention_mask"]))
    mx.eval(out)  # force MLX's lazy computation before converting to NumPy
    arr = np.array(out.astype(mx.float32))
    # Drop padding positions so each item keeps only its real tokens.
    mask = enc["attention_mask"].astype(bool)
    return [arr[i][mask[i]] for i in range(arr.shape[0])]  # list of (Li, 128)

q = encode(["who wrote hamlet"], "[Q] ")[0]
d = encode(["Hamlet is a tragedy written by William Shakespeare ..."], "[D] ")[0]
maxsim = (q @ d.T).max(axis=1).sum()  # late-interaction score
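The single-pair score above extends naturally to ranking several documents for one query. The sketch below uses synthetic unit-norm vectors in place of real `encode` output so it runs without the model. The helpers `maxsim`, `rank`, and `unit_rows` are hypothetical names, and the unit normalization is an assumption for the toy data, not a statement about what the bundled loader returns.

```python
import numpy as np

def maxsim(q, d):
    """q: (Lq, dim), d: (Ld, dim). For each query token take its best-matching
    document token, then sum those maxima over query tokens."""
    return float((q @ d.T).max(axis=1).sum())

def rank(q, docs):
    """Return (doc_index, score) pairs sorted by descending MaxSim score."""
    scores = [maxsim(q, d) for d in docs]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]

rng = np.random.default_rng(0)

def unit_rows(n, dim=128):
    """Random unit-norm token vectors (synthetic stand-in for encoder output)."""
    x = rng.standard_normal((n, dim)).astype(np.float32)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

q_vecs = unit_rows(5)                              # a 5-token "query"
docs = [
    unit_rows(12),                                 # unrelated document
    np.vstack([q_vecs, unit_rows(7)]),             # contains every query token
    unit_rows(9),                                  # unrelated document
]
ranking = rank(q_vecs, docs)
for idx, score in ranking:
    print(idx, round(score, 3))
```

Because the second document contains an exact match for every query token, each per-token maximum is about 1 and it ranks first. With real embeddings, `q_vecs` and `docs` would come from `encode(..., "[Q] ")` and `encode(..., "[D] ")`.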

Why a bundled loader?

These are bidirectional encoders (non-causal attention + non-causal short-conv + per-token MaxSim). General-purpose causal LFM2 loaders produce wrong embeddings here, so this repo ships mlx_lfm2_encoder.py (validated to cosine ≥ 0.999 against the original transformers model). For the broader MLX embedding ecosystem see mlx-embeddings.
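To see why a causal loader cannot reproduce a bidirectional encoder's per-token outputs, consider a toy single-head attention layer run with and without a causal mask. This is not the LFM2 architecture (which also includes short convolutions) and not the validation procedure used for this repo. It is a minimal sketch, with hypothetical helper names, of how per-token cosine similarity exposes the mismatch.

```python
import numpy as np

def attention(q, k, v, causal):
    """Single-head scaled dot-product attention over one sequence."""
    L, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)
    if causal:
        # Token i may only attend to positions j <= i.
        allowed = np.tril(np.ones((L, L), dtype=bool))
        scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

def per_token_cosine(a, b):
    """Cosine similarity between matching rows of a and b."""
    num = (a * b).sum(axis=1)
    return num / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 32)) for _ in range(3))

bidirectional = attention(q, k, v, causal=False)
causal = attention(q, k, v, causal=True)
cos = per_token_cosine(bidirectional, causal)

print("worst per-token cosine:", float(cos.min()))
print("last-token cosine:", float(cos[-1]))
```

Only the last token sees the full sequence under a causal mask, so it alone matches the bidirectional output. Earlier tokens diverge, and since MaxSim scores every token vector, those errors feed directly into retrieval scores.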

License

Inherits the LFM Open License v1.0 (lfm1.0) from the base model.
