LFM2.5-ColBERT-350M-int8 (MLX, affine)
MLX affine (group_size=64, 8-bit) quantization of LiquidAI/LFM2.5-ColBERT-350M,
a bidirectional LFM2.5 late-interaction model (128-d per-token vectors, MaxSim scoring) for multilingual retrieval. Runs on Apple Silicon via MLX.
- Weights: 376 MB (affine, group_size=64, 8-bit)
- Encode throughput: 12978 tokens/sec (M-series, this benchmark)
- Pooling / scoring: per-token Dense(1024→128), MaxSim late interaction
- Prompts: [Q] / [D] markers (required; the model was trained with them)
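To make the "affine, group_size=64, 8-bit" setting concrete, here is a minimal NumPy sketch of group-wise affine quantization. It only illustrates the general idea (each group of 64 weights shares one scale and one bias, so that w ≈ scale * q + bias); the function names are hypothetical, and MLX's actual packing and rounding details may differ.

```python
import numpy as np

def affine_quantize(w, group_size=64, bits=8):
    """Quantize a 1-D float array group by group so that w ≈ scale * q + bias."""
    levels = 2 ** bits - 1                       # 255 steps for 8-bit codes
    groups = w.reshape(-1, group_size)           # one row per group
    w_min = groups.min(axis=1, keepdims=True)    # per-group bias (offset)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels             # per-group step size
    scale = np.where(scale == 0, 1.0, scale)     # guard against constant groups
    q = np.round((groups - w_min) / scale).astype(np.uint8)
    return q, scale, w_min

def affine_dequantize(q, scale, bias):
    """Reconstruct approximate float weights from codes, scales and biases."""
    return q.astype(np.float32) * scale + bias

# Stand-in weights (random, NOT taken from the model).
rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)

q, scale, bias = affine_quantize(w)
w_hat = affine_dequantize(q, scale, bias).reshape(-1)
max_err = float(np.abs(w - w_hat).max())         # bounded by half a step per group
```

Under this scheme the reconstruction error of any weight is at most half a quantization step of its group. If the per-group scale and bias are stored as 16-bit values (an assumption here), the overhead is 32 bits per 64 weights, i.e. roughly 8.5 bits per weight in total.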
Retrieval quality — English NanoBEIR (4 datasets, 50 queries each)
| Dataset | NDCG@10 (fp16) | NDCG@10 (int8) | Recall@10 (fp16) | Recall@10 (int8) |
|---|---|---|---|---|
| NanoNQ | 0.7861 | 0.7841 | 0.8500 | 0.8500 |
| NanoFiQA2018 | 0.5968 | 0.6006 | 0.6664 | 0.6664 |
| NanoSciFact | 0.7948 | 0.8035 | 0.9100 | 0.9100 |
| NanoNFCorpus | 0.3998 | 0.4012 | 0.1608 | 0.1603 |
| Mean | 0.6444 | 0.6473 | 0.6468 | 0.6467 |
Mean NDCG@10 retention vs fp16: 100.5% (fp16 0.6444 → int8 0.6473). fp16 baseline encode: 13149 tok/s.
Benchmarked on a fixed 4-dataset English NanoBEIR subset to measure the quantization quality delta vs fp16. This is not the full multilingual suite; see the base model card for published numbers.
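For readers who want to check the table, the following sketch shows how NDCG@10 and Recall@10 can be computed per query, and how the retention figure follows from the per-dataset NDCG@10 values. It assumes binary relevance and uses hypothetical function names and a toy ranking; the actual evaluation harness behind the table is not specified in this card.

```python
import math

def ndcg_at_k(ranked_ids, relevant, k=10):
    """Binary-relevance NDCG@k for a single query."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k])
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)           # best case: all hits ranked first
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids, relevant, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

# Toy query: the single relevant document is ranked second.
toy_ndcg = ndcg_at_k(["d3", "d1", "d7"], {"d1"})
toy_recall = recall_at_k(["d3", "d1", "d7"], {"d1"})

# Retention recomputed from the NDCG@10 columns of the table above.
fp16 = [0.7861, 0.5968, 0.7948, 0.3998]
int8 = [0.7841, 0.6006, 0.8035, 0.4012]
retention_pct = (sum(int8) / len(int8)) / (sum(fp16) / len(fp16)) * 100
```

Dataset-level scores are the mean of the per-query values, and the "Mean" row is the unweighted mean over the four datasets.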
Usage (MLX)
# pip install mlx mlx-lm transformers
# This repo bundles `mlx_lfm2_encoder.py`: the bidirectional LFM2 encoder
# (CLS pooling / ColBERT MaxSim) that the stock causal LFM2 loaders do NOT provide.
# Run from the downloaded repo directory so that "." resolves to the model files.
import mlx.core as mx
import numpy as np
from transformers import AutoTokenizer
from mlx_lfm2_encoder import load_model

tok = AutoTokenizer.from_pretrained(".", trust_remote_code=True)
model, _ = load_model(".", head="colbert")  # head: "embedding" or "colbert"

# ColBERT late interaction: per-token 128-d vectors + MaxSim
def encode(texts, prefix):
    # Prepend the required [Q]/[D] marker, then pad the batch to a common length.
    enc = tok([prefix + t for t in texts], return_tensors="np", padding=True,
              truncation=True, max_length=512)
    out = model(mx.array(enc["input_ids"]), mx.array(enc["attention_mask"]))
    mx.eval(out)  # force MLX's lazy computation before converting to NumPy
    arr = np.array(out.astype(mx.float32))
    mask = enc["attention_mask"].astype(bool)
    # Drop padded positions: one (Li, 128) array per input text.
    return [arr[i][mask[i]] for i in range(arr.shape[0])]

q = encode(["who wrote hamlet"], "[Q] ")[0]
d = encode(["Hamlet is a tragedy written by William Shakespeare ..."], "[D] ")[0]
maxsim = (q @ d.T).max(axis=1).sum()  # late-interaction score
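The single-pair MaxSim score above extends naturally to ranking several documents for one query. The sketch below uses random, L2-normalised stand-in vectors instead of model output so that it runs without the weights; `maxsim_score`, `rank_documents` and `l2_normalize` are hypothetical helper names, and whether the model's own vectors are already normalised is not stated in this card.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def maxsim_score(q, d):
    """q: (Lq, dim), d: (Ld, dim). Sum over query tokens of the best doc-token match."""
    return float((q @ d.T).max(axis=1).sum())

def rank_documents(q, docs):
    """Return document indices sorted by descending MaxSim, plus the raw scores."""
    scores = [maxsim_score(q, d) for d in docs]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Stand-in per-token vectors (random, NOT produced by the model).
rng = np.random.default_rng(0)
q = l2_normalize(rng.standard_normal((4, 128)))
docs = [l2_normalize(rng.standard_normal((n, 128))) for n in (12, 30, 7)]
# A fourth document that contains every query vector: each query token
# finds a perfect match, so its score equals the number of query tokens.
docs.append(np.concatenate([q, docs[0]]))

order, scores = rank_documents(q, docs)
```

With real embeddings, `q` would come from `encode([...], "[Q] ")` and each entry of `docs` from `encode([...], "[D] ")`. Document vectors can be encoded once and cached, since only the query side changes at search time.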
Why a bundled loader?
These models are bidirectional encoders: attention and the short convolutions are both non-causal, and the output is per-token vectors scored with MaxSim. General-purpose LFM2 loaders assume the causal variant and therefore produce
wrong embeddings here, so this repo ships mlx_lfm2_encoder.py (validated to cosine similarity ≥ 0.999
against the original transformers model). For the broader MLX embedding ecosystem, see
mlx-embeddings.
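The effect of loading a bidirectional encoder with causal masking can be illustrated with a toy example. The sketch below is not the LFM2 architecture: it is a single attention layer with identity projections on random inputs, with hypothetical function names, and it only shows why per-token outputs diverge when future tokens are masked. The cosine check mirrors the kind of comparison the card reports for the bundled loader.

```python
import numpy as np

def self_attention(x, causal):
    """Single-head attention with identity projections. x: (L, dim)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    if causal:
        # Mask out positions to the right of each token (its "future").
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def row_cosine(a, b):
    """Cosine similarity between matching rows of a and b."""
    num = (a * b).sum(axis=1)
    return num / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

# Stand-in token representations (random, NOT model activations).
rng = np.random.default_rng(0)
x = 0.5 * rng.standard_normal((8, 16))

bidirectional = self_attention(x, causal=False)
causal = self_attention(x, causal=True)
cos = row_cosine(bidirectional, causal)
```

In this toy setup only the last token, which sees the full sequence in both modes, is unchanged; the first token sees only itself under the causal mask, so its vector drifts furthest. Since MaxSim compares every token vector, such drift affects the score directly.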
License
Inherits the LFM Open License v1.0 (lfm1.0) from the base model.