LFM2.5-ColBERT-350M-fp16 (MLX, fp16)
MLX fp16 (unquantized) conversion of LiquidAI/LFM2.5-ColBERT-350M,
a bidirectional LFM2.5 late-interaction model (128-d per-token vectors, MaxSim scoring) for multilingual retrieval. Runs on Apple Silicon via MLX.
- Weights: 707 MB (fp16, unquantized)
- Encode throughput: 13,149 tokens/sec (Apple M-series, measured in this benchmark)
- Pooling / scoring: per-token Dense(1024→128), MaxSim late interaction
- Prompts: `[Q]` / `[D]` markers (required; the model was trained with them)
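The pooling bullet above can be pictured with a shape-level sketch. The weights below are random stand-ins rather than the model's, `project_tokens` is a hypothetical helper name, and the L2 normalization step is an assumption borrowed from common ColBERT-style practice, not something this card states.

```python
import numpy as np


def project_tokens(hidden, weight, normalize=True):
    """Map per-token hidden states (L, 1024) to per-token vectors (L, 128).

    `weight` stands in for the Dense(1024→128) head. Normalization is an
    assumption made for illustration only.
    """
    vecs = hidden @ weight  # (L, 128): one vector per token, no pooling over tokens
    if normalize:
        norms = np.linalg.norm(vecs, axis=1, keepdims=True)
        vecs = vecs / np.clip(norms, 1e-12, None)
    return vecs


rng = np.random.default_rng(0)
hidden = rng.standard_normal((7, 1024)).astype(np.float32)    # 7 tokens of hidden state
weight = rng.standard_normal((1024, 128)).astype(np.float32)  # hypothetical head weights
token_vecs = project_tokens(hidden, weight)
print(token_vecs.shape)
```

The point of the sketch is that the output keeps one 128-d vector per token, which is what MaxSim later consumes.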
Retrieval quality — English NanoBEIR (4 datasets, 50 queries each)
| Dataset | NDCG@10 | Recall@10 |
|---|---|---|
| NanoNQ | 0.7861 | 0.8500 |
| NanoFiQA2018 | 0.5968 | 0.6664 |
| NanoSciFact | 0.7948 | 0.9100 |
| NanoNFCorpus | 0.3998 | 0.1608 |
| Mean | 0.6444 | 0.6468 |
Benchmarked on a fixed 4-dataset English NanoBEIR subset, chosen to measure the quality delta of quantized variants against fp16. This is not the full multilingual suite; see the base model card for published numbers.
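For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how NDCG@10 and Recall@10 are commonly computed for a single query. The evaluation harness used for the numbers above is not stated in this card, so the linear-gain formulation and the helper names `ndcg_at_k` and `recall_at_k` are assumptions.

```python
import math


def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k with linear gain: DCG of the ranking divided by the ideal DCG.

    relevance maps doc_id -> graded relevance; ids not listed count as 0.
    """
    dcg = sum(relevance.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_ids, relevance, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    relevant = {doc_id for doc_id, rel in relevance.items() if rel > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


# Toy query: two relevant documents, only one of them retrieved (at rank 2).
ranked = ["d3", "d1", "d7"]
qrels = {"d1": 1, "d2": 1}
ndcg = ndcg_at_k(ranked, qrels)
recall = recall_at_k(ranked, qrels)
print(ndcg, recall)
```

Per-dataset scores like those in the table are then the average of these per-query values over the dataset's queries.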
Usage (MLX)
```python
# pip install mlx mlx-lm transformers
# This repo bundles `mlx_lfm2_encoder.py`: the bidirectional LFM2 encoder
# (CLS pooling / ColBERT MaxSim) that the stock causal LFM2 loaders do NOT provide.
import mlx.core as mx
import numpy as np
from transformers import AutoTokenizer

from mlx_lfm2_encoder import load_model

# "." assumes you run this from inside the downloaded repo directory.
tok = AutoTokenizer.from_pretrained(".", trust_remote_code=True)
model, _ = load_model(".", head="colbert")  # head: "embedding" or "colbert"


# ColBERT late interaction: per-token 128-d vectors + MaxSim
def encode(texts, prefix):
    enc = tok([prefix + t for t in texts], return_tensors="np", padding=True,
              truncation=True, max_length=512)
    out = model(mx.array(enc["input_ids"]), mx.array(enc["attention_mask"]))
    mx.eval(out)  # force MLX's lazy computation before converting to NumPy
    arr = np.array(out.astype(mx.float32))
    mask = enc["attention_mask"].astype(bool)
    # Drop padding positions so each item keeps only its real tokens.
    return [arr[i][mask[i]] for i in range(arr.shape[0])]  # list of (Li, 128)


q = encode(["who wrote hamlet"], "[Q] ")[0]
d = encode(["Hamlet is a tragedy written by William Shakespeare ..."], "[D] ")[0]
maxsim = (q @ d.T).max(axis=1).sum()  # late-interaction score
```
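To rank several documents for one query, the same MaxSim expression can be applied per document and the scores sorted. The sketch below uses tiny hand-written vectors in place of real `encode(...)` outputs so it runs without the model; `maxsim_score` and `rank_documents` are hypothetical helper names, not part of the bundled loader.

```python
import numpy as np


def maxsim_score(q, d):
    """Late-interaction score for q: (Lq, dim) and d: (Ld, dim).

    For each query token, take its best similarity over all document
    tokens, then sum those maxima over the query tokens.
    """
    return float((q @ d.T).max(axis=1).sum())


def rank_documents(q, docs):
    """Return (doc_index, score) pairs sorted by descending MaxSim score."""
    scores = [maxsim_score(q, d) for d in docs]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# Toy 2-d token vectors standing in for real 128-d encoder outputs.
query_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_vecs = [
    np.array([[-1.0, 0.0], [0.0, -1.0]]),            # matches neither query token
    np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]),  # matches both query tokens
    np.array([[1.0, 0.0]]),                          # matches only the first token
]
ranking = rank_documents(query_vecs, doc_vecs)
print(ranking)
```

With real data, `query_vecs` would be one item from `encode(queries, "[Q] ")` and `doc_vecs` the list returned by `encode(documents, "[D] ")`.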
Why a bundled loader?
These are bidirectional encoders (non-causal attention + non-causal short-conv +
per-token MaxSim). General-purpose LFM2 loaders assume a causal model and produce
wrong embeddings here, so this repo ships `mlx_lfm2_encoder.py` (validated to cosine similarity ≥ 0.999
against the original transformers model). For the broader MLX embedding ecosystem, see
mlx-embeddings.
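A parity check of the kind mentioned above could look like the following sketch. The card does not say how the cosine figure was measured (per token or pooled, minimum or mean), so taking the minimum per-token cosine is an assumption, `min_token_cosine` is a hypothetical helper, and the arrays are synthetic stand-ins for the two implementations' outputs.

```python
import numpy as np


def min_token_cosine(a, b, eps=1e-12):
    """Smallest per-token cosine similarity between two (L, dim) outputs
    produced by two implementations for the same input."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float((num / np.maximum(den, eps)).min())


rng = np.random.default_rng(0)
reference = rng.standard_normal((5, 128))                     # stand-in for the reference output
candidate = reference + 1e-3 * rng.standard_normal((5, 128))  # small numerical drift
unrelated = rng.standard_normal((5, 128))                     # stand-in for a mismatched loader

parity = min_token_cosine(reference, candidate)
mismatch = min_token_cosine(reference, unrelated)
print(parity >= 0.999, mismatch >= 0.999)
```

Using the minimum rather than the mean makes the check fail if even one token position diverges, which is the stricter choice for a per-token model.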
License
Inherits the LFM Open License v1.0 (lfm1.0) from the base model.
Model tree for sahilchachra/LFM2.5-ColBERT-350M-fp16
Base model
LiquidAI/LFM2.5-350M-Base