How to use sahilchachra/LFM2.5-Embedding-350M-fp16 with MLX:
# Download the model from the Hub (the quotes keep zsh from globbing the brackets)
pip install "huggingface_hub[hf_xet]"
huggingface-cli download --local-dir LFM2.5-Embedding-350M-fp16 sahilchachra/LFM2.5-Embedding-350M-fp16
LFM2.5-Embedding-350M-fp16 (MLX, fp16)
MLX fp16 (unquantized) conversion of LiquidAI/LFM2.5-Embedding-350M,
a bidirectional LFM2.5 dense bi-encoder (1024-d CLS vector) for multilingual retrieval. Runs on Apple Silicon via MLX.
- Weights: 709 MB (fp16, unquantized)
- Encode throughput: 14496 tokens/sec (M-series, this benchmark)
- Pooling / scoring: CLS token, cosine similarity
- Prompts: `query: ` / `document: ` prefixes (required; the model was trained with them)
Retrieval quality — English NanoBEIR (4 datasets, 50 queries each)
| Dataset | NDCG@10 | Recall@10 |
|---|---|---|
| NanoNQ | 0.7135 | 0.8000 |
| NanoFiQA2018 | 0.5632 | 0.6443 |
| NanoSciFact | 0.7461 | 0.8500 |
| NanoNFCorpus | 0.3645 | 0.1583 |
| Mean | 0.5968 | 0.6131 |
Benchmarked on a fixed 4-dataset English NanoBEIR subset, the same subset used to measure the quality delta of quantized variants against fp16; this repo provides the fp16 numbers. This is not the full multilingual suite; see the base model card for published numbers.
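For readers unfamiliar with the two metrics in the table, the sketch below shows one common way to compute NDCG@10 and Recall@10 for a single query. It is an illustration only: the function names (`dcg_at_k`, `ndcg_at_k`, `recall_at_k`), the toy document ids, and the choice of linear gain with a log2 discount are assumptions, not necessarily what the evaluation script behind this card uses.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_ids, qrels, k=10):
    """NDCG@k for one query; qrels maps doc id -> graded relevance (0 = not relevant)."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(qrels.values(), reverse=True)  # best possible ordering
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def recall_at_k(ranked_ids, qrels, k=10):
    """Fraction of all relevant documents that appear in the top k results."""
    relevant = {doc_id for doc_id, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

# Toy query (hypothetical ids): one of two relevant documents retrieved, at rank 2.
ranked = ["d3", "d1", "d7"]
qrels = {"d1": 1, "d9": 1}
print(f"NDCG@10   = {ndcg_at_k(ranked, qrels):.4f}")
print(f"Recall@10 = {recall_at_k(ranked, qrels):.4f}")
```

Under this definition, Recall@10 cannot reach 1.0 for a query with more than 10 relevant documents, since at most 10 of them fit in the top 10. The per-dataset figures in the table would be averages of such per-query values.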
Usage (MLX)
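# pip install mlx mlx-lm transformers
# Run from inside the downloaded repo directory: the paths below are "." and
# `mlx_lfm2_encoder` must be importable from the working directory.
# This repo bundles `mlx_lfm2_encoder.py`, the bidirectional LFM2 encoder
# (CLS pooling / ColBERT MaxSim) that the stock causal LFM2 loaders do NOT provide.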
import mlx.core as mx
from transformers import AutoTokenizer
from mlx_lfm2_encoder import load_model
tok = AutoTokenizer.from_pretrained(".", trust_remote_code=True)
model, _ = load_model(".", head="embedding") # head: "embedding" or "colbert"
# Asymmetric prompts (REQUIRED — the model was trained with them):
def encode(texts, prefix):
    enc = tok([prefix + t for t in texts], return_tensors="np", padding=True,
              truncation=True, max_length=512)
    out = model(mx.array(enc["input_ids"]), mx.array(enc["attention_mask"]))
    mx.eval(out)
    return out  # (B, 1024) CLS-pooled, L2-normalized
q = encode(["was the nightmare before christmas a disney film"], "query: ")
d = encode(["The Nightmare Before Christmas is a 1993 stop-motion film ..."], "document: ")
scores = q @ d.T  # (num_queries, num_docs); equals cosine similarity because the embeddings are L2-normalized
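The snippet above scores one query against one document. As a rough sketch of how that extends to ranking several documents, the dependency-free example below uses small hand-written vectors in place of real embeddings. The helper names (`l2_normalize`, `cosine`, `rank_documents`) and the toy vectors are hypothetical; with real embeddings the same dot products would come from the `q @ d.T` matrix product.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity of two vectors of any length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted by descending dot product."""
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Toy 3-d stand-ins for the 1024-d embeddings (illustrative values only).
raw_query = [1.0, 2.0, 0.0]
raw_docs = [[0.0, 0.0, 3.0], [2.0, 4.0, 0.1], [1.0, 0.0, 0.0]]

query = l2_normalize(raw_query)
docs = [l2_normalize(v) for v in raw_docs]
ranking = rank_documents(query, docs)

for idx, score in ranking:
    print(f"doc {idx}: {score:.4f}")
```

Because both sides are unit-length, the dot product used for ranking matches `cosine` computed on the unnormalized vectors, which is why the usage code can treat a plain matrix product as cosine similarity.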
Why a bundled loader?
These are bidirectional encoders (non-causal attention + non-causal short-conv +
CLS pooling). General-purpose causal LFM2 loaders produce
wrong embeddings here, so this repo ships mlx_lfm2_encoder.py (validated to cosine ≥ 0.999
against the original transformers model). For the broader MLX embedding ecosystem see
mlx-embeddings.
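A parity check of the kind described above could look roughly like the following sketch. It assumes you already have two lists of embedding rows for the same inputs, one from the loader under test and one from the reference model; the function names, the toy vectors, and the use of the minimum row-wise cosine as the pass criterion are assumptions for illustration, not the validation script used for this repo.

```python
import math

def rowwise_cosine(candidate, reference):
    """Cosine similarity between corresponding rows of two embedding lists."""
    sims = []
    for a, b in zip(candidate, reference):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        sims.append(dot / (norm_a * norm_b))
    return sims

def embeddings_match(candidate, reference, threshold=0.999):
    """Return (passed, worst_cosine); passes only if every row meets the threshold."""
    if len(candidate) != len(reference):
        raise ValueError("candidate and reference must have the same number of rows")
    worst = min(rowwise_cosine(candidate, reference))
    return worst >= threshold, worst

# Toy rows standing in for real embeddings (illustrative values only).
reference = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0]]
close = [[0.9999, 0.01, 0.0], [0.6, 0.8, 0.001]]  # small numeric drift
wrong = [[0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]        # first row points elsewhere

print(embeddings_match(close, reference))
print(embeddings_match(wrong, reference))
```

Checking the worst row rather than the mean keeps a single badly wrong embedding from being averaged away, which is the failure mode a mismatched attention mask or pooling step would most likely produce.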
License
Inherits the LFM Open License v1.0 (lfm1.0) from the base model.