LFM2.5-ColBERT-350M
We release two new best-in-class multilingual retrieval models:
- LFM2.5-Embedding-350M — A dense bi-encoder, one vector per document. Smallest, fastest index.
- LFM2.5-ColBERT-350M — A late-interaction model. One vector per token, matched via MaxSim. Higher accuracy and better generalization at the cost of index size.
Both models have roughly 350M parameters and are the first bidirectional members of the LFM family, built on LFM2.5-350M-Base. They can serve as drop-in replacements for the retriever in your current RAG pipeline and target fast, cheap, and reliable multilingual and cross-lingual search across 11 languages.
Find more details about the bidirectional architecture and training recipe in our blog post.
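As a rough illustration of the MaxSim matching mentioned above: for each query token vector, take its highest similarity to any document token vector, then sum those maxima over the query tokens. The sketch below uses plain Python lists and toy 2-dimensional vectors (the model itself emits 128-dimensional vectors per token). The helper names `dot` and `maxsim` are illustrative and not part of any library.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vectors, doc_vectors):
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


# Toy token embeddings (hypothetical values, 2 dimensions for readability).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # contains an exact match for each query token
doc_b = [[0.6, 0.8]]                          # only a partial match for each query token

score_a = maxsim(query, doc_a)  # 1.0 + 1.0
score_b = maxsim(query, doc_b)  # 0.6 + 0.8
print(score_a, score_b)
```

Because every query token is matched independently, a document scores well when it covers each part of the query somewhere in its text, which is the intuition behind the accuracy gain over a single pooled vector.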
💻 Demo: https://huggingface.co/spaces/LiquidAI/colbert-tool-selection
📄 Model details
| Property | LFM2.5-ColBERT-350M | LFM2.5-Embedding-350M |
|---|---|---|
| Type | Late interaction (per-token vectors) | Dense bi-encoder (single vector) |
| Total parameters | ~353M | ~354M |
| Backbone | LFM2.5-350M-Base + bi-directional patches | LFM2.5-350M-Base + bi-directional patches |
| Layers | 17 (10 conv + 6 attn + 1 dense) | 17 (10 conv + 6 attn + 1 pool) |
| Context length | 32,768 tokens | 32,768 tokens |
| Vocabulary size | 64,402 | 65,536 |
| Output | 128-dim per token | 1024-dim CLS vector |
| Similarity | MaxSim | Cosine |
| Training precision | BF16 | BF16 |
| License | LFM Open License v1.0 | LFM Open License v1.0 |
Document length: 512 tokens
Query length: 32 tokens
Supported languages: English, Spanish, German, French, Italian, Portuguese, Arabic, Swedish, Norwegian, Japanese, Korean.
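To make the index-size trade-off in the table concrete, here is a back-of-the-envelope calculation of the raw embedding size per document. It assumes uncompressed fp16 storage (2 bytes per value) and a document that fills all 512 tokens. Both are assumptions for illustration: compressed indexes such as PLAID typically store far fewer bytes per token, and shorter documents produce fewer vectors.

```python
BYTES_PER_FP16 = 2  # assumption: embeddings stored uncompressed in fp16


def colbert_doc_bytes(num_tokens, dim=128):
    """Raw size of a late-interaction document: one dim-sized vector per token."""
    return num_tokens * dim * BYTES_PER_FP16


def dense_doc_bytes(dim=1024):
    """Raw size of a dense document embedding: a single dim-sized vector."""
    return dim * BYTES_PER_FP16


colbert_bytes = colbert_doc_bytes(num_tokens=512)  # 512 * 128 * 2 bytes
dense_bytes = dense_doc_bytes()                    # 1024 * 2 bytes
ratio = colbert_bytes // dense_bytes

print(f"ColBERT: {colbert_bytes / 1024:.0f} KiB, dense: {dense_bytes / 1024:.0f} KiB, ratio: {ratio}x")
```

Under these assumptions a full-length document takes 128 KiB as per-token vectors versus 2 KiB as a single vector, which is why the dense model yields the smaller index.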
Architecture:
ColBERT(
  (0): Transformer({'max_seq_length': 511, 'do_lower_case': False}) with Transformer model: Lfm2BidirectionalModel
  (1): Dense({'in_features': 1024, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
We recommend LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M for short-context retrieval use cases, such as:
- E-commerce: find products across many languages with semantic search at scale.
- FAQ and support knowledge bases: retrieve the right answer reliably across customer-facing surfaces.
- On-device semantic search: search files, emails, and notes locally on consumer hardware.
- Enterprise knowledge assistants: retrieve internal legal, financial, and technical documents across languages.
🏃 How to run
First, install PyLate (this also installs transformers as a dependency):
pip install -U pylate
Retrieval
Use this model with PyLate to index and retrieve documents. The index uses FastPLAID for efficient similarity search.
Indexing documents
Load LFM2.5-ColBERT-350M and initialize the PLAID index, then encode and index your documents:
from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model (trust_remote_code applies the bidirectional patches)
model = models.ColBERT(
    model_name_or_path="LiquidAI/LFM2.5-ColBERT-350M",
    trust_remote_code=True,
)
model.tokenizer.pad_token = model.tokenizer.eos_token

# Step 2: Initialize the PLAID index
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index, if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Must be False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:
# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
)
Retrieving top-k documents for queries
Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries. To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the IDs and relevance scores of the top matches:
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Must be True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
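The sketch below shows one way to consume the retrieval result. It assumes the result holds one list per query, each containing dictionaries with `id` and `score` keys, which is how PyLate documents its retrieval output. The sample values and the helper `best_match_ids` are made up for illustration.

```python
# Hypothetical output for two queries; the scores are made-up placeholders.
scores = [
    [{"id": "3", "score": 11.2}, {"id": "1", "score": 10.4}, {"id": "2", "score": 7.5}],
    [{"id": "1", "score": 10.9}, {"id": "2", "score": 8.1}, {"id": "3", "score": 7.0}],
]


def best_match_ids(results):
    """Return the ID of the highest-scoring document per query (None if no hits)."""
    return [
        max(hits, key=lambda hit: hit["score"])["id"] if hits else None
        for hits in results
    ]


top_ids = best_match_ids(scores)
for query_number, doc_id in enumerate(top_ids, start=1):
    print(f"query {query_number}: best document id = {doc_id}")
```

Handling the empty-hits case explicitly keeps the helper safe when a query returns no matches.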
Reranking
If you only want to use LFM2.5-ColBERT-350M to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the rank function and pass the queries and documents to rerank:
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

# One list of candidate documents per query
documents = [
    ["document A", "document B"],
    ["document A", "document C", "document B"],
]

# IDs aligned position by position with the candidate documents above
documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="LiquidAI/LFM2.5-ColBERT-350M",
    trust_remote_code=True,
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
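Reranking returns document IDs, so a pipeline usually needs to map them back to the candidate texts. The sketch below does this with a per-query lookup table. It assumes the reranked output is a list (one entry per query) of lists of dictionaries with `id` and `score` keys, as PyLate documents. The candidate texts, the sample scores, and the helper `attach_texts` are hypothetical.

```python
# Toy candidates in the same nested shape as the reranking inputs (hypothetical texts).
candidate_texts = [
    ["text about Mars", "text about Venus"],
    ["text about Mars", "text about Jupiter", "text about Venus"],
]
candidate_ids = [
    [1, 2],
    [1, 3, 2],
]

# Hypothetical reranked output; the scores are made-up placeholders.
reranked = [
    [{"id": 2, "score": 9.1}, {"id": 1, "score": 8.3}],
    [{"id": 3, "score": 9.8}, {"id": 1, "score": 8.9}, {"id": 2, "score": 7.2}],
]


def attach_texts(reranked_hits, ids, texts):
    """Pair each reranked hit with its text, preserving the reranked order."""
    output = []
    for hits, query_ids, query_texts in zip(reranked_hits, ids, texts):
        id_to_text = dict(zip(query_ids, query_texts))  # per-query lookup table
        output.append([(id_to_text[hit["id"]], hit["score"]) for hit in hits])
    return output


ranked_texts = attach_texts(reranked, candidate_ids, candidate_texts)
for query_number, hits in enumerate(ranked_texts, start=1):
    print(f"query {query_number}: {hits}")
```

The lookup is built per query because the same ID can appear in several candidate lists.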
📈 Performance
We highlight in bold the best bi-encoder (Type: dense) and the best late-interaction retriever (Type: late) for each language.
NanoBEIR Multilingual Extended — NDCG@10
Dataset: LiquidAI/nanobeir-multilingual-extended. This benchmark measures multilingual retrieval capabilities.
| Model | Type | AVG | ar | de | en | es | fr | it | ja | ko | no | pt | sv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LiquidAI/LFM2.5-ColBERT-350M | late | 0.605 | 0.551 | 0.606 | 0.687 | 0.607 | 0.622 | 0.606 | 0.614 | 0.590 | 0.570 | 0.613 | 0.586 |
| LiquidAI/LFM2.5-Embedding-350M | dense | 0.577 | 0.529 | 0.581 | 0.644 | 0.581 | 0.592 | 0.583 | 0.575 | 0.563 | 0.557 | 0.581 | 0.566 |
| Qwen/Qwen3-Embedding-0.6B | dense | 0.556 | 0.514 | 0.560 | 0.649 | 0.568 | 0.565 | 0.565 | 0.551 | 0.530 | 0.516 | 0.571 | 0.525 |
| LiquidAI/LFM2-ColBERT-350M | late | 0.540 | 0.491 | 0.563 | 0.661 | 0.563 | 0.564 | 0.543 | 0.557 | 0.527 | 0.449 | 0.547 | 0.480 |
| Alibaba-NLP/gte-multilingual-base | dense | 0.528 | 0.477 | 0.523 | 0.624 | 0.537 | 0.542 | 0.528 | 0.511 | 0.494 | 0.516 | 0.534 | 0.526 |
| lightonai/GTE-ModernColBERT-v1 | late | 0.489 | 0.309 | 0.499 | 0.680 | 0.525 | 0.546 | 0.516 | 0.459 | 0.368 | 0.465 | 0.530 | 0.483 |
| lightonai/LateOn | late | 0.484 | 0.307 | 0.505 | 0.690 | 0.531 | 0.537 | 0.514 | 0.442 | 0.326 | 0.465 | 0.533 | 0.475 |
| lightonai/DenseOn | dense | 0.432 | 0.178 | 0.474 | 0.676 | 0.496 | 0.520 | 0.487 | 0.378 | 0.197 | 0.422 | 0.493 | 0.433 |
| Alibaba-NLP/gte-modernbert-base | dense | 0.383 | 0.112 | 0.449 | 0.666 | 0.448 | 0.475 | 0.408 | 0.275 | 0.180 | 0.376 | 0.431 | 0.391 |
| BAAI/bge-large-en-v1.5 | dense | 0.359 | 0.059 | 0.419 | 0.642 | 0.445 | 0.475 | 0.431 | 0.198 | 0.132 | 0.358 | 0.434 | 0.353 |
MKQA-11 — Recall@20
Dataset: MKQA, restricted to the 11 languages we target. This benchmark measures cross-lingual capabilities.
| Model | Type | AVG | ar | de | en | es | fr | it | ja | ko | no | pt | sv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LiquidAI/LFM2.5-ColBERT-350M | late | 0.694 | 0.608 | 0.709 | 0.748 | 0.711 | 0.715 | 0.707 | 0.703 | 0.640 | 0.689 | 0.703 | 0.700 |
| LiquidAI/LFM2.5-Embedding-350M | dense | 0.691 | 0.610 | 0.709 | 0.738 | 0.708 | 0.715 | 0.703 | 0.685 | 0.630 | 0.691 | 0.710 | 0.708 |
| Alibaba-NLP/gte-multilingual-base | dense | 0.675 | 0.567 | 0.692 | 0.741 | 0.705 | 0.703 | 0.697 | 0.655 | 0.563 | 0.698 | 0.700 | 0.699 |
| LiquidAI/LFM2-ColBERT-350M | late | 0.646 | 0.554 | 0.696 | 0.754 | 0.711 | 0.710 | 0.667 | 0.658 | 0.558 | 0.541 | 0.669 | 0.589 |
| Qwen/Qwen3-Embedding-0.6B | dense | 0.638 | 0.520 | 0.671 | 0.723 | 0.678 | 0.672 | 0.671 | 0.635 | 0.543 | 0.620 | 0.667 | 0.620 |
| lightonai/GTE-ModernColBERT-v1 | late | 0.459 | 0.092 | 0.532 | 0.754 | 0.552 | 0.615 | 0.510 | 0.275 | 0.166 | 0.503 | 0.524 | 0.524 |
| lightonai/LateOn | late | 0.454 | 0.157 | 0.492 | 0.755 | 0.537 | 0.577 | 0.481 | 0.316 | 0.209 | 0.472 | 0.502 | 0.501 |
| lightonai/DenseOn | dense | 0.435 | 0.165 | 0.482 | 0.751 | 0.491 | 0.553 | 0.457 | 0.325 | 0.222 | 0.438 | 0.443 | 0.453 |
| BAAI/bge-large-en-v1.5 | dense | 0.413 | 0.133 | 0.471 | 0.748 | 0.450 | 0.531 | 0.461 | 0.208 | 0.172 | 0.456 | 0.443 | 0.467 |
| Alibaba-NLP/gte-modernbert-base | dense | 0.295 | 0.060 | 0.333 | 0.736 | 0.273 | 0.417 | 0.291 | 0.100 | 0.052 | 0.332 | 0.326 | 0.330 |
Inference speed - llama.cpp
End-to-end latency on a MacBook Pro M4 Max via llama.cpp at fp16, measured with 32-token queries and 256-token documents. "Docs cached" means that the document embeddings are precomputed and looked up from an index.
| Model | Stage | Docs cached | p50 | p95 |
|---|---|---|---|---|
| LFM2.5-Embedding-350M | Query embedding | yes | 7.3 ms | 9.6 ms |
| LFM2.5-ColBERT-350M | Query embedding | yes | 8.1 ms | 8.5 ms |
| LFM2.5-ColBERT-350M | Query embedding + MaxSim | yes | 8.2 ms | 15.2 ms |
| LFM2.5-ColBERT-350M | Query embedding + Doc embedding + MaxSim | no | 34.3 ms | 36.3 ms |
Both models are available on Hugging Face as GGUF builds for llama.cpp (LiquidAI/LFM2.5-ColBERT-350M-GGUF and LiquidAI/LFM2.5-Embedding-350M-GGUF) in several quantization schemes.
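For readers who want to produce comparable p50 and p95 figures on their own hardware, the sketch below times repeated calls and reports nearest-rank percentiles. It is not the harness used for the table above: the helpers `percentile` and `time_calls` are illustrative, and the timed workload is a stand-in that you would replace with your own query-embedding call.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def time_calls(fn, runs=100):
    """Call fn repeatedly and return each call's wall-clock latency in milliseconds."""
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


# Stand-in workload (hypothetical); swap in the call you want to benchmark.
latencies = time_calls(lambda: sum(range(1000)), runs=50)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
print(f"p50 = {p50:.3f} ms, p95 = {p95:.3f} ms")
```

Running a few warm-up calls before timing is usually advisable, since the first calls often include one-off setup cost.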
Inference speed - Enterprise GPU
For large-scale, production-grade enterprise deployments, we also experiment with an internal GPU stack to deliver extremely low-latency serving under high inbound load. We observe p50 query-embedding latencies as low as 1.3 ms.
| Model | Stage | p50 | p95 | p99 |
|---|---|---|---|---|
| LFM2.5-Embedding-350M | Query embedding | 1.5 ms | 1.6 ms | 1.7 ms |
| LFM2.5-ColBERT-350M | Query embedding | 1.3 ms | 1.4 ms | 1.5 ms |
| LFM2.5-ColBERT-350M | Query embedding + MaxSim | 2.5 ms | 2.7 ms | 2.8 ms |
| LFM2.5-ColBERT-350M | Query embedding + Doc embedding + MaxSim | 22.8 ms | 24.1 ms | 26.4 ms |
📬 Contact
- Got questions or want to connect? Join our Discord community.
- If you are interested in custom solutions with edge deployment, please contact our sales team.
Citation
@article{liquidai2025lfm2,
  title={LFM2 Technical Report},
  author={Liquid AI},
  journal={arXiv preprint arXiv:2511.23404},
  year={2025}
}
@misc{PyLate,
  title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
  author={Chaffin, Antoine and Sourty, Raphaël},
  url={https://github.com/lightonai/pylate},
  year={2024}
}