license: cc-by-4.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- ColBERT
- passage-retrieval
Trained by Jina AI.
JinaColBERT V2: your multilingual late interaction retriever!
JinaColBERT V2 (jina-colbert-v2
) is a new model based on the JinaColBERT V1 that expands on the capabilities and performance of the jina-colbert-v1-en
model. Like the previous release, it has Jina AI’s 8192 token input context and the improved efficiency, performance, and explainability of token-level embeddings and late interaction.
This new release adds new functionality and performance improvements:
- Multilingual support for dozens of languages, with strong performance on major global languages.
- Matryoshka embeddings, which allow users to trade between efficiency and precision flexibly.
- Superior retrieval performance when compared to the English-only
jina-colbert-v1-en
.
JinaColBERT V2 offers three different versions for different embeddings dimensions:
jinaai/jina-colbert-v2
: 128 dimension embeddings
jinaai/jina-colbert-v2-96
: 96 dimension embeddings
jinaai/jina-colbert-v2-64
: 64 dimension embeddings
Usage
Installation
jina-colbert-v2
is trained with flash attention and therefore requires einops
and flash_attn
to be installed.
To use the model, you could either use the Standford ColBERT library or use the ragatouille
package that we provide.
pip install -U einops flash_attn
pip install -U ragatouille
pip install -U colbert-ai
RAGatouille
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v2")
docs = [
"ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
"Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support both 8k context length, fast and accurate retrieval.",
]
RAG.index(docs, index_name="demo")
query = "What does ColBERT do?"
results = RAG.search(query)
Stanford ColBERT
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint
ckpt = Checkpoint("jinaai/jina-colbert-v2", colbert_config=ColBERTConfig())
docs = [
"ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
"Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support both 8k context length, fast and accurate retrieval.",
]
query_vectors = ckpt.queryFromText(docs, bsize=2)
Evaluation Results
Retrieval Benchmarks
BEIR
NDCG@10 | jina-colbert-v2 | jina-colbert-v1 | ColBERTv2.0 | BM25 |
---|---|---|---|---|
avg | 0.531 | 0.502 | 0.496 | 0.440 |
nfcorpus | 0.346 | 0.338 | 0.337 | 0.325 |
fiqa | 0.408 | 0.368 | 0.354 | 0.236 |
trec-covid | 0.834 | 0.750 | 0.726 | 0.656 |
arguana | 0.366 | 0.494 | 0.465 | 0.315 |
quora | 0.887 | 0.823 | 0.855 | 0.789 |
scidocs | 0.186 | 0.169 | 0.154 | 0.158 |
scifact | 0.678 | 0.701 | 0.689 | 0.665 |
webis-touche | 0.274 | 0.270 | 0.260 | 0.367 |
dbpedia-entity | 0.471 | 0.413 | 0.452 | 0.313 |
fever | 0.805 | 0.795 | 0.785 | 0.753 |
climate-fever | 0.239 | 0.196 | 0.176 | 0.213 |
hotpotqa | 0.766 | 0.656 | 0.675 | 0.603 |
nq | 0.640 | 0.549 | 0.524 | 0.329 |
MS MARCO Passage Retrieval
MRR@10 | jina-colbert-v2 | jina-colbert-v1 | ColBERTv2.0 | BM25 |
---|---|---|---|---|
MSMARCO | 0.396 | 0.390 | 0.397 | 0.187 |
Multilingual Benchmarks
MIRACLE
NDCG@10 | jina-colbert-v2 | mDPR (zero shot) |
---|---|---|
avg | 0.627 | 0.427 |
ar | 0.753 | 0.499 |
bn | 0.750 | 0.443 |
de | 0.504 | 0.490 |
es | 0.538 | 0.478 |
en | 0.570 | 0.394 |
fa | 0.563 | 0.480 |
fi | 0.740 | 0.472 |
fr | 0.541 | 0.435 |
hi | 0.600 | 0.383 |
id | 0.547 | 0.272 |
ja | 0.632 | 0.439 |
ko | 0.671 | 0.419 |
ru | 0.643 | 0.407 |
sw | 0.499 | 0.299 |
te | 0.742 | 0.356 |
th | 0.772 | 0.358 |
yo | 0.623 | 0.396 |
zh | 0.523 | 0.512 |
mMARCO
MRR@10 | jina-colbert-v2 | BM-25 | ColBERT-XM |
---|---|---|---|
avg | 0.313 | 0.141 | 0.254 |
ar | 0.272 | 0.111 | 0.195 |
de | 0.331 | 0.136 | 0.270 |
nl | 0.330 | 0.140 | 0.275 |
es | 0.341 | 0.158 | 0.285 |
fr | 0.335 | 0.155 | 0.269 |
hi | 0.309 | 0.134 | 0.238 |
id | 0.319 | 0.149 | 0.263 |
it | 0.337 | 0.153 | 0.265 |
ja | 0.276 | 0.141 | 0.241 |
pt | 0.337 | 0.152 | 0.276 |
ru | 0.298 | 0.124 | 0.251 |
vi | 0.287 | 0.136 | 0.226 |
zh | 0.302 | 0.116 | 0.246 |
Matryoshka Representation Benchmarks
BEIR
NDCG@10 | dim=128 | dim=96 | dim=64 |
---|---|---|---|
avg | 0.599 | 0.591 | 0.589 |
nfcorpus | 0.346 | 0.340 | 0.347 |
fiqa | 0.408 | 0.404 | 0.404 |
trec-covid | 0.834 | 0.808 | 0.805 |
hotpotqa | 0.766 | 0.764 | 0.756 |
nq | 0.640 | 0.640 | 0.635 |
MSMARCO
MRR@10 | dim=128 | dim=96 | dim=64 |
---|---|---|---|
msmarco | 0.396 | 0.391 | 0.388 |
Other Models
Additionally, we provide the following embedding models, you can also use them for retrieval.
jina-embeddings-v2-base-en
: 137 million parameters.jina-embeddings-v2-base-zh
: 161 million parameters Chinese-English bilingual model.jina-embeddings-v2-base-de
: 161 million parameters German-English bilingual model.jina-embeddings-v2-base-es
: 161 million parameters Spanish-English bilingual model.jina-reranker-v2
: multilingual reranker model.jina-clip-v1
: English multimodal (text-image) embedding model.
Contact
Join our Discord community and chat with other community members about ideas.