metadata

license: apache-2.0
language:
  - en
tags:
  - ColBERT
  - passage-retrieval
datasets:
  - ms_marco

Trained by Jina AI.

Jina-ColBERT

Jina-ColBERT is a ColBERT-style model built on JinaBERT, so it supports both an 8k context length and fast, accurate retrieval.

JinaBERT is a BERT architecture that uses the symmetric bidirectional variant of ALiBi to support longer sequence lengths. The Jina-ColBERT model is trained on the MSMARCO passage ranking dataset, following a training procedure very similar to that of ColBERTv2. The only difference is that we use jina-bert-v2-base-en as the backbone instead of bert-base-uncased.
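To make the ALiBi idea concrete, here is a minimal sketch of a symmetric bias matrix for a single attention head. It assumes the bias takes the form -slope * |i - j| and that the per-head slopes follow the geometric sequence from the original ALiBi paper. Whether JinaBERT uses exactly these slope values is an assumption, and the function names are illustrative only.

```python
from typing import List


def alibi_slopes(num_heads: int) -> List[float]:
    """Per-head slopes as in the ALiBi paper (power-of-two head counts only)."""
    if num_heads <= 0 or (num_heads & (num_heads - 1)) != 0:
        raise ValueError("this sketch only handles a power-of-two number of heads")
    # Geometric sequence: 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def symmetric_alibi_bias(seq_len: int, slope: float) -> List[List[float]]:
    """Bias added to the attention logits of one head: -slope * |i - j|."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(8)
bias = symmetric_alibi_bias(4, slopes[0])
for row in bias:
    print(row)
```

Because the penalty depends only on the distance between positions, nothing in it is tied to a fixed maximum length, which is the property that lets the model handle longer sequences.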

For more information about ColBERT, please refer to the ColBERTv1 and ColBERTv2 papers and the original code.

Usage

We strongly recommend using this model in the same way as the original ColBERT.

Installation

To use this model, install the latest version of the ColBERT repository. Older versions of the ColBERT code may not support models that use custom code, which causes an assertion error:

pip install git+https://github.com/stanford-futuredata/ColBERT.git torch
conda install -c conda-forge faiss-gpu  # use conda to install the latest version of faiss
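After installing, a quick check like the following can confirm that the packages are importable before you start indexing. This is only a sketch: `missing_packages` is a hypothetical helper, and it verifies that the top-level modules `colbert`, `torch`, and `faiss` can be found, not that their versions are compatible.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level packages from `names` that cannot be located."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(["colbert", "torch", "faiss"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```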

Indexing

from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

n_gpu: int = 1  # Set your number of available GPUs
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # The name of your index, i.e. the name of your vector database

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
        config = ColBERTConfig(
          doc_maxlen=8192  # Our model supports 8k context length for indexing long documents
        )
        indexer = Indexer(
          checkpoint="jinaai/jina-colbert-v1-en",
          config=config,
        )
        documents = [
          "ColBERT is an efficient and effective passage retrieval model.",
          "Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support both 8k context length.",
          "JinaBERT is a BERT architecture that supports the symmetric bidirectional variant of ALiBi to allow longer sequence length.",
          "Jina-ColBERT model is trained on MSMARCO passage ranking dataset, following a very similar training procedure with ColBERTv2.",
          "Jina-ColBERT achieves the competitive retrieval performance with ColBERTv2.",
          "Jina is an easier way to build neural search systems.",
          "You can use Jina-ColBERT to build neural search systems with ease.",
          # Add more documents here to ensure the clustering works correctly
        ]
        indexer.index(name=index_name, collection=documents)

Creating Vectors

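from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

# Reuse the `documents` list from the indexing example above
ckpt = Checkpoint("jinaai/jina-colbert-v1-en", colbert_config=ColBERTConfig(root="experiments"))
queries = ckpt.queryFromText(["What does ColBERT do?", "This is a search query?"], bsize=16)
# docFromText returns a tuple; its first element holds the document embeddings
document_vectors = ckpt.docFromText(documents, bsize=32)[0]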

A complete working Colab notebook is available here.
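The token-level vectors are compared with ColBERT's late-interaction (MaxSim) scoring: each query token is matched to its most similar document token, and the maxima are summed. The sketch below illustrates this in plain Python with made-up 2-dimensional vectors. It ignores padding and masking, so it is a simplified illustration and not the library's implementation.

```python
from typing import Sequence

Vector = Sequence[float]


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Sum over query tokens of the best dot product with any document token."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(a * b for a, b in zip(q, d)) for d in doc_vecs)
    return total


# Toy unit-length token embeddings (illustrative values only)
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.6, 0.8]]
doc_b = [[0.0, -1.0]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a > score_b)  # doc_a matches both query tokens better
```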

Searching

from colbert import Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig

n_gpu: int = 0
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # Name of your previously created index where the documents you want to search are stored.
k: int = 10  # how many results you want to retrieve

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
        config = ColBERTConfig(
          query_maxlen=128  # Although the model supports 8k context length, we advise against very long queries, which can significantly increase computation and CUDA memory usage.
        )
        searcher = Searcher(
          index=index_name, 
          config=config
        )  # You don't need to specify the checkpoint again, the model name is stored in the index.
        query = "How to use ColBERT for indexing long documents?"
        results = searcher.search(query, k=k)
        # results: a tuple of three lists with up to k entries each: (passage_ids, passage_ranks, passage_scores)
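To work with the search output, it can help to turn it into one record per passage. The sketch below assumes that `searcher.search` returns three parallel sequences (passage ids, ranks, scores), which matches the upstream ColBERT `Searcher` as far as we know. Verify this against your installed version. `rows_from_results` is a hypothetical helper and the example values are made up.

```python
from typing import Dict, List, Sequence, Tuple, Union

Results = Tuple[Sequence[int], Sequence[int], Sequence[float]]


def rows_from_results(results: Results) -> List[Dict[str, Union[int, float]]]:
    """Convert three parallel sequences into one dict per retrieved passage."""
    passage_ids, ranks, scores = results
    return [
        {"passage_id": pid, "rank": rank, "score": score}
        for pid, rank, score in zip(passage_ids, ranks, scores)
    ]


# Made-up output standing in for searcher.search(query, k=3)
example_results = ([12, 3, 7], [1, 2, 3], [24.5, 21.0, 19.25])
rows = rows_from_results(example_results)
for row in rows:
    print(row)
```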

Evaluation Results

TL;DR: Jina-ColBERT achieves retrieval performance competitive with ColBERTv2 on all benchmarks and outperforms ColBERTv2 on datasets where documents have longer context lengths.

In-domain benchmarks

We evaluate the in-domain performance on the dev subset of the MSMARCO passage ranking dataset. We follow the same evaluation settings as the ColBERTv2 paper and rerun the results of ColBERTv2 using the released checkpoint.

Model	MRR@10	Recall@50	Recall@1k
ColBERTv2	39.7	86.8	97.6
Jina-ColBERT-v1	39.0	85.6	96.2
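For readers unfamiliar with these metrics, the following sketch shows the usual definitions of MRR@k and Recall@k, averaged over queries. The table reports the values multiplied by 100. The exact evaluation script is not reproduced here, so treat this as an illustration with made-up rankings and not as the code behind the numbers.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for qid, ranked in rankings.items():
        gold = relevant.get(qid, set())
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in gold:
                total += 1.0 / rank
                break
    return total / len(rankings)


def recall_at_k(rankings: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int) -> float:
    """Mean fraction of each query's relevant passages found in the top k."""
    total = 0.0
    for qid, ranked in rankings.items():
        gold = relevant.get(qid, set())
        if gold:
            total += len(gold & set(ranked[:k])) / len(gold)
    return total / len(rankings)


# Made-up rankings and relevance judgments
rankings = {"q1": ["p3", "p1", "p9"], "q2": ["p5", "p6", "p7"]}
relevant = {"q1": {"p1"}, "q2": {"p7", "p8"}}

print(mrr_at_k(rankings, relevant, k=10))
print(recall_at_k(rankings, relevant, k=3))
```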

Out-of-domain benchmarks

Following ColBERTv2, we evaluate the out-of-domain performance on 13 public BEIR datasets and use NDCG@10 as the main metric. We follow the same evaluation settings as the ColBERTv2 paper and rerun the results of ColBERTv2 using the released checkpoint.
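As a reminder of what NDCG@10 measures, here is a minimal sketch for a single query. It assumes linear gains (the relevance grade itself) and a log2(rank + 1) discount. Evaluation toolkits can differ in such details, so this is an illustration with made-up judgments and not the exact evaluation code.

```python
import math
from typing import Dict, List, Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """DCG of the top-k ranking divided by the DCG of the ideal ranking."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Made-up graded relevance judgments for one query
relevance = {"d1": 2, "d2": 1}
print(ndcg_at_k(["d3", "d1", "d2"], relevance))
```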

Note that both ColBERTv2 and Jina-ColBERT-v1 use only the MSMARCO passage ranking dataset for training, so the results below are fully zero-shot.

Dataset	ColBERTv2	Jina-ColBERT-v1
ArguAna	46.5	49.4
ClimateFEVER	18.1	19.6
DBPedia	45.2	41.3
FEVER	78.8	79.5
FiQA	35.4	36.8
HotPotQA	67.5	65.6
NFCorpus	33.7	33.8
NQ	56.1	54.9
Quora	85.5	82.3
SCIDOCS	15.4	16.9
SciFact	68.9	70.1
TREC-COVID	72.6	75.0
Webis-touché2020	26.0	27.0
Average	50.0	50.2
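The Average row can be reproduced from the per-dataset scores with a few lines of Python. The values are copied from the table above, and the snippet also counts the datasets on which Jina-ColBERT-v1 scores higher.

```python
# (ColBERTv2, Jina-ColBERT-v1) NDCG@10 per dataset, copied from the table
scores = {
    "ArguAna": (46.5, 49.4),
    "ClimateFEVER": (18.1, 19.6),
    "DBPedia": (45.2, 41.3),
    "FEVER": (78.8, 79.5),
    "FiQA": (35.4, 36.8),
    "HotPotQA": (67.5, 65.6),
    "NFCorpus": (33.7, 33.8),
    "NQ": (56.1, 54.9),
    "Quora": (85.5, 82.3),
    "SCIDOCS": (15.4, 16.9),
    "SciFact": (68.9, 70.1),
    "TREC-COVID": (72.6, 75.0),
    "Webis-touché2020": (26.0, 27.0),
}

colbertv2_avg = sum(c for c, _ in scores.values()) / len(scores)
jina_avg = sum(j for _, j in scores.values()) / len(scores)
jina_wins = sum(1 for c, j in scores.values() if j > c)

print(round(colbertv2_avg, 1), round(jina_avg, 1), jina_wins)
```

Jina-ColBERT-v1 scores higher on 9 of the 13 datasets, while the averages differ by only 0.2 points.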

Long context datasets

We also evaluate the zero-shot performance on datasets where documents have longer context lengths and compare against some long-context embedding models. Here we use the LoCo benchmark, which contains 5 datasets with long context lengths.

Model	Used context length	Model max context length	Avg. NDCG@10
ColBERTv2	512	512	74.3
Jina-ColBERT-v1 (truncated)	512*	8192	75.5
Jina-ColBERT-v1	8192	8192	83.7
Jina-embeddings-v2-base-en	8192	8192	85.4

* indicates that documents are truncated to a length of 512; the query length is still 512.

To summarize, Jina-ColBERT achieves performance comparable to ColBERTv2 on all benchmarks and outperforms ColBERTv2 on datasets where documents have longer context lengths.

Plans

We will evaluate the performance of Jina-ColBERT as a reranker in a retrieval pipeline and add usage examples.
We plan to improve the performance of Jina-ColBERT by fine-tuning on more datasets in the future.

Other Models

Additionally, we provide the following embedding models, which you can also use for retrieval.

jina-embeddings-v2-base-en: 137 million parameters.
jina-embeddings-v2-base-zh: 161 million parameters Chinese-English bilingual model.
jina-embeddings-v2-base-de: 161 million parameters German-English bilingual model.
jina-embeddings-v2-base-es: 161 million parameters Spanish-English bilingual model (soon).

Contact

Join our Discord community and chat with other community members about ideas.