---
license: apache-2.0
language:
- en
tags:
- ColBERT
- passage-retrieval
datasets:
- ms_marco
---

<br><br>

<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>

<p align="center">
<b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>

# Jina-ColBERT

### Jina-ColBERT is a ColBERT-style model built on JinaBERT, so it supports both _8k context length_ and _fast and accurate retrieval_.

[JinaBERT](https://arxiv.org/abs/2310.19923) is a BERT architecture that supports the symmetric bidirectional variant of [ALiBi](https://arxiv.org/abs/2108.12409) to allow longer sequence lengths. The Jina-ColBERT model is trained on the MSMARCO passage ranking dataset, following a training procedure very similar to that of ColBERTv2. The only difference is that we use `jina-bert-v2-base-en` as the backbone instead of `bert-base-uncased`.

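To give a rough feel for what a symmetric ALiBi bias looks like, here is a minimal pure-Python sketch. It assumes the geometric head slopes described in the ALiBi paper and a penalty proportional to the absolute token distance `|i - j|`; the function names are hypothetical, and the exact slopes and implementation inside JinaBERT may differ.

```python
def alibi_slopes(num_heads: int) -> list:
    """Geometric per-head slopes (assumes num_heads is a power of two)."""
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def symmetric_alibi_bias(seq_len: int, slope: float) -> list:
    """Bias added to the attention logits of one head.

    Entry [i][j] is -slope * |i - j|, so tokens that are far apart are
    penalized equally in both directions (the symmetric variant).
    """
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(8)                    # one slope per attention head
bias = symmetric_alibi_bias(4, slopes[0])   # 4x4 bias matrix for the first head
```

Because the bias depends only on relative distance rather than on learned position embeddings, the same rule applies to any sequence length, which is what makes longer inputs possible.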

For more information about ColBERT, please refer to the [ColBERTv1](https://arxiv.org/abs/2004.12832) and [ColBERTv2](https://arxiv.org/abs/2112.01488v3) papers and [the original code](https://github.com/stanford-futuredata/ColBERT).

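As a quick illustration of the late-interaction scoring described in the ColBERT papers, the sketch below computes a MaxSim score in plain Python: for each query token embedding, take the maximum dot product over all document token embeddings, then sum over query tokens. The vectors are toy values and `maxsim_score` is a hypothetical helper, not part of the ColBERT library.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the best-matching document token similarity."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional token embeddings (made up for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # contains exact matches for both query tokens
doc_b = [[0.6, 0.8]]                          # only a partial match for each query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Here `doc_a` scores higher than `doc_b` because every query token finds a closer document token.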
## Usage

We strongly recommend using this model the same way as the original ColBERT.

### Installation

To use this model, install the latest version of the ColBERT repository:

```bash
pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
```

### Indexing

```python
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

n_gpu: int = 1  # Set your number of available GPUs
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # The name of your index, i.e. the name of your vector database

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    config = ColBERTConfig(doc_maxlen=8192)  # Our model supports 8k context length for indexing long documents
    indexer = Indexer(
        checkpoint="jinaai/jina-colbert-v1-en",
        config=config,
    )
    documents = [
        "ColBERT is an efficient and effective passage retrieval model.",
        "Jina-ColBERT is a ColBERT-style model based on JinaBERT, so it supports both 8k context length and fast, accurate retrieval.",
        # ... add the rest of your documents here
    ]
    indexer.index(name=index_name, collection=documents)
```

### Searching

```python
from colbert import Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig

n_gpu: int = 0  # Number of ranks passed to RunConfig
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # Name of your previously created index where the documents you want to search are stored
k: int = 10  # How many results you want to retrieve

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    # Although the model supports 8k context length, we suggest avoiding very long queries,
    # as they can significantly increase computation and CUDA memory usage.
    config = ColBERTConfig(query_maxlen=128)
    searcher = Searcher(
        index=index_name,
        config=config,
    )  # You don't need to specify the checkpoint again; the model name is stored in the index.
    query = "How to use ColBERT for indexing long documents?"
    results = searcher.search(query, k=k)
    # results: tuple of three parallel lists of length k: (passage_ids, passage_ranks, passage_scores)
```

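The search output can be turned into readable rows with a small helper. This is a sketch that assumes `searcher.search` returns three parallel sequences (passage ids, ranks, scores), as in the upstream ColBERT examples that iterate over `zip(*results)`. The helper name `rank_table` and the sample values are hypothetical stand-ins so the snippet runs without an index.

```python
def rank_table(results, collection):
    """Pair each retrieved passage id with its rank, score, and text."""
    passage_ids, ranks, scores = results
    return [
        {"rank": rank, "passage_id": pid, "score": score, "text": collection[pid]}
        for pid, rank, score in zip(passage_ids, ranks, scores)
    ]


# Hypothetical stand-in for `searcher.search(query, k=2)`; scores are made up.
example_results = ([1, 0], [1, 2], [27.5, 19.25])
example_collection = [
    "ColBERT is an efficient and effective passage retrieval model.",
    "Jina-ColBERT supports long documents.",
]

rows = rank_table(example_results, example_collection)
for row in rows:
    print(f"[{row['rank']}] {row['score']:.2f} {row['text']}")
```

With a real index, `collection` would be the list of documents you passed to `indexer.index`, so passage ids map back to the original texts.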
## Evaluation Results

**TL;DR:** Jina-ColBERT achieves retrieval performance competitive with [ColBERTv2](https://huggingface.co/colbert-ir/colbertv2.0) on all benchmarks and outperforms ColBERTv2 on datasets whose documents have longer context lengths.

### In-domain benchmarks

We evaluate in-domain performance on the dev subset of the MSMARCO passage ranking dataset. We follow the evaluation settings of the ColBERTv2 paper and rerun ColBERTv2 using its released checkpoint.

| Model | MRR@10 | Recall@50 | Recall@1k |
| --- | :---: | :---: | :---: |
| ColBERTv2 | 39.7 | 86.8 | 97.6 |
| Jina-ColBERT-v1 | 39.0 | 85.6 | 96.2 |

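For readers unfamiliar with the metrics in the table, here is a minimal sketch of how MRR@10 and Recall@k are commonly computed per query and then averaged. It is not the evaluation script used for the numbers above; the function names and the toy rankings are hypothetical, and the table reports values multiplied by 100.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


# Toy rankings and relevance judgments (made up for illustration).
runs = {"q1": [3, 7, 1], "q2": [5, 2, 9]}
qrels = {"q1": {7}, "q2": {9, 4}}

mean_mrr = sum(mrr_at_k(runs[q], qrels[q]) for q in runs) / len(runs)
mean_recall = sum(recall_at_k(runs[q], qrels[q], k=3) for q in runs) / len(runs)
```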
### Out-of-domain benchmarks

Following ColBERTv2, we evaluate out-of-domain performance on 13 public BEIR datasets, using NDCG@10 as the main metric. We follow the evaluation settings of the ColBERTv2 paper and rerun ColBERTv2 using its released checkpoint.

Note that both ColBERTv2 and Jina-ColBERT-v1 are trained only on the MSMARCO passage ranking dataset, so the results below are fully zero-shot.

| Dataset | ColBERTv2 | Jina-ColBERT-v1 |
| --- | :---: | :---: |
| ArguAna | 46.5 | 49.4 |
| ClimateFEVER | 18.1 | 19.6 |
| DBPedia | 45.2 | 41.3 |
| FEVER | 78.8 | 79.5 |
| FiQA | 35.4 | 36.8 |
| HotPotQA | 67.5 | 65.6 |
| NFCorpus | 33.7 | 33.8 |
| NQ | 56.1 | 54.9 |
| Quora | 85.5 | 82.3 |
| SCIDOCS | 15.4 | 16.9 |
| SciFact | 68.9 | 70.1 |
| TREC-COVID | 72.6 | 75.0 |
| Webis-touché2020 | 26.0 | 27.0 |
| Average | 50.0 | 50.2 |

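As a reference for the metric used in this table, the following sketch computes NDCG@10 for a single query. It assumes linear gains and a log2 position discount, which is a common convention; it is not the evaluation code behind the reported numbers, and the function names and toy judgments are hypothetical.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain: position i (0-based) is discounted by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """DCG of the ranking divided by the DCG of an ideal ordering."""
    gains = [relevance.get(pid, 0) for pid in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy relevance judgments: documents "a" and "b" are relevant (made up for illustration).
relevance = {"a": 1, "b": 1}
perfect = ndcg_at_k(["a", "b", "x"], relevance)
imperfect = ndcg_at_k(["a", "x", "b"], relevance)
```

A ranking that places all relevant documents first scores 1.0, and pushing a relevant document further down lowers the score.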
### Long context datasets

We also evaluate zero-shot performance on datasets whose documents have longer context lengths and compare with long-context embedding models.

| Model | Avg. NDCG@10 | Model max context length | Used context length |
| --- | :---: | :---: | :---: |
| ColBERTv2 | 74.3 | 512 | 512 |
| Jina-ColBERT-v1 | 75.5 | 8192 | 512 |
| Jina-ColBERT-v1 | 83.7 | 8192 | 8192* |
| Jina-embeddings-v2-base-en | 85.4 | 8192 | 8192 |

\* denotes that we used a context length of 8192 for documents, while the query length is still 512.

**To summarize, Jina-ColBERT achieves performance comparable to ColBERTv2 on all benchmarks and outperforms ColBERTv2 on datasets whose documents have longer context lengths.**

## Plans

- We will evaluate the performance of Jina-ColBERT as a reranker in a retrieval pipeline and add usage examples.
- We plan to improve the performance of Jina-ColBERT by fine-tuning on more datasets in the future!

## Other Models

Additionally, we provide the following embedding models, which you can also use for retrieval.

- [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
- [`jina-embeddings-v2-base-zh`](https://huggingface.co/jinaai/jina-embeddings-v2-base-zh): 161 million parameters, Chinese-English bilingual model.
- [`jina-embeddings-v2-base-de`](https://huggingface.co/jinaai/jina-embeddings-v2-base-de): 161 million parameters, German-English bilingual model.
- [`jina-embeddings-v2-base-es`](): 161 million parameters, Spanish-English bilingual model (soon).

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.