---
license: apache-2.0
language:
- en
tags:
- ColBERT
- passage-retrieval
datasets:
- ms_marco
---
<br><br>
<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>
<p align="center">
<b>Trained by <a href="https://jina.ai/">Jina AI</a>.</b>
</p>
# Jina-ColBERT
### Jina-ColBERT is a ColBERT-style model based on JinaBERT, supporting both an _8k context length_ and _fast and accurate retrieval_.
[JinaBERT](https://arxiv.org/abs/2310.19923) is a BERT architecture that supports the symmetric bidirectional variant of [ALiBi](https://arxiv.org/abs/2108.12409) to allow longer sequence lengths. The Jina-ColBERT model is trained on the MS MARCO passage ranking dataset, following a training procedure very similar to that of ColBERTv2. The only difference is that we use `jina-bert-v2-base-en` as the backbone instead of `bert-base-uncased`.
For more information about ColBERT, please refer to the [ColBERTv1](https://arxiv.org/abs/2004.12832) and [ColBERTv2](https://arxiv.org/abs/2112.01488v3) papers and [the original code](https://github.com/stanford-futuredata/ColBERT).
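As background, ColBERT scores a query against a document via late interaction: each query token embedding is matched against its most similar document token embedding, and these maximum similarities are summed (MaxSim). The sketch below only illustrates that scoring rule; it is not the model's actual implementation, and the shapes, names, and random inputs are assumptions made for the example.
```python
import torch

def maxsim_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """Illustrative late-interaction (MaxSim) score.

    q_emb: (num_query_tokens, dim) L2-normalized query token embeddings
    d_emb: (num_doc_tokens, dim)   L2-normalized document token embeddings
    """
    sim = q_emb @ d_emb.T               # similarity of every query/doc token pair
    return sim.max(dim=1).values.sum()  # best doc token per query token, summed

# Toy usage with random vectors; real embeddings come from the model.
q = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(300, 128), dim=-1)
print(maxsim_score(q, d))
```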
## Usage
We strongly recommend using this model the same way as the original ColBERT.
### Installation
To use this model, install the latest version of the ColBERT repository, along with `torch` and `faiss-gpu`:
```bash
pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
```
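If no GPU is available, installing the CPU build of FAISS is a possible alternative for searching an existing index (an untested variant, not the recommended setup above):
```bash
pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-cpu
```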
### Indexing
```python
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig
n_gpu: int = 1 # Set your number of available GPUs
experiment: str = "" # Name of the folder where the logs and created indices will be stored
index_name: str = "" # The name of your index, i.e. the name of your vector database
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    config = ColBERTConfig(doc_maxlen=8192)  # Our model supports 8k context length for indexing long documents
    indexer = Indexer(
        checkpoint="jinaai/jina-colbert-v1-en",
        config=config,
    )
    documents = [
        "ColBERT is an efficient and effective passage retrieval model.",
        "Jina-ColBERT is a ColBERT-style model based on JinaBERT, supporting an 8k context length.",
        ...
    ]
    indexer.index(name=index_name, collection=documents)
```
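With upstream ColBERT's default settings, the finished index should be written under `experiments/<experiment>/indexes/<index_name>` relative to your working directory (this path is an assumption based on the upstream defaults); pass the same `index_name` to the searcher below.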
### Searching
```python
from colbert import Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig
n_gpu: int = 0
experiment: str = "" # Name of the folder where the logs and created indices will be stored
index_name: str = "" # Name of your previously created index where the documents you want to search are stored.
k: int = 10 # how many results you want to retrieve
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    config = ColBERTConfig(query_maxlen=128)  # Although the model supports 8k context length, we suggest not using very long queries, as they add significant computational cost and CUDA memory usage.
    searcher = Searcher(
        index=index_name,
        config=config,
    )  # You don't need to specify the checkpoint again; the model name is stored in the index.
    query = "How to use ColBERT for indexing long documents?"
    results = searcher.search(query, k=k)
    # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
```
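To map the returned passage ids back to text, you can index into the collection you indexed earlier. A minimal sketch, assuming `results` unpacks into parallel lists of ids, ranks, and scores (as in the upstream ColBERT `Searcher.search` API) and that `documents` is the in-memory list from the indexing step:
```python
# Hypothetical post-processing; `documents` is the list from the indexing example.
passage_ids, ranks, scores = results
for pid, rank, score in zip(passage_ids, ranks, scores):
    print(f"#{rank} (score {score:.2f}): {documents[pid]}")
```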
## Evaluation Results
**TL;DR:** Jina-ColBERT achieves retrieval performance competitive with [ColBERTv2](https://huggingface.co/colbert-ir/colbertv2.0) on all benchmarks, and outperforms ColBERTv2 on datasets where documents are longer.
### In-domain benchmarks
We evaluate in-domain performance on the dev subset of the MS MARCO passage ranking dataset. We follow the same evaluation settings as the ColBERTv2 paper and rerun ColBERTv2 using its released checkpoint.
| Model | MRR@10 | Recall@50 | Recall@1k |
| --- | :---: | :---: | :---: |
| ColBERTv2 | 39.7 | 86.8 | 97.6 |
| Jina-ColBERT-v1 | 39.0 | 85.6 | 96.2 |
### Out-of-domain benchmarks
Following ColBERTv2, we evaluate out-of-domain performance on 13 public BEIR datasets, using NDCG@10 as the main metric. We follow the same evaluation settings as the ColBERTv2 paper and rerun ColBERTv2 using its released checkpoint.
Note that both ColBERTv2 and Jina-ColBERT-v1 are trained only on the MS MARCO passage ranking dataset, so the results below are fully zero-shot.
| Dataset | ColBERTv2 | Jina-ColBERT-v1 |
| --- | :---: | :---: |
| ArguAna | 46.5 | 49.4 |
| ClimateFEVER | 18.1 | 19.6 |
| DBPedia | 45.2 | 41.3 |
| FEVER | 78.8 | 79.5 |
| FiQA | 35.4 | 36.8 |
| HotPotQA | 67.5 | 65.6 |
| NFCorpus | 33.7 | 33.8 |
| NQ | 56.1 | 54.9 |
| Quora | 85.5 | 82.3 |
| SCIDOCS | 15.4 | 16.9 |
| SciFact | 68.9 | 70.1 |
| TREC-COVID | 72.6 | 75.0 |
| Webis-Touché2020 | 26.0 | 27.0 |
| Average | 50.0 | 50.2 |
### Long context datasets
We also evaluate zero-shot performance on datasets where documents are longer, and compare against long-context embedding models.
| Model | Avg. NDCG@10 | Model max context length (tokens) | Used context length (tokens) |
| --- | :---: | :---: | :---: |
| ColBERTv2 | 74.3 | 512 | 512 |
| Jina-ColBERT-v1 | 75.5 | 8192 | 512 |
| Jina-ColBERT-v1 | 83.7 | 8192 | 8192* |
| Jina-embeddings-v2-base-en | 85.4 | 8192 | 8192 |
\* denotes that we used a context length of 8192 for documents, while the query length remained 512.
**To summarize, Jina-ColBERT achieves performance comparable to ColBERTv2 on all benchmarks, and outperforms ColBERTv2 on datasets where documents are longer.**
## Plans
- We will evaluate the performance of Jina-ColBERT as a reranker in a retrieval pipeline and add usage examples.
- We plan to further improve the performance of Jina-ColBERT by fine-tuning on more datasets in the future!
## Other Models
Additionally, we provide the following embedding models, which you can also use for retrieval.
- [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
- [`jina-embeddings-v2-base-zh`](https://huggingface.co/jinaai/jina-embeddings-v2-base-zh): 161 million parameters, Chinese-English bilingual model.
- [`jina-embeddings-v2-base-de`](https://huggingface.co/jinaai/jina-embeddings-v2-base-de): 161 million parameters, German-English bilingual model.
- `jina-embeddings-v2-base-es`: 161 million parameters, Spanish-English bilingual model (coming soon).
## Contact
Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.