bwang0911 committed on
Commit 059e114
Parent(s): 2343159

Update README.md

Files changed (1): README.md (+18 -18)

README.md CHANGED
@@ -22,7 +22,7 @@ datasets:
 
 # Jina-ColBERT
 
- ### Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support both _8k context length_ and _fast and accurate retrieval_.
+ **Jina-ColBERT is a ColBERT-style model based on JinaBERT, so it supports both _8k context length_ and _fast and accurate retrieval_.**
 
 [JinaBERT](https://arxiv.org/abs/2310.19923) is a BERT architecture that supports the symmetric bidirectional variant of [ALiBi](https://arxiv.org/abs/2108.12409) to allow longer sequence lengths. The Jina-ColBERT model is trained on the MS MARCO passage ranking dataset, following a training procedure very similar to that of ColBERTv2. The only difference is that we use `jina-bert-v2-base-en` as the backbone instead of `bert-base-uncased`.
 
@@ -30,11 +30,9 @@ For more information about ColBERT, please refer to the [ColBERTv1](https://arxi
 
 ## Usage
 
- We strongly recommend following the same usage as the original ColBERT to use this model.
-
 ### Installation
 
- To use this model, you will need to install the **latest version** of the ColBERT repository (if not the latest version the ColBERT code may not support models that use the custom code and cause an assertion error):
+ To use this model, you will need to install the **latest version** of the ColBERT repository:
 
 ```bash
 pip install git+https://github.com/stanford-futuredata/ColBERT.git torch
@@ -73,18 +71,6 @@ if __name__ == "__main__":
 indexer.index(name=index_name, collection=documents)
 ```
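The hunk above shows only the tail of the README's indexing example. For orientation, a minimal end-to-end sketch in the spirit of the ColBERT API; the experiment name, index name, and two-passage collection below are illustrative assumptions, not part of this commit:

```python
# Minimal indexing sketch using the ColBERT API (names are illustrative).
from colbert import Indexer
from colbert.infra import ColBERTConfig, Run, RunConfig

if __name__ == "__main__":
    # Assumed sample collection; in practice this is your passage list or TSV.
    documents = [
        "ColBERT is an efficient and effective passage retrieval model.",
        "Jina-ColBERT supports input sequences of up to 8192 tokens.",
    ]
    index_name = "jina-colbert-demo.index"  # assumed index name

    with Run().context(RunConfig(nranks=1, experiment="jina-colbert")):
        # doc_maxlen leverages the 8k context window; nbits controls residual compression.
        config = ColBERTConfig(doc_maxlen=8192, nbits=2)
        indexer = Indexer(checkpoint="jinaai/jina-colbert-v1-en", config=config)
        indexer.index(name=index_name, collection=documents)
```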
 
- ### Creating Vectors
-
-
- ```python
- from colbert.modeling.checkpoint import Checkpoint
- ckpt = Checkpoint("jinaai/jina-colbert-v1-en", colbert_config=ColBERTConfig(root="experiments"))
- queries = ckpt.queryFromText(["What does ColBERT do?", "This is a search query?"], bsize=16)
- document_vectors = ckpt.docFromText(documents, bsize=32)[0]
- ```
-
- Complete working Colab Notebook is [here](https://colab.research.google.com/drive/1-5WGEYPSBNBg-Z0bGFysyvckFuM8imrg)
-
 ### Searching
 
 ```python
@@ -110,6 +96,20 @@ if __name__ == "__main__":
 # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
 ```
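The searching hunk is likewise truncated here. A minimal sketch of the search step against the index built above, reusing the same illustrative index and experiment names; `Searcher.search` returns parallel lists of passage ids, ranks, and scores:

```python
# Minimal search sketch using the ColBERT API (names are illustrative).
from colbert import Searcher
from colbert.infra import ColBERTConfig, Run, RunConfig

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=1, experiment="jina-colbert")):
        config = ColBERTConfig(query_maxlen=32)  # standard ColBERT query length
        searcher = Searcher(index="jina-colbert-demo.index", config=config)

        # search() returns three parallel lists: passage ids, ranks, and scores.
        passage_ids, ranks, scores = searcher.search("What does ColBERT do?", k=10)
        for pid, rank, score in zip(passage_ids, ranks, scores):
            print(f"rank={rank}  pid={pid}  score={score:.2f}")
```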
 
+
+ ### Creating Vectors
+
+ ```python
+ from colbert.infra import ColBERTConfig  # import added so this snippet runs standalone
+ from colbert.modeling.checkpoint import Checkpoint
+
+ ckpt = Checkpoint("jinaai/jina-colbert-v1-en", colbert_config=ColBERTConfig(root="experiments"))
+ query_vectors = ckpt.queryFromText(["What does ColBERT do?", "This is a search query?"], bsize=16)
+ print(query_vectors)
+ ```
+
+ A complete working Colab notebook is available [here](https://colab.research.google.com/drive/1-5WGEYPSBNBg-Z0bGFysyvckFuM8imrg).
+
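The earlier revision of this section also encoded documents. Under the same setup, the document side would look like this (a sketch reusing `ckpt` from the block above and an assumed `documents` list of passage strings):

```python
# Document-side encoding, as shown in the earlier revision of this section.
# `documents` is assumed to be a list of passage strings.
document_vectors = ckpt.docFromText(documents, bsize=32)[0]
```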
 ## Evaluation Results
 
 **TL;DR:** Our Jina-ColBERT achieves retrieval performance competitive with [ColBERTv2](https://huggingface.co/colbert-ir/colbertv2.0) on all benchmarks, and outperforms ColBERTv2 on datasets where documents have longer context.
@@ -164,7 +164,7 @@ We also evaluate the zero-shot performance on datasets where documents have long
 ## Plans
 
 - We will evaluate the performance of Jina-ColBERT as a reranker in a retrieval pipeline, and add usage examples.
- - We are planning to improve the performance of Jina-ColBERT by fine-tuning on more datasets in the future!
+ - We are planning to improve the performance of Jina-ColBERT by fine-tuning on more datasets in the future.
 
 ## Other Models
 
@@ -173,7 +173,7 @@ Additionally, we provide the following embedding models, you can also use them f
 - [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
 - [`jina-embeddings-v2-base-zh`](https://huggingface.co/jinaai/jina-embeddings-v2-base-zh): 161 million parameters, Chinese-English bilingual model.
 - [`jina-embeddings-v2-base-de`](https://huggingface.co/jinaai/jina-embeddings-v2-base-de): 161 million parameters, German-English bilingual model.
- - [`jina-embeddings-v2-base-es`](): 161 million parameters Spanish-English bilingual model (soon).
+ - [`jina-embeddings-v2-base-es`](https://huggingface.co/jinaai/jina-embeddings-v2-base-es): 161 million parameters, Spanish-English bilingual model.
 
 ## Contact