bwang0911 committed on
Commit 059e114
Parent(s): 2343159

Update README.md

Files changed (1): README.md (+18 -18)

README.md CHANGED
@@ -22,7 +22,7 @@ datasets:
 
 # Jina-ColBERT
 
- ### Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support both _8k context length_ and _fast and accurate retrieval_.
+ **Jina-ColBERT is a ColBERT-style model based on JinaBERT, so it supports both _8k context length_ and _fast and accurate retrieval_.**
 
 [JinaBERT](https://arxiv.org/abs/2310.19923) is a BERT architecture that supports the symmetric bidirectional variant of [ALiBi](https://arxiv.org/abs/2108.12409) to allow longer sequence lengths. The Jina-ColBERT model is trained on the MS MARCO passage ranking dataset, following a training procedure very similar to that of ColBERTv2. The only difference is that we use `jina-bert-v2-base-en` as the backbone instead of `bert-base-uncased`.
 
@@ -30,11 +30,9 @@ For more information about ColBERT, please refer to the [ColBERTv1](https://arxi
 
 ## Usage
 
- We strongly recommend following the same usage as the original ColBERT to use this model.
-
 ### Installation
 
- To use this model, you will need to install the **latest version** of the ColBERT repository (if not the latest version the ColBERT code may not support models that use the custom code and cause an assertion error):
+ To use this model, you will need to install the **latest version** of the ColBERT repository:
 
 ```bash
 pip install git+https://github.com/stanford-futuredata/ColBERT.git torch
@@ -73,18 +71,6 @@ if __name__ == "__main__":
 indexer.index(name=index_name, collection=documents)
 ```
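The hunk above shows only the tail of the README's indexing example. For orientation, a minimal end-to-end sketch in the spirit of the ColBERT API; the experiment name, index name, and two-passage collection below are illustrative assumptions, not part of this commit:

```python
# Minimal indexing sketch using the ColBERT API (names are illustrative).
from colbert import Indexer
from colbert.infra import ColBERTConfig, Run, RunConfig

if __name__ == "__main__":
    # Assumed sample collection; in practice this is your passage list or TSV.
    documents = [
        "ColBERT is an efficient and effective passage retrieval model.",
        "Jina-ColBERT supports input sequences of up to 8192 tokens.",
    ]
    index_name = "jina-colbert-demo.index"  # assumed index name

    with Run().context(RunConfig(nranks=1, experiment="jina-colbert")):
        # doc_maxlen leverages the 8k context window; nbits controls residual compression.
        config = ColBERTConfig(doc_maxlen=8192, nbits=2)
        indexer = Indexer(checkpoint="jinaai/jina-colbert-v1-en", config=config)
        indexer.index(name=index_name, collection=documents)
```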
 
- ### Creating Vectors
-
-
- ```python
- from colbert.modeling.checkpoint import Checkpoint
- ckpt = Checkpoint("jinaai/jina-colbert-v1-en", colbert_config=ColBERTConfig(root="experiments"))
- queries = ckpt.queryFromText(["What does ColBERT do?", "This is a search query?"], bsize=16)
- document_vectors = ckpt.docFromText(documents, bsize=32)[0]
- ```
-
- Complete working Colab Notebook is [here](https://colab.research.google.com/drive/1-5WGEYPSBNBg-Z0bGFysyvckFuM8imrg)
-
 ### Searching
 
 ```python
@@ -110,6 +96,20 @@ if __name__ == "__main__":
 # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
 ```
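The searching hunk is likewise truncated here. A minimal sketch of the search step against the index built above, reusing the same illustrative index and experiment names; `Searcher.search` returns parallel lists of passage ids, ranks, and scores:

```python
# Minimal search sketch using the ColBERT API (names are illustrative).
from colbert import Searcher
from colbert.infra import ColBERTConfig, Run, RunConfig

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=1, experiment="jina-colbert")):
        config = ColBERTConfig(query_maxlen=32)  # standard ColBERT query length
        searcher = Searcher(index="jina-colbert-demo.index", config=config)

        # search() returns three parallel lists: passage ids, ranks, and scores.
        passage_ids, ranks, scores = searcher.search("What does ColBERT do?", k=10)
        for pid, rank, score in zip(passage_ids, ranks, scores):
            print(f"rank={rank}  pid={pid}  score={score:.2f}")
```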
 
+
+ ### Creating Vectors
+
+ ```python
+ from colbert.infra import ColBERTConfig  # import added so this snippet runs standalone
+ from colbert.modeling.checkpoint import Checkpoint
+
+ ckpt = Checkpoint("jinaai/jina-colbert-v1-en", colbert_config=ColBERTConfig(root="experiments"))
+ query_vectors = ckpt.queryFromText(["What does ColBERT do?", "This is a search query?"], bsize=16)
+ print(query_vectors)
+ ```
+
+ A complete working Colab notebook is available [here](https://colab.research.google.com/drive/1-5WGEYPSBNBg-Z0bGFysyvckFuM8imrg).
+
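The earlier revision of this section also encoded documents. Under the same setup, the document side would look like this (a sketch reusing `ckpt` from the block above and an assumed `documents` list of passage strings):

```python
# Document-side encoding, as shown in the earlier revision of this section.
# `documents` is assumed to be a list of passage strings.
document_vectors = ckpt.docFromText(documents, bsize=32)[0]
```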
 ## Evaluation Results
 
 **TL;DR:** Our Jina-ColBERT achieves retrieval performance competitive with [ColBERTv2](https://huggingface.co/colbert-ir/colbertv2.0) on all benchmarks, and outperforms ColBERTv2 on datasets where documents have longer context.
@@ -164,7 +164,7 @@ We also evaluate the zero-shot performance on datasets where documents have long
 ## Plans
 
 - We will evaluate the performance of Jina-ColBERT as a reranker in a retrieval pipeline, and add usage examples.
- - We are planning to improve the performance of Jina-ColBERT by fine-tuning on more datasets in the future!
+ - We are planning to improve the performance of Jina-ColBERT by fine-tuning on more datasets in the future.
 
 ## Other Models
 
@@ -173,7 +173,7 @@ Additionally, we provide the following embedding models, you can also use them f
 - [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
 - [`jina-embeddings-v2-base-zh`](https://huggingface.co/jinaai/jina-embeddings-v2-base-zh): 161 million parameters, Chinese-English bilingual model.
 - [`jina-embeddings-v2-base-de`](https://huggingface.co/jinaai/jina-embeddings-v2-base-de): 161 million parameters, German-English bilingual model.
- - [`jina-embeddings-v2-base-es`](): 161 million parameters Spanish-English bilingual model (soon).
+ - [`jina-embeddings-v2-base-es`](https://huggingface.co/jinaai/jina-embeddings-v2-base-es): 161 million parameters, Spanish-English bilingual model.
 
 ## Contact