Aamir committed on
Commit
d60b0d4
1 Parent(s): 3a47ada

Update README.md

  1. README.md +50 -94
README.md CHANGED
@@ -1,7 +1,5 @@
1
  ---
2
  license: apache-2.0
3
- language:
4
- - en
5
  ---
6
 
7
  <br><br>
@@ -17,33 +15,23 @@ language:
17
  # mxbai-colbert-v1
18
 
19
  This is our first English ColBERT model, which is built upon our sentence embedding model [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1).
20
- You can learn more about the models in our [blog post](https://www.mixedbread.ai/blog/).
21
 
22
 
23
  ## Quickstart
24
 
25
- Currently, the best way to use it is with the official [ColBERT](https://github.com/stanford-futuredata/ColBERT) library.
26
 
27
- `python -m pip install -U colbert-ai[faiss-gpu]`
28
 
29
- Here, we provide several ways to use it.
30
 
31
- ### 1. Generate Embeddings
32
 
33
  ```python
34
- from huggingface_hub import snapshot_download
35
- from colbert.modeling.checkpoint import Checkpoint
36
- from colbert.infra import Run, RunConfig, ColBERTConfig
37
 
38
- # Make sure all model files are cached locally
39
- snapshot_download(repo_id="mixedbread-ai/mxbai-colbert-v1")
40
 
41
- # load mixedbread colbert
42
- ckpt = Checkpoint("mixedbread-ai/mxbai-colbert-v1",
43
- colbert_config=ColBERTConfig())
44
-
45
- # encode query and documents
46
- query = "Who wrote 'To Kill a Mockingbird'?"
47
  documents = [
48
  "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
49
  "The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
@@ -52,63 +40,29 @@ documents = [
52
  "The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
53
  "'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan."
54
  ]
55
- query_vectors = ckpt.queryFromText([query], bsize=16)
56
- doc_vectors = ckpt.docFromText(documents, bsize=16)
57
- ```
58
 
59
- ### 2. Index & Search
60
-
61
- 1) Index
62
-
63
- ```python
64
- from huggingface_hub import snapshot_download
65
- from colbert import Indexer
66
- from colbert.infra import Run, RunConfig, ColBERTConfig
67
 
68
- # Make sure all model files are cached locally
69
- snapshot_download(repo_id="mixedbread-ai/mxbai-colbert-v1")
70
-
71
-
72
- gpu_count = 1
73
- documents = [
74
- "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
75
- "The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
76
- "Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird', was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.",
77
- "Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.",
78
- "The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
79
- "'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan."
80
- ]
81
-
82
- with Run().context(RunConfig(nranks=gpu_count, gpus=gpu_count, experiment='experiments')):
83
-     config = ColBERTConfig(
84
-         doc_maxlen=512
85
-     )
86
-     indexer = Indexer(
87
-         checkpoint="mixedbread-ai/mxbai-colbert-v1",
88
-         config=config,
89
-     )
90
-     indexer.index(name='demo', collection=documents)
91
-
92
  ```
93
 
94
- 2) Search
95
 
96
- ```python
97
- from colbert import Searcher
98
- from colbert.infra import Run, RunConfig, ColBERTConfig
99
-
100
- gpu_count = 1
101
-
102
- with Run().context(RunConfig(nranks=1, experiment='experiments')):
103
-     config = ColBERTConfig(
104
-         query_maxlen=128
105
-     )
106
-     searcher = Searcher(
107
-         index='demo',
108
-         config=config
109
-     )
110
-     query = "Who wrote 'To Kill a Mockingbird'?"
111
-     results = searcher.search(query, k=3)
112
  ```
113
 
114
  ## Using API
@@ -119,40 +73,42 @@ You’ll be able to use the models through our API as well. The API is coming so
119
 
120
  ### 1. Reranking Performance
121
 
122
- **Setup:** we use BM25 as the first-stage retrieval model, and then use ColBERT for reranking. Following common practice, we report NDCG@10 as the metric.
123
 
124
  Here, we compare our model with two widely used ColBERT models, as follows:
125
 
126
 
127
- | Model | ColBERTv2 | Jina-ColBERT-V1 | Mxbai-ColBERT-V1|
128
- | ---------------------- | -------- | ---------- | ---------- |
129
- | dbpedia-entity | 31.8 | **42.2** | 40.6 |
130
- | fiqa | 23.6 | 35.6 | **35.9** |
131
- | nfcorpus | 33.8 | **36.7** | 36.4 |
132
- | nq | 30.6 | 51.3 | **51.4** |
133
- | scidocs | 14.9 | 15.4 | **17.0** |
134
- | scifact | 67.9 | 70.2 | **71.5** |
135
- | trec-covid | 59.5 | 75.0 | **81.0** |
136
- | webis-touche2020 | 44.2 | 32.1 | 31.7 |
137
- | signal1m | **33.2** | 30.9 | 33.1 |
138
- | trec-news | 46.0 | 45.2 | **47.1** |
139
- | robust04 | 47.5 | **47.7** | 47.5 |
140
- | avg | 39.4 | 43.8 | **44.8** |
141
-
142
- Find more in our [blog-post](https://www.mixedbread.ai/blog/) and on this [spreadsheet](https://docs.google.com/spreadsheets/d/1ZT_KN40PnHQa21hTdrk4_9GCnqm916lJJz3W83mo1og/edit?usp=sharing).
 
 
143
 
144
  ### 2. Retrieval Performance
145
 
146
- ColBERT is mainly used for reranking. Here, we also test our model's performance on retrieval tasks.
147
 
148
  Due to resource limitations, we only test our model on three BEIR tasks. NDCG@10 serves as the main metric.
149
 
150
 
151
- | Model | ColBERTv2 | Jina-ColBERT-V1 | Mxbai-ColBERT-V1|
152
- | ---------------------- | -------- | ---------- | ---------- |
153
- | scifact | 68.9 | 70.1 | **71.3** |
154
- | nfcorpus | 33.7 | 33.8 | **36.5** |
155
- | trec-covid | 72.6 | 75.0 | **80.5** |
156
 
157
  Although our ColBERT also performs well on retrieval, we recommend using our embedding model [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) for retrieval.
158
 
@@ -162,4 +118,4 @@ Please join our [Discord Community](https://discord.gg/jDfMHzAVfU) and share you
162
 
163
 
164
  ## License
165
- Apache 2.0
 
1
  ---
2
  license: apache-2.0
 
 
3
  ---
4
 
5
  <br><br>
 
15
  # mxbai-colbert-v1
16
 
17
  This is our first English ColBERT model, which is built upon our sentence embedding model [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1).
18
+ You can learn more about the models in our [blog post](https://www.mixedbread.ai/blog/mxbai-colbert-large-v1).
19
 
20
 
21
  ## Quickstart
22
 
23
+ We recommend using [RAGatouille](https://github.com/bclavie/RAGatouille) to work with our ColBERT model.
24
 
25
+ `pip install ragatouille`
26
 
 
27
 
 
28
 
29
  ```python
30
+ from ragatouille import RAGPretrainedModel
 
 
31
 
32
+ # Let's create a ragatouille instance
33
+ RAG = RAGPretrainedModel.from_pretrained("mixedbread-ai/mxbai-colbert-v1")
34
 
 
 
 
 
 
 
35
  documents = [
36
  "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
37
  "The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
 
40
  "The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
41
  "'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan."
42
  ]
 
 
 
43
 
44
+ # index documents
45
+ RAG.index(documents, index_name="mockingbird")
 
 
 
 
 
 
46
 
47
+ # search
48
+ query = "Who wrote 'To Kill a Mockingbird'?"
49
+ results = RAG.search(query)
50
  ```
51
 
52
+ The result looks like this:
53
 
54
+ ```
55
+ [{'content': "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
56
+ 'score': 28.453125,
57
+ 'rank': 1,
58
+ 'document_id': '9d564e82-f14f-433a-ab40-b10bda9dc370',
59
+ 'passage_id': 0},
60
+ {'content': "Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird', was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.",
61
+ 'score': 27.03125,
62
+ 'rank': 2,
63
+ 'document_id': 'a35a89c3-b610-4e2e-863e-fa1e7e0710a6',
64
+ 'passage_id': 2},
65
+ ...]
 
 
 
 
66
  ```
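
Each entry in the list above is a plain dictionary, so the results can be filtered or reshaped with ordinary Python. The following is a minimal sketch, assuming the field names shown in the sample output (`content`, `score`, `rank`). The `top_passages` helper and the abbreviated sample data are hypothetical and are not part of RAGatouille:

```python
# Hypothetical sample shaped like the output above (content abbreviated).
sample_results = [
    {"content": "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960.",
     "score": 28.453125, "rank": 1, "passage_id": 0},
    {"content": "Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird'.",
     "score": 27.03125, "rank": 2, "passage_id": 2},
]


def top_passages(results, min_score=0.0, limit=3):
    """Return (rank, score, content) tuples for the best-ranked results."""
    # Drop entries whose score falls below the threshold.
    kept = [r for r in results if r["score"] >= min_score]
    # Rank 1 is the best match, so sort ascending by rank.
    kept.sort(key=lambda r: r["rank"])
    return [(r["rank"], r["score"], r["content"]) for r in kept[:limit]]


for rank, score, content in top_passages(sample_results, min_score=27.5):
    print(f"{rank}. ({score:.2f}) {content}")
```

The scores are not normalized, so any threshold such as `min_score` would need to be chosen per use case rather than copied from this example.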
67
 
68
  ## Using API
 
73
 
74
  ### 1. Reranking Performance
75
 
76
+ **Setup:** we use BM25 as the first-stage retrieval model, and then use ColBERT for reranking. We evaluate the out-of-domain performance on 13 public BEIR datasets. Following common practice, we report NDCG@10 as the metric.
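
The two-stage setup described above can be sketched in plain Python. This is an illustration only, not the evaluation code behind the reported numbers: the BM25 variant (Lucene-style IDF), the `k1` and `b` defaults, the candidate depth `first_stage_k=100`, and the `rerank_fn` callable standing in for ColBERT scoring are all assumptions that do not come from the source.

```python
import math
from collections import Counter


def tokenize(text):
    """Whitespace tokenizer, used only to keep the sketch short."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for t in tokenized for term in set(t))
    n = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve_then_rerank(query, docs, rerank_fn, first_stage_k=100, final_k=10):
    """Keep the top BM25 candidates, then reorder them with rerank_fn."""
    scores = bm25_scores(query, docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    candidates = order[:first_stage_k]
    reranked = sorted(candidates, key=lambda i: rerank_fn(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked[:final_k]]


def overlap_score(query, doc):
    """Toy stand-in for a ColBERT relevance score: shared-token count."""
    return len(set(tokenize(query)) & set(tokenize(doc)))


toy_docs = [
    "harper lee wrote to kill a mockingbird",
    "moby dick by herman melville",
    "jane austen novels",
]
print(retrieve_then_rerank("who wrote mockingbird", toy_docs, overlap_score,
                           first_stage_k=2, final_k=1))
```

In a real pipeline, `rerank_fn` would be replaced by the ColBERT model's query-document score; the first stage only bounds how many candidates the more expensive model has to score.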
77
 
78
  Here, we compare our model with two widely used ColBERT models, as follows:
79
 
80
 
81
+ | Dataset | ColBERTv2 | Jina-ColBERT-v1 | mxbai-colbert-large-v1 |
82
+ | ---------------- | --------: | --------------: | ---------------------: |
83
+ | ArguAna | 29.99 | **33.42** | 33.11 |
84
+ | ClimateFEVER | 16.51 | 20.66 | **20.85** |
85
+ | DBPedia | 31.80 | **42.16** | 40.61 |
86
+ | FEVER | 65.13 | **81.07** | 80.75 |
87
+ | FiQA | 23.61 | 35.60 | **35.86** |
88
+ | HotPotQA | 63.30 | **68.84** | 67.62 |
89
+ | NFCorpus | 33.75 | **36.69** | 36.37 |
90
+ | NQ | 30.55 | 51.27 | **51.43** |
91
+ | Quora | 78.86 | 85.18 | **86.95** |
92
+ | SCIDOCS | 14.90 | 15.39 | **16.98** |
93
+ | SciFact | 67.89 | 70.20 | **71.48** |
94
+ | TREC-COVID | 59.47 | 75.00 | **81.04** |
95
+ | Webis-touché2020 | **44.22** | 32.12 | 31.70 |
96
+ | Average | 43.08 | 49.82 | **50.37** |
97
+
98
+ Find more in our [blog-post](https://www.mixedbread.ai/blog/mxbai-rerank-v1) and on this [spreadsheet](https://docs.google.com/spreadsheets/d/1ZT_KN40PnHQa21hTdrk4_9GCnqm916lJJz3W83mo1og/edit?usp=sharing).
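
For readers unfamiliar with the metric, NDCG@10 compares the discounted gain of the top 10 returned documents with that of an ideal ordering. The sketch below uses the linear-gain formulation and is an assumption about the general definition, not the evaluation script used for the table; toolkits can differ in details such as the gain function. The table values appear to be on a 0-100 scale, whereas this function returns a value between 0 and 1 for a single query.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: labels of the returned documents, in ranked order.
    all_relevances: labels of every judged document for the query.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Toy query: two relevant documents exist; the system ranks them 1st and 3rd.
print(round(ndcg_at_k([1, 0, 1], [1, 1, 0, 0]), 4))
```

A benchmark score is then the mean of this per-query value over all queries in a dataset.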
99
 
100
  ### 2. Retrieval Performance
101
 
102
+ ColBERT is mainly used for reranking. Here, we also test our model's performance on retrieval tasks using a subset of the BEIR benchmark.
103
 
104
  Due to resource limitations, we only test our model on three BEIR tasks. NDCG@10 serves as the main metric.
105
 
106
 
107
+ | Dataset | ColBERTv2 | Jina-ColBERT-v1 | mxbai-colbert-large-v1 |
108
+ | ---------- | --------: | --------------: | ---------------------: |
109
+ | NFCorpus | 33.7 | 33.8 | **36.5** |
110
+ | SciFact | 68.9 | 70.1 | **71.3** |
111
+ | TREC-COVID | 72.6 | 75.0 | **80.5** |
112
 
113
  Although our ColBERT also performs well on retrieval, we recommend using our embedding model [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) for retrieval.
114
 
 
118
 
119
 
120
  ## License
121
+ Apache 2.0