antoinelouis committed on
Commit
1bd87bc
1 Parent(s): 3b72f33

Update README.md

  1. README.md +26 -14
README.md CHANGED
@@ -9,7 +9,7 @@ metrics:
9
  tags:
10
  - feature-extraction
11
  - sentence-similarity
12
- library_name: colbert-ai
13
  inference: false
14
  language:
15
  - multilingual
@@ -112,25 +112,25 @@ language:
112
  <p>
113
  </h4>
114
 
115
- This is a [colbert-ai](https://github.com/stanford-futuredata/ColBERT) model: it encodes queries & passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like clustering or semantic search. The model uses an [XMOD](https://huggingface.co/facebook/xmod-base) backbone, which allows it to learn from monolingual fine-tuning in a high-resource language, like English, and perform zero-shot retrieval across multiple languages.
116
 
117
  ## Usage
118
 
119
- Here are some examples for using ColBERT-XM with [colbert-ai](#using-colbert-ai) or [RAGatouille](#using-ragatouille).
120
 
121
- ### Using ColBERT-AI
122
 
123
- Start by installing the [library](https://github.com/stanford-futuredata/ColBERT) and some extra rquirements:
124
 
125
  ```
126
- pip install git+https://github.com/stanford-futuredata/ColBERT.git@main#egg=colbert-ai torch==2.1.2 faiss-gpu==1.7.2
127
  ```
128
 
129
- Using the modeel on a collection of passages typically involves the following steps:
130
 
131
  - **Step 1: Indexing.** This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search. (⚠️ indexing requires a GPU!)
132
  ```
133
- from colbert import Indexer
134
  from colbert.infra import Run, RunConfig
135
 
136
  n_gpu: int = 1 # Set your number of available GPUs
@@ -138,7 +138,7 @@ experiment: str = "" # Name of the folder where the logs and created indices wil
138
  index_name: str = "" # The name of your index, i.e. the name of your vector database
139
 
140
  with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
141
- indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
142
  documents = [
143
  "Ceci est un premier document.",
144
  "Voici un second document.",
@@ -150,7 +150,7 @@ with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
150
 
151
  - **Step 2: Searching.** Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
152
  ```
153
- from colbert import Searcher
154
  from colbert.infra import Run, RunConfig
155
 
156
  n_gpu: int = 0
@@ -159,7 +159,7 @@ index_name: str = "" # Name of your previously created index where the documents
159
  k: int = 10 # how many results you want to retrieve
160
 
161
  with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
162
- searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
163
  query = "Comment effectuer une recherche avec ColBERT ?"
164
  results = searcher.search(query, k=k)
165
  # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
@@ -168,14 +168,14 @@ with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
168
 
169
  ### Using RAGatouille
170
 
171
- [To come...]
172
 
173
  ***
174
 
175
  ## Evaluation
176
 
177
  - **MS MARCO**:
178
- We evaluate our model on the small development set of [MS MARCO](https://ir-datasets.com/msmarco-passage.html#msmarco-passage/dev/small), which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compared its performance with other retrieval models on the official metrics for the dataset, i.e., mean reciprocal rank at cut-off 10 (MRR@10).
179
 
180
  | | model | Type | #Samples | #Params | en | es | fr | it | pt | id | de | ru | zh | ja | nl | vi | hi | ar | Avg. |
181
  |---:|:----------------------------------------------------------------------------------------------------------------------------------------|:--------------|:--------:|:-------:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
@@ -189,13 +189,25 @@ We evaluate our model on the small development set of [MS MARCO](https://ir-data
189
  | 7 | [DPR-XM](https://huggingface.co/antoinelouis/dpr-xm) (ours) | single-vector | 25.6M | 277M | 32.7 | 23.6 | 23.5 | 22.3 | 22.7 | 22.0 | 22.1 | 19.9 | 18.1 | 18.7 | 22.9 | 18.0 | 16.0 | 15.1 | 21.3 |
190
  | 8 | **ColBERT-XM** (ours) | multi-vector | 6.4M | 277M | 37.2 | 28.5 | 26.9 | 26.5 | 27.6 | 26.3 | 27.0 | 25.1 | 24.6 | 24.1 | 27.5 | 22.6 | 23.8 | 19.5 | 26.2 |
191
 
 
 
 
 
 
 
 
 
 
 
 
 
192
  ***
193
 
194
  ## Training
195
 
196
  #### Data
197
 
198
- We use the training samples from the [MS MARCO passage ranking](https://ir-datasets.com/msmarco-passage.html#msmarco-passage/train) dataset, which contains 8.8M passages and 539K training queries. We do not employ the BM25 negatives provided by the official dataset but instead sample harder negatives mined from 12 distinct dense retrievers, using the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives) distillation dataset. Our final training set consists of 6.4M (q, p+, p-) triples.
199
 
200
  #### Implementation
201
 
 
9
  tags:
10
  - feature-extraction
11
  - sentence-similarity
12
+ library_name: colbert-ir
13
  inference: false
14
  language:
15
  - multilingual
 
112
  <p>
113
  </h4>
114
 
115
+ This is a [colbert-ir](https://github.com/stanford-futuredata/ColBERT) model: it encodes queries & passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like clustering or semantic search. The model uses an [XMOD](https://huggingface.co/facebook/xmod-base) backbone, which allows it to learn from monolingual fine-tuning in a high-resource language, like English, and perform zero-shot retrieval across multiple languages.
116
 
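The MaxSim operator mentioned above can be illustrated with a minimal, dependency-free sketch. This is not the colbert library's implementation: the function names and the toy 2-dimensional embeddings are assumptions made purely for illustration. For each query token, the score keeps the best-matching passage token and sums these maxima over the query.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product between two token embeddings of equal length."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_embs: List[Vector], passage_embs: List[Vector]) -> float:
    """Late-interaction score: sum over query tokens of the maximum
    similarity with any passage token (hypothetical helper)."""
    return sum(max(dot(q, p) for p in passage_embs) for q in query_embs)


# Toy token embeddings, made up for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
passage_b = [[0.5, 0.5]]

score_a = maxsim_score(query, passage_a)  # 1.0 + 1.0
score_b = maxsim_score(query, passage_b)  # 0.5 + 0.5
```

In this toy case `passage_a` scores higher than `passage_b` because each query token finds an exact match among its tokens.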
117
  ## Usage
118
 
119
+ Here are some examples for using ColBERT-XM with [colbert-ir](#using-colbert-ir) or [RAGatouille](#using-ragatouille).
120
 
121
+ ### Using ColBERT-IR
122
 
123
+ Start by installing the [library](https://github.com/stanford-futuredata/ColBERT) and some extra requirements:
124
 
125
  ```
126
+ pip install git+https://github.com/stanford-futuredata/ColBERT.git@main#egg=colbert-ir torch==2.1.2 faiss-gpu==1.7.2 langdetect==1.0.9
127
  ```
128
 
129
+ Using the model on a collection of passages typically involves the following steps:
130
 
131
  - **Step 1: Indexing.** This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search. (⚠️ indexing requires a GPU!)
132
  ```
133
+ from . import CustomIndexer # Use of a custom indexer that automatically detects the language of the passages to index and activates the language-specific adapters accordingly
134
  from colbert.infra import Run, RunConfig
135
 
136
  n_gpu: int = 1 # Set your number of available GPUs
 
138
  index_name: str = "" # The name of your index, i.e. the name of your vector database
139
 
140
  with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
141
+ indexer = CustomIndexer(checkpoint="antoinelouis/colbert-xm")
142
  documents = [
143
  "Ceci est un premier document.",
144
  "Voici un second document.",
 
150
 
151
  - **Step 2: Searching.** Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
152
  ```
153
+ from . import CustomSearcher # Use of a custom searcher that automatically detects the language of the passages to index and activates the language-specific adapters accordingly
154
  from colbert.infra import Run, RunConfig
155
 
156
  n_gpu: int = 0
 
159
  k: int = 10 # how many results you want to retrieve
160
 
161
  with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
162
+ searcher = CustomSearcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
163
  query = "Comment effectuer une recherche avec ColBERT ?"
164
  results = searcher.search(query, k=k)
165
  # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
 
168
 
169
  ### Using RAGatouille
170
 
171
+ [To come]
172
 
173
  ***
174
 
175
  ## Evaluation
176
 
177
  - **MS MARCO**:
178
+ We evaluate our model on the small development sets of [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco), which consist of 6,980 queries for a corpus of 8.8M candidate passages in 14 languages. Below, we compare its multilingual performance with other retrieval models on the official metric for the dataset, i.e., mean reciprocal rank at cut-off 10 (MRR@10).
179
 
180
  | | model | Type | #Samples | #Params | en | es | fr | it | pt | id | de | ru | zh | ja | nl | vi | hi | ar | Avg. |
181
  |---:|:----------------------------------------------------------------------------------------------------------------------------------------|:--------------|:--------:|:-------:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
 
189
  | 7 | [DPR-XM](https://huggingface.co/antoinelouis/dpr-xm) (ours) | single-vector | 25.6M | 277M | 32.7 | 23.6 | 23.5 | 22.3 | 22.7 | 22.0 | 22.1 | 19.9 | 18.1 | 18.7 | 22.9 | 18.0 | 16.0 | 15.1 | 21.3 |
190
  | 8 | **ColBERT-XM** (ours) | multi-vector | 6.4M | 277M | 37.2 | 28.5 | 26.9 | 26.5 | 27.6 | 26.3 | 27.0 | 25.1 | 24.6 | 24.1 | 27.5 | 22.6 | 23.8 | 19.5 | 26.2 |
191
 
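For readers unfamiliar with the metric reported in the table above, here is a minimal sketch of how MRR@10 can be computed. The function name, the input layout (dictionaries keyed by query id), and the toy data are assumptions for illustration; the numbers in the table were not produced by this snippet.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]],
             relevant: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Mean reciprocal rank at cut-off k (hypothetical helper).

    rankings: query id -> passage ids ordered from best to worst.
    relevant: query id -> set of relevant passage ids.
    """
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant.get(qid, set()):
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(rankings) if rankings else 0.0


# Toy example: q1 hits at rank 2, q2 at rank 1, q3 has no relevant hit.
rankings = {"q1": ["p3", "p1", "p7"], "q2": ["p9", "p2"], "q3": ["p4"]}
relevant = {"q1": {"p1"}, "q2": {"p9"}, "q3": {"p5"}}
score = mrr_at_k(rankings, relevant, k=10)  # (1/2 + 1 + 0) / 3
```

Queries with no relevant passage in the top k contribute zero, which is why the metric rewards placing a relevant passage as early as possible.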
192
+ - **Mr. TyDi**:
193
+ - We also evaluate our model on the test set of [Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi), another multilingual open retrieval dataset including low-resource languages not present in mMARCO. Below, we compare its performance with other retrieval models on the official dataset metrics, i.e., mean reciprocal rank at cut-off 100 (MRR@100) and recall at cut-off 100 (R@100).
194
+
195
+ | | model | Type | #Samples | #Params | ar | bn | en | fi | id | ja | ko | ru | sw | te | Avg. |
196
+ |---:|:------------------------------------------------------------------------------|:--------------|:--------:|:-------:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
197
+ | | | | | | | | | | **MRR@100** | | | | | | |
198
+ | 1 | BM25 ([Pyserini](https://github.com/castorini/pyserini)) | lexical | - | - | 36.8 | 41.8 | 14.0 | 28.4 | 37.6 | 21.1 | 28.5 | 31.3 | 38.9 | 34.3 | 31.3 |
199
+ | 2 | mono-mT5 ([Bonifacio et al., 2021](https://doi.org/10.48550/arXiv.2108.13897)) | cross-encoder | 12.8M | 390M | 62.2 | 65.1 | 35.7 | 49.5 | 61.1 | 48.1 | 47.4 | 52.6 | 62.9 | 66.6 | 55.1 |
200
+ | 3 | mColBERT ([Bonifacio et al., 2021](https://doi.org/10.48550/arXiv.2108.13897)) | multi-vector | 25.6M | 180M | 55.3 | 48.8 | 32.9 | 41.3 | 55.5 | 36.6 | 36.7 | 48.2 | 44.8 | 61.6 | 46.1 |
201
+ | 4 | **ColBERT-XM** (ours) | multi-vector | 6.4M | 277M | 55.2 | 56.6 | 36.0 | 41.8 | 57.1 | 42.1 | 41.3 | 52.2 | 56.8 | 50.6 | 49.0 |
202
+ | | | | | | | | | | **R@100** | | | | | | |
203
+
204
  ***
205
 
206
  ## Training
207
 
208
  #### Data
209
 
210
+ We use the English training samples from the [MS MARCO passage ranking](https://ir-datasets.com/msmarco-passage.html#msmarco-passage/train) dataset, which contains 8.8M passages and 539K training queries. We do not employ the BM25 negatives provided by the official dataset but instead sample harder negatives mined from 12 distinct dense retrievers, using the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives) distillation dataset. Our final training set consists of 6.4M (q, p+, p-) triples.
211
 
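As a rough illustration of what (q, p+, p-) triples built from mined hard negatives look like, here is a small sketch. The exact sampling procedure is not described here, so everything below (the function name, the input layout, the number of negatives per positive, and the filtering of labeled positives) is an assumption rather than the actual training pipeline.

```python
import random
from typing import Dict, List, Tuple


def build_triples(positives: Dict[str, List[str]],
                  hard_negatives: Dict[str, List[str]],
                  negs_per_positive: int = 1,
                  seed: int = 0) -> List[Tuple[str, str, str]]:
    """Pair each (query, positive) with sampled hard negatives (hypothetical helper)."""
    rng = random.Random(seed)
    triples: List[Tuple[str, str, str]] = []
    for qid, pos_ids in positives.items():
        pos_set = set(pos_ids)
        # Drop mined candidates that are actually labeled as relevant.
        candidates = [pid for pid in hard_negatives.get(qid, []) if pid not in pos_set]
        if not candidates:
            continue  # no usable negative for this query
        for pos in pos_ids:
            n = min(negs_per_positive, len(candidates))
            for neg in rng.sample(candidates, n):
                triples.append((qid, pos, neg))
    return triples


# Toy ids, made up for illustration only.
positives = {"q1": ["p1"], "q2": ["p5"]}
hard_negatives = {"q1": ["p2", "p3", "p1"], "q2": []}
triples = build_triples(positives, hard_negatives, negs_per_positive=2)
```

Here `q2` yields no triple because it has no mined negatives, and `p1` is excluded from the negatives of `q1` because it is labeled as a positive.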
212
  #### Implementation
213