antoinelouis committed on
Commit ae50523
1 Parent(s): 3e39e6f

Update README.md

Files changed (1): README.md (+16 −29)
README.md CHANGED
@@ -110,54 +110,41 @@ language:
  <p>
  </h4>
 
- This is a [colbert-ir](https://github.com/stanford-futuredata/ColBERT) model: it encodes queries & passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like clustering or semantic search. The model uses an [XMOD](https://huggingface.co/facebook/xmod-base) backbone, which allows it to learn from monolingual fine-tuning in a high-resource language, like English, and perform zero-shot retrieval across multiple languages.
 
  ## Usage
 
- Start by installing the [library](https://github.com/stanford-futuredata/ColBERT) and some extra requirements:
 
- ```
  pip install git+https://github.com/stanford-futuredata/ColBERT.git@main#egg=colbert-ir torch==2.1.2 faiss-gpu==1.7.2 langdetect==1.0.9
  ```
 
- Using the model on a collection of passages typically involves the following steps:
 
- - **Step 1: Indexing.** This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search. (⚠️ Indexing requires a GPU!)
- ```
- from .custom import CustomIndexer  # Custom indexer that automatically detects the language of the passages to index and activates the language-specific adapters accordingly
  from colbert.infra import Run, RunConfig
 
  n_gpu: int = 1  # Set your number of available GPUs
- experiment: str = ""  # Name of the folder where the logs and created indices will be stored
- index_name: str = ""  # The name of your index, i.e. the name of your vector database
 
  with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
      indexer = CustomIndexer(checkpoint="antoinelouis/colbert-xm")
-     documents = [
-         "Ceci est un premier document.",
-         "Voici un second document.",
-         ...
-     ]
      indexer.index(name=index_name, collection=documents)
- ```
-
- - **Step 2: Searching.** Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
- ```
- from .custom import CustomSearcher  # Custom searcher that automatically detects the language of the query and activates the language-specific adapters accordingly
- from colbert.infra import Run, RunConfig
-
- n_gpu: int = 0
- experiment: str = ""  # Name of the folder where the logs and created indices will be stored
- index_name: str = ""  # Name of your previously created index where the documents you want to search are stored
- k: int = 10  # How many results you want to retrieve
-
  with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
      searcher = CustomSearcher(index=index_name)  # No need to specify the checkpoint again: the model name is stored in the index
-     query = "Comment effectuer une recherche avec ColBERT ?"
-     results = searcher.search(query, k=k)
      # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
-
  ```
 
  ***
 
  <p>
  </h4>
 
+ This is a [ColBERT](https://doi.org/10.48550/arXiv.2112.01488) model that can be used for semantic search in many languages. It encodes queries & passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. The model uses an [XMOD](https://huggingface.co/facebook/xmod-base) backbone, which allows it to learn from monolingual fine-tuning in a high-resource language, like English, and perform zero-shot retrieval across multiple languages.
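The MaxSim operator described above can be sketched in a few lines. This is a minimal NumPy illustration with random stand-in embeddings (the shapes and values are invented, not produced by colbert-xm), not the library's actual implementation:

```python
import numpy as np

# Toy token-level embeddings: a 3-token query and a 5-token passage,
# embedding dimension 4, L2-normalised so dot products are cosine similarities.
rng = np.random.default_rng(0)

def l2_normalise(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

query_emb = l2_normalise(rng.normal(size=(3, 4)))
passage_emb = l2_normalise(rng.normal(size=(5, 4)))

# MaxSim: for each query token, keep only its best-matching passage token,
# then sum these maxima to get the query-passage relevance score.
token_sims = query_emb @ passage_emb.T  # shape (3, 5): all token-to-token similarities
score = token_sims.max(axis=1).sum()    # scalar; bounded by the number of query tokens
print(float(score))
```

Because each cosine similarity is at most 1, the score is bounded by the number of query tokens, which is why longer queries yield larger raw scores.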
 
  ## Usage
 
+ Start by installing the [colbert-ir](https://github.com/stanford-futuredata/ColBERT) library and some extra requirements:
 
+ ```bash
  pip install git+https://github.com/stanford-futuredata/ColBERT.git@main#egg=colbert-ir torch==2.1.2 faiss-gpu==1.7.2 langdetect==1.0.9
  ```
 
+ Then, you can use the model like this:
 
+ ```python
+ # Custom modules that automatically detect the language of the passages and queries, and activate the language-specific adapters accordingly
+ from .custom import CustomIndexer, CustomSearcher
  from colbert.infra import Run, RunConfig
 
  n_gpu: int = 1  # Set your number of available GPUs
+ experiment: str = "colbert"  # Name of the folder where the logs and created indices will be stored
+ index_name: str = "my_index"  # The name of your index, i.e. the name of your vector database
+ documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."]  # Corpus
 
+ # Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
  with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
      indexer = CustomIndexer(checkpoint="antoinelouis/colbert-xm")
      indexer.index(name=index_name, collection=documents)
 
+ # Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
  with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
      searcher = CustomSearcher(index=index_name)  # No need to specify the checkpoint again: the model name is stored in the index
+     results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
      # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
  ```
 
  ***
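The search results follow the shape documented in the snippet above, `((passage_id, passage_rank, passage_score), ...)`, so they can be unpacked directly. A standalone sketch with invented IDs and scores (not real model output):

```python
# Invented results in the documented shape: ((passage_id, passage_rank, passage_score), ...)
results = ((17, 1, 24.5), (42, 2, 22.1), (3, 3, 19.8))

# Each inner tuple unpacks into the passage's id, its rank, and its MaxSim score.
for passage_id, rank, score in results:
    print(f"{rank}. passage {passage_id} (score={score:.1f})")
```

The passage IDs index into the collection passed to `indexer.index(...)`, so they can be used to look the original texts back up.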