antoinelouis committed • Commit 97d37d7 • Parent(s): defef1e
Update README.md

README.md
---
pipeline_tag: sentence-similarity
language: fr
license: apache-2.0
datasets:

This is a [ColBERTv1](https://github.com/stanford-futuredata/ColBERT) model: it encodes queries & passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like clustering or semantic search. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
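
For intuition, the MaxSim operator mentioned above can be sketched in a few lines of PyTorch. This is only an illustrative sketch: `Q` and `D` are hypothetical, L2-normalized token-embedding matrices for one query and one passage, and the shapes are made up.
```
import torch
import torch.nn.functional as F

# Hypothetical token-level embedding matrices (num_tokens x embedding_dim).
Q = F.normalize(torch.randn(32, 128), dim=-1)   # one query with 32 tokens
D = F.normalize(torch.randn(180, 128), dim=-1)  # one passage with 180 tokens

# MaxSim: for each query token, keep its best-matching passage token, then sum.
score = (Q @ D.T).max(dim=1).values.sum()
```
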
## Installation

To use this model, you will need to install the following libraries:
```
pip install colbert-ir[faiss-gpu] faiss torch
```

## Usage

**Step 1: Indexing.** This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search. ⚠️ ColBERT indexing requires a GPU!
```
from colbert import Indexer
from colbert.infra import Run, RunConfig

n_gpu: int = 1  # Set your number of available GPUs
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # The name of your index, i.e. the name of your vector database

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
    documents = [
        "Ceci est un premier document.",
        "Voici un second document.",
        ...
    ]
    indexer.index(name=index_name, collection=documents)
```

**Step 2: Searching.** Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
```
from colbert import Searcher
from colbert.infra import Run, RunConfig

n_gpu: int = 0
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # Name of your previously created index where the documents you want to search are stored
k: int = 10  # How many results you want to retrieve

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = Searcher(index=index_name)  # You don't need to specify the checkpoint again; the model name is stored in the index
    query = "Comment effectuer une recherche avec ColBERT ?"
    results = searcher.search(query, k=k)
    # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
```
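
If you need the passage texts rather than just their ids, a small hypothetical post-processing step could look like the sketch below. It relies on the result layout described in the comment above and assumes each `passage_id` is the position of the passage in the `documents` list used at indexing time.
```
# Hypothetical sketch: map the returned ids back to the indexed texts.
for passage_id, passage_rank, passage_score in results:
    print(f"#{passage_rank} (score={passage_score:.2f}) {documents[passage_id]}")
```
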
## Evaluation