antoinelouis committed
Commit e307298
1 Parent(s): b463025

Update README.md

Files changed (1):
  1. README.md +40 -27
README.md CHANGED
@@ -15,53 +15,66 @@ inference: false
  
  # colbertv1-camembert-base-mmarcoFR
  
- This is a [ColBERTv1](https://github.com/stanford-futuredata/ColBERT) model: it encodes queries & passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like clustering or semantic search. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
  
- ## Installation
  
- To use this model, you will need to install the following libraries:
- ```
  pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
  ```
  
- ## Usage
  
- **Step 1: Indexing.** This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search. ⚠️ ColBERT indexing requires a GPU!
- ```
- from colbert import Indexer
  from colbert.infra import Run, RunConfig
  
  n_gpu: int = 1 # Set your number of available GPUs
- experiment: str = "" # Name of the folder where the logs and created indices will be stored
- index_name: str = "" # The name of your index, i.e. the name of your vector database
  
  with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
      indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
- documents = [
-     "Ceci est un premier document.",
-     "Voici un second document.",
-     ...
- ]
      indexer.index(name=index_name, collection=documents)
  ```
  
- **Step 2: Searching.** Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
  ```
- from colbert import Searcher
- from colbert.infra import Run, RunConfig
  
- n_gpu: int = 0
- experiment: str = "" # Name of the folder where the logs and created indices will be stored
- index_name: str = "" # Name of your previously created index where the documents you want to search are stored.
- k: int = 10 # how many results you want to retrieve
  
- with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
-     searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
-     query = "Comment effectuer une recherche avec ColBERT ?"
-     results = searcher.search(query, k=k)
- # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
  ```
  
  ## Evaluation
 
  # colbertv1-camembert-base-mmarcoFR
  
+ This is a [ColBERTv1](https://github.com/stanford-futuredata/ColBERT) model for semantic search. It encodes queries & passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
  
+ ## Usage
  
+ Here are some examples for using the model with [colbert-ai](https://github.com/stanford-futuredata/ColBERT) or [RAGatouille](https://github.com/bclavie/RAGatouille).
+ 
+ ### Using ColBERT-AI
+ 
+ First, you will need to install the following libraries:
+ 
+ ```bash
  pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
  ```
  
+ Then, you can use the model like this:
  
+ ```python
+ from colbert import Indexer, Searcher
  from colbert.infra import Run, RunConfig
  
  n_gpu: int = 1 # Set your number of available GPUs
+ experiment: str = "colbert" # Name of the folder where the logs and created indices will be stored
+ index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
+ documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
  
+ # Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
  with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
      indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
      indexer.index(name=index_name, collection=documents)
  
+ # Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
+ with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
+     searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
+     results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
+     # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
  ```
  
+ ### Using RAGatouille
+ 
+ First, you will need to install the following libraries:
+ 
+ ```bash
+ pip install -U ragatouille
  ```
  
+ Then, you can use the model like this:
  
+ ```python
+ from ragatouille import RAGPretrainedModel
+ 
+ index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
+ documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
+ 
+ # Step 1: Indexing.
+ RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv1-camembert-base-mmarcoFR")
+ RAG.index(name=index_name, collection=documents)
+ 
+ # Step 2: Searching.
+ RAG = RAGPretrainedModel.from_index(index_name) # if not already loaded
+ RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
  ```
  
  ## Evaluation
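For intuition, the "scalable vector-similarity (MaxSim) operators" mentioned in the README's description can be sketched as below. This is a minimal NumPy illustration of late-interaction scoring (each query token is matched to its most similar passage token, and the maxima are summed), not colbert-ai's actual implementation; `maxsim_score` is a hypothetical helper name, and the embeddings are assumed to be L2-normalized so dot products equal cosine similarities.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, passage_emb: np.ndarray) -> float:
    """MaxSim relevance score between one query and one passage.

    query_emb:   (num_query_tokens, dim) L2-normalized token embeddings
    passage_emb: (num_passage_tokens, dim) L2-normalized token embeddings
    """
    # Similarity of every query token against every passage token.
    sim = query_emb @ passage_emb.T  # shape: (num_query_tokens, num_passage_tokens)
    # For each query token, keep its best-matching passage token, then sum.
    return float(sim.max(axis=1).sum())

# Toy example with 2-dimensional unit vectors.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.7071, 0.7071]])
score = maxsim_score(q, d)  # 1.0 + 0.7071
```

Ranking a collection then amounts to computing this score between the query matrix and each passage matrix, which is what the index structures above accelerate.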