antoinelouis committed
Commit 91dba35
Parent: 41e2a7b

Update README.md

Files changed (1): README.md (+54 −27)
README.md CHANGED
@@ -7,75 +7,102 @@ datasets:
  metrics:
  - recall
  tags:
- - sentence-similarity
  - colbert
  base_model: antoinelouis/camembert-L4
  library_name: RAGatouille
  inference: false
  ---

- # 🇫🇷 colbertv2-camembert-L4-mmarcoFR

  This is a lightweight [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488) model for **French** that can be used for semantic search. It encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.

  ## Usage

- Here are some examples for using the model with [colbert-ai](https://github.com/stanford-futuredata/ColBERT) or [RAGatouille](https://github.com/bclavie/RAGatouille).

- ### Using ColBERT-AI

  First, you will need to install the following libraries:

  ```bash
- pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
  ```

  Then, you can use the model like this:

  ```python
- from colbert import Indexer, Searcher
- from colbert.infra import Run, RunConfig

- n_gpu: int = 1  # Set your number of available GPUs
- experiment: str = "colbert"  # Name of the folder where the logs and created indices will be stored
  index_name: str = "my_index"  # The name of your index, i.e. the name of your vector database
  documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."]  # Corpus

- # Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
- with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
-     indexer = Indexer(checkpoint="antoinelouis/colbertv2-camembert-L4-mmarcoFR")
-     indexer.index(name=index_name, collection=documents)

- # Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
- with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
-     searcher = Searcher(index=index_name)  # You don't need to specify the checkpoint again; the model name is stored in the index.
-     results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
-     # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
  ```

- ### Using RAGatouille

  First, you will need to install the following libraries:

  ```bash
- pip install -U ragatouille
  ```

  Then, you can use the model like this:

  ```python
- from ragatouille import RAGPretrainedModel

  index_name: str = "my_index"  # The name of your index, i.e. the name of your vector database
  documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."]  # Corpus

- # Step 1: Indexing.
- RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv2-camembert-L4-mmarcoFR")
- RAG.index(name=index_name, collection=documents)

- # Step 2: Searching.
- RAG = RAGPretrainedModel.from_index(index_name)  # if not already loaded
- RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
  ```

  ***
 
  metrics:
  - recall
  tags:
  - colbert
+ - passage-retrieval
  base_model: antoinelouis/camembert-L4
  library_name: RAGatouille
  inference: false
+ model-index:
+ - name: colbertv2-camembert-L4-mmarcoFR
+   results:
+   - task:
+       type: sentence-similarity
+       name: Passage Retrieval
+     dataset:
+       type: unicamp-dl/mmarco
+       name: mMARCO-fr
+       config: french
+       split: validation
+     metrics:
+     - type: recall_at_1000
+       name: Recall@1000
+       value: 91.9
+     - type: recall_at_500
+       name: Recall@500
+       value: 90.3
+     - type: recall_at_100
+       name: Recall@100
+       value: 81.9
+     - type: recall_at_10
+       name: Recall@10
+       value: 56.7
+     - type: mrr_at_10
+       name: MRR@10
+       value: 32.3
  ---

+ # colbertv2-camembert-L4-mmarcoFR

  This is a lightweight [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488) model for **French** that can be used for semantic search. It encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
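For intuition, the MaxSim late-interaction score described above can be sketched as follows. This is a minimal NumPy sketch of the scoring idea, not the library's internals; the array shapes and the toy data are assumptions for illustration:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, passage_emb: np.ndarray) -> float:
    """Late-interaction relevance: for each query token, take its maximum
    cosine similarity over all passage tokens, then sum over query tokens.

    query_emb:   (num_query_tokens, dim), rows L2-normalized
    passage_emb: (num_passage_tokens, dim), rows L2-normalized
    """
    sim = query_emb @ passage_emb.T      # (q_tokens, p_tokens) cosine similarities
    return float(sim.max(axis=1).sum())  # best passage-token match per query token, summed

def l2norm(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy example: 2 query tokens and 3 passage tokens in 4 dimensions.
rng = np.random.default_rng(0)
q = l2norm(rng.normal(size=(2, 4)))
p = l2norm(rng.normal(size=(3, 4)))
score = maxsim_score(q, p)  # higher score = better contextual match
```

Because each per-token maximum is a cosine similarity, the score is bounded by the number of query tokens, and a passage identical to the query attains that bound.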

  ## Usage

+ Here are some examples for using the model with [RAGatouille](https://github.com/bclavie/RAGatouille) or [colbert-ai](https://github.com/stanford-futuredata/ColBERT).

+ ### Using RAGatouille

  First, you will need to install the following libraries:

  ```bash
+ pip install -U ragatouille
  ```

  Then, you can use the model like this:

  ```python
+ from ragatouille import RAGPretrainedModel

  index_name: str = "my_index"  # The name of your index, i.e. the name of your vector database
  documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."]  # Corpus

+ # Step 1: Indexing.
+ RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv2-camembert-L4-mmarcoFR")
+ RAG.index(name=index_name, collection=documents)

+ # Step 2: Searching.
+ RAG = RAGPretrainedModel.from_index(index_name)  # if not already loaded
+ RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
  ```
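The returned hits can then be filtered before use. The sketch below assumes RAGatouille's commonly documented list-of-dicts result format with `content`, `score`, and `rank` keys; that shape, the sample passages, and the threshold are assumptions for illustration, not guaranteed by this card:

```python
# Hypothetical search results in the assumed list-of-dicts format.
results = [
    {"content": "Voici un second document.", "score": 17.2, "rank": 1},
    {"content": "Ceci est un premier document.", "score": 12.5, "rank": 2},
]

# Keep only passages above a score threshold, preserving rank order.
top_passages = [r["content"] for r in results if r["score"] >= 15.0]
```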

+ ### Using ColBERT-AI

  First, you will need to install the following libraries:

  ```bash
+ pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
  ```

  Then, you can use the model like this:

  ```python
+ from colbert import Indexer, Searcher
+ from colbert.infra import Run, RunConfig

+ n_gpu: int = 1  # Set your number of available GPUs
+ experiment: str = "colbert"  # Name of the folder where the logs and created indices will be stored
  index_name: str = "my_index"  # The name of your index, i.e. the name of your vector database
  documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."]  # Corpus

+ # Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
+ with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
+     indexer = Indexer(checkpoint="antoinelouis/colbertv2-camembert-L4-mmarcoFR")
+     indexer.index(name=index_name, collection=documents)

+ # Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
+ with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
+     searcher = Searcher(index=index_name)  # You don't need to specify the checkpoint again; the model name is stored in the index.
+     results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
+     # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
  ```

  ***