---
pipeline_tag: sentence-similarity
language: fr
license: mit
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- colbert
- passage-retrieval
base_model: camembert-base
library_name: RAGatouille
inference: false
model-index:
- name: colbertv1-camembert-base-mmarcoFR
  results:
    - task:
        type: sentence-similarity
        name: Passage Retrieval
      dataset:
        type: unicamp-dl/mmarco
        name: mMARCO-fr
        config: french
        split: validation
      metrics:
        - type: recall_at_1000
          name: Recall@1000
          value: 89.70
        - type: recall_at_500
          name: Recall@500
          value: 88.40
        - type: recall_at_100
          name: Recall@100
          value: 80.00
        - type: recall_at_10
          name: Recall@10
          value: 54.21
        - type: mrr_at_10
          name: MRR@10
          value: 29.51
---

# colbertv1-camembert-base-mmarcoFR

This is a [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832) model for semantic search. It encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
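
As a rough illustration of the MaxSim operator mentioned above, the sketch below scores one query against one passage in plain Python. The function name `maxsim_score` and the toy 2-dimensional vectors are hypothetical and chosen for readability; the sketch assumes the token embeddings are already L2-normalized, so a dot product equals a cosine similarity.

```python
def maxsim_score(query_emb, passage_emb):
    """Late-interaction score: for each query token, take the maximum dot
    product over all passage tokens, then sum these maxima over query tokens."""
    score = 0.0
    for q in query_emb:
        score += max(sum(qi * pi for qi, pi in zip(q, p)) for p in passage_emb)
    return score


# Toy, hand-made unit vectors (hypothetical values, not real model outputs).
query = [[1.0, 0.0], [0.0, 1.0]]                  # 2 query tokens
passage = [[0.6, 0.8], [1.0, 0.0], [0.0, -1.0]]   # 3 passage tokens

score = maxsim_score(query, passage)
print(round(score, 4))
```

In practice the libraries shown in the next section perform this computation with batched tensor operations over an index, so this function is only meant to make the scoring rule concrete.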

## Usage

Here are some examples for using the model with [RAGatouille](https://github.com/bclavie/RAGatouille) or [colbert-ai](https://github.com/stanford-futuredata/ColBERT).

### Using RAGatouille

First, you will need to install the following libraries:

```bash
pip install -U ragatouille
```

Then, you can use the model like this:

```python
from ragatouille import RAGPretrainedModel

index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus

# Step 1: Indexing.
RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv1-camembert-base-mmarcoFR")
RAG.index(index_name=index_name, collection=documents)

# Step 2: Searching.
RAG = RAGPretrainedModel.from_index(index_name) # if not already loaded
RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
```

### Using colbert-ai

First, you will need to install the following libraries:

```bash
pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
```

Then, you can use the model like this:

```python
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig

n_gpu: int = 1 # Set your number of available GPUs
experiment: str = "colbert" # Name of the folder where the logs and created indices will be stored
index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus

# Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
    indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
    indexer.index(name=index_name, collection=documents)

# Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
    searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
    results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
    # results: tuple of three lists of length k: (passage_ids, passage_ranks, passage_scores)
```

***

## Evaluation

The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compare its performance to a single-vector representation model fine-tuned on the same dataset. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k).

| model                                                                                                                   | Vocab. | #Param. |  Size |   MRR@10 |   R@10 |   R@100(↑) |   R@500 |
|:------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|---------:|-------:|-----------:|--------:|
| **colbertv1-camembert-base-mmarcoFR**                                                                                   |     🇫🇷 |    110M | 443MB |    29.51 |  54.21 |      80.00 |   88.40 |
| [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR)              |     🇫🇷 |    110M | 443MB |    28.53 |  51.46 |      77.82 |   89.13 |

***

## Training

#### Data

We use the French training set from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, 
a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries. 
We sample 12.8M (q, p+, p-) triples from the official ~39.8M [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset).

#### Implementation

The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and optimized via a combination of the pairwise softmax 
cross-entropy loss computed over predicted scores for the positive and hard negative passages (as in [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832)) 
and the in-batch sampled softmax cross-entropy loss (as in [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488)). It was trained on a single Tesla V100 GPU 
with 32GB of memory for 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set 
to 128, and the maximum sequence lengths for questions and passages were fixed to 32 and 256 tokens, respectively.
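
The two loss terms described above can be sketched in plain Python as follows. This is an illustrative sketch, not the training code: the function names and toy scores are hypothetical, and summing the two terms with equal weight is an assumption, since the card only says they are combined.

```python
import math


def log_softmax_at(scores, target):
    """Log-probability of scores[target] under a softmax over scores."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[target] - log_z


def pairwise_loss(pos_scores, neg_scores):
    """Softmax cross-entropy over each (positive, hard negative) score pair."""
    losses = [-log_softmax_at([p, n], 0) for p, n in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)


def in_batch_loss(score_matrix):
    """Softmax cross-entropy where row i holds query i's scores against every
    positive passage in the batch; the diagonal entry is the correct one."""
    losses = [-log_softmax_at(row, i) for i, row in enumerate(score_matrix)]
    return sum(losses) / len(losses)


# Hypothetical MaxSim scores for a batch of two queries.
pos = [2.0, 1.0]            # score(query_i, positive_i)
neg = [0.0, 1.0]            # score(query_i, hard_negative_i)
in_batch = [[2.0, 0.0],     # score(query_i, positive_j) for all i, j
            [0.0, 1.0]]

# Equal weighting is an assumption; the card does not state the weights.
total = pairwise_loss(pos, neg) + in_batch_loss(in_batch)
```

With this formulation, the other queries' positive passages in the same batch act as additional negatives at no extra encoding cost.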

***

## Citation

```bibtex
@online{louis2023,
   author    = {Antoine Louis},
   title     = {colbertv1-camembert-base-mmarcoFR: The 1st ColBERT Model for French},
   publisher = {Hugging Face},
   month     = {dec},
   year      = {2023},
   url       = {https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR},
}
```