antoinelouis
committed on
Commit
•
44c68f3
1
Parent(s):
2c37eac
Update README.md
README.md
CHANGED
@@ -44,8 +44,9 @@ model-index:
|
|
44 |
|
45 |
# biencoder-camembert-L4-mmarcoFR
|
46 |
|
47 |
-
This is a lightweight dense single-vector bi-encoder model for French
|
48 |
-
The model
|
|
|
49 |
checkpoint with 51% fewer parameters, obtained by [dropping the top layers](https://doi.org/10.48550/arXiv.2004.03844) from the original model.
|
50 |
|
51 |
## Usage
|
@@ -126,22 +127,11 @@ similarity = q_embeddings @ p_embeddings.T
|
|
126 |
print(similarity)
|
127 |
```
|
128 |
|
129 |
-
***
|
130 |
-
|
131 |
## Evaluation
|
132 |
|
133 |
-
|
134 |
-
|
135 |
-
|
136 |
-
|---:|:-------------------------------------------------------------------------------------------------------------|--------:|------:|-------:|---------:|-------:|-------:|--------:|-------:|
|
137 |
-
| 1 | [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 111M | 445MB | 89.1 | 77.8 | 51.5 | 28.5 | 33.7 | 27.9 |
|
138 |
-
| 2 | [biencoder-camembert-L10-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-L10-mmarcoFR) | 96M | 386MB | 87.8 | 76.7 | 49.5 | 27.5 | 32.5 | 27.0 |
|
139 |
-
| 3 | [biencoder-camembert-L8-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-L8-mmarcoFR) | 82M | 329MB | 87.4 | 75.9 | 48.9 | 26.7 | 31.8 | 26.2 |
|
140 |
-
| 4 | [biencoder-camembert-L6-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-L6-mmarcoFR) | 68M | 272MB | 86.7 | 74.9 | 46.7 | 25.7 | 30.4 | 25.1 |
|
141 |
-
| 5 | **biencoder-camembert-L4-mmarcoFR** | 54M | 216MB | 85.4 | 72.1 | 44.2 | 23.7 | 28.3 | 23.2 |
|
142 |
-
| 6 | [biencoder-camembert-L2-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-L2-mmarcoFR) | 40M | 159MB | 81.0 | 66.3 | 38.5 | 20.1 | 24.3 | 19.7 |
|
143 |
-
|
144 |
-
***
|
145 |
|
146 |
## Training
|
147 |
|
@@ -158,17 +148,14 @@ The model is initialized from the [camembert-L4](https://huggingface.co/antoinel
|
|
158 |
using the AdamW optimizer with a batch size of 1152 and a peak learning rate of 2e-5, with warm-up over the first 1736 steps and linear scheduling.
|
159 |
We set the maximum sequence lengths for both the questions and passages to 128 tokens. We use the cosine similarity to compute relevance scores.
|
160 |
|
161 |
-
***
|
162 |
-
|
163 |
## Citation
|
164 |
|
165 |
```bibtex
|
166 |
-
@online{
|
167 |
-
|
168 |
-
|
169 |
-
|
170 |
-
|
171 |
-
|
172 |
-
|
173 |
-
}
|
174 |
-
```
|
|
|
44 |
|
45 |
# biencoder-camembert-L4-mmarcoFR
|
46 |
|
47 |
+
This is a lightweight dense single-vector bi-encoder model for **French** that can be used for semantic search.
|
48 |
+
The model maps queries and passages to 768-dimensional dense vectors which are used to compute relevance through cosine similarity.
|
49 |
+
It uses a [CamemBERT-L4](https://huggingface.co/antoinelouis/camembert-L4) backbone, which is a pruned version of the pre-trained [CamemBERT](https://huggingface.co/camembert-base)
|
50 |
checkpoint with 51% fewer parameters, obtained by [dropping the top layers](https://doi.org/10.48550/arXiv.2004.03844) from the original model.
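
To illustrate how relevance is computed from the embeddings, here is a minimal sketch of cosine-similarity ranking in plain Python. The toy 3-dimensional vectors and the helper names `cosine_similarity` and `rank_passages` are assumptions made for this example only; the model itself produces 768-dimensional vectors.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_passages(query_vec: Sequence[float],
                  passage_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return passage indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, p) for p in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy embeddings standing in for real 768-dimensional vectors.
    query = [0.2, 0.9, 0.1]
    passages = [[0.1, 0.8, 0.0], [0.9, 0.1, 0.3], [0.0, 0.2, 0.9]]
    print(rank_passages(query, passages))
```

If the embeddings are L2-normalized beforehand, the cosine similarity reduces to a plain dot product, which is why a single matrix product is enough to score all query-passage pairs.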
|
51 |
|
52 |
## Usage
|
|
|
127 |
print(similarity)
|
128 |
```
|
129 |
|
|
|
|
|
130 |
## Evaluation
|
131 |
|
132 |
+
The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for a corpus of
|
133 |
+
8.8M candidate passages. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k).
|
134 |
+
To see how it compares to other neural retrievers in French, check out the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.
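
For readers unfamiliar with these metrics, the sketch below shows one way MRR@k and R@k could be computed from ranked retrieval results. It is not the evaluation script used for this model: the function names, the `run`/`qrels` dictionary layout, and the toy identifiers are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def reciprocal_rank_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """1/rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant passages that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def evaluate(run: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int) -> Dict[str, float]:
    """Average both metrics over all queries that have relevance judgments."""
    query_ids = [qid for qid in qrels if qid in run]
    if not query_ids:
        return {f"MRR@{k}": 0.0, f"R@{k}": 0.0}
    mrr = sum(reciprocal_rank_at_k(run[q], qrels[q], k) for q in query_ids) / len(query_ids)
    rec = sum(recall_at_k(run[q], qrels[q], k) for q in query_ids) / len(query_ids)
    return {f"MRR@{k}": mrr, f"R@{k}": rec}


if __name__ == "__main__":
    # Hypothetical toy data: ranked passage ids per query and relevant ids per query.
    run = {"q1": ["p3", "p1", "p2"], "q2": ["p9", "p8", "p7"]}
    qrels = {"q1": {"p1"}, "q2": {"p7", "p5"}}
    print(evaluate(run, qrels, k=3))
```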
|
135 |
|
136 |
## Training
|
137 |
|
|
|
148 |
using the AdamW optimizer with a batch size of 1152 and a peak learning rate of 2e-5, with warm-up over the first 1736 steps and linear scheduling.
|
149 |
We set the maximum sequence lengths for both the questions and passages to 128 tokens. We use the cosine similarity to compute relevance scores.
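
As a rough illustration of the schedule described above, the following sketch computes the learning rate at a given step, assuming that "linear scheduling" means a linear warm-up to the peak followed by a linear decay to zero. That interpretation, the function name, and the `total_steps` value used in the demo are assumptions; the total number of training steps is not stated here.

```python
PEAK_LR = 2e-5        # peak learning rate stated in the training description
WARMUP_STEPS = 1736   # number of warm-up steps stated in the training description


def linear_warmup_decay_lr(step: int, total_steps: int,
                           peak_lr: float = PEAK_LR,
                           warmup_steps: int = WARMUP_STEPS) -> float:
    """Learning rate at `step`: linear ramp up to the peak, then linear decay to 0."""
    if step < warmup_steps:
        # Warm-up phase: grow linearly from 0 to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase (assumed): shrink linearly from the peak to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    total_steps = 10_000  # hypothetical value, chosen only for this demo
    for step in (0, 868, 1736, 5000, 10_000):
        print(step, linear_warmup_decay_lr(step, total_steps))
```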
|
150 |
|
|
|
|
|
151 |
## Citation
|
152 |
|
153 |
```bibtex
|
154 |
+
@online{louis2024decouvrir,
|
155 |
+
author = {Antoine Louis},
|
156 |
+
title = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
|
157 |
+
publisher = {Hugging Face},
|
158 |
+
month = {mar},
|
159 |
+
year = {2024},
|
160 |
+
url = {https://huggingface.co/spaces/antoinelouis/decouvrir},
|
161 |
+
}
|
|