Commit c21a032 by antoinelouis
Parent(s): 973151c
Update README.md
README.md CHANGED
@@ -43,7 +43,8 @@ model-index:
|
|
43 |
|
44 |
# colbertv1-camembert-base-mmarcoFR
|
45 |
|
46 |
-
This is a [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832) model for semantic search. It encodes queries and passages into matrices
|
|
|
47 |
|
48 |
## Usage
|
49 |
|
@@ -105,18 +106,20 @@ with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
|
|
105 |
# results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
|
106 |
```
|
107 |
|
108 |
-
***
|
109 |
-
|
110 |
## Evaluation
|
111 |
|
112 |
-
The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of
|
|
|
|
|
|
|
113 |
|
114 |
-
| model
|
115 |
-
|
116 |
-
|
|
117 |
-
| [
|
|
|
118 |
|
119 |
-
|
120 |
|
121 |
## Training
|
122 |
|
@@ -134,17 +137,15 @@ and the in-batch sampled softmax cross-entropy loss (as in [ColBERTv2](https://d
|
|
134 |
with 32GB of memory for 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set
|
135 |
to 128, and the maximum sequence lengths for questions and passages were fixed to 32 and 256 tokens, respectively.
|
136 |
|
137 |
-
***
|
138 |
-
|
139 |
## Citation
|
140 |
|
141 |
```bibtex
|
142 |
-
@online{
|
143 |
-
|
144 |
-
|
145 |
-
|
146 |
-
|
147 |
-
|
148 |
-
|
149 |
}
|
150 |
```
|
|
|
43 |
|
44 |
# colbertv1-camembert-base-mmarcoFR
|
45 |
|
46 |
+
This is a [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832) model for **French** that can be used for semantic search. It encodes queries and passages into matrices
|
47 |
+
of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
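
To make the MaxSim operator mentioned above concrete, here is a minimal pure-Python sketch of late-interaction scoring. It assumes cosine similarity between L2-normalized token embeddings; the function names and the toy 2-dimensional vectors are hypothetical and are not taken from the model's code, which produces 128-dimensional embeddings.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def maxsim_score(query_embs, passage_embs):
    """Sum, over query tokens, of the best cosine match among passage tokens."""
    q = [l2_normalize(v) for v in query_embs]
    p = [l2_normalize(v) for v in passage_embs]
    score = 0.0
    for qv in q:
        # For each query token, keep only its most similar passage token.
        score += max(sum(a * b for a, b in zip(qv, pv)) for pv in p)
    return score

# Toy example: one matrix of token embeddings per text (illustrative values).
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
passage_b = [[-1.0, 0.0], [1.0, 1.0]]

score_a = maxsim_score(query, passage_a)
score_b = maxsim_score(query, passage_b)
print(score_a > score_b)  # passage_a matches both query tokens exactly
```

Because each query token is scored independently against the passage tokens, passage embeddings can be computed and indexed offline, which is what makes this kind of scoring scalable.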
|
48 |
|
49 |
## Usage
|
50 |
|
|
|
106 |
# results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
|
107 |
```
|
108 |
|
|
|
|
|
109 |
## Evaluation
|
110 |
|
111 |
+
The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for a corpus of
|
112 |
+
8.8M candidate passages. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k).
|
113 |
+
Below, we compare its performance with other publicly available French ColBERT models fine-tuned on the same dataset. To see how it compares to other neural retrievers in French,
|
114 |
+
check out the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.
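
As a rough illustration of how two of the reported metrics are typically computed, the sketch below implements MRR@k and R@k over a toy run. The helper names, the `run`/`qrels` dictionaries, and the passage identifiers are hypothetical; this is not the evaluation script used for the numbers in the table.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1/rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_over_queries(metric, run, qrels, k):
    """Average a per-query metric over every query that has judgments."""
    values = [metric(run.get(qid, []), rel, k) for qid, rel in qrels.items()]
    return sum(values) / len(values)

# Toy data: relevance judgments and ranked passage ids per query.
qrels = {"q1": {"p3"}, "q2": {"p7", "p9"}}
run = {"q1": ["p1", "p3", "p5"], "q2": ["p7", "p2", "p4"]}

mrr_at_10 = mean_over_queries(reciprocal_rank_at_k, run, qrels, 10)
recall_at_2 = mean_over_queries(recall_at_k, run, qrels, 2)
print(mrr_at_10, recall_at_2)
```

The table reports these values as percentages, so a score computed this way would be multiplied by 100.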
|
115 |
|
116 |
+
| model | #Param.(↓) | Size | Dim. | Index | R@1000 | R@500 | R@100 | R@10 | MRR@10 |
|
117 |
+
|:-----------------------------------------------------------------------------------------------------------|-----------:|------:|-----:|------:|-------:|------:|------:|-----:|-------:|
|
118 |
+
| [colbertv2-camembert-L4-mmarcoFR](https://huggingface.co/antoinelouis/colbertv2-camembert-L4-mmarcoFR) | 54M | 0.2GB | 32 | 9GB | 91.9 | 90.3 | 81.9 | 56.7 | 32.3 |
|
119 |
+
| [FraColBERTv2](https://huggingface.co/bclavie/FraColBERTv2) | 111M | 0.4GB | 128 | 28GB | 90.0 | 88.9 | 81.2 | 57.1 | 32.4 |
|
120 |
+
| **colbertv1-camembert-base-mmarcoFR** | 111M | 0.4GB | 128 | 28GB | 89.7 | 88.4 | 80.0 | 54.2 | 29.5 |
|
121 |
|
122 |
+
NB: Index corresponds to the size of the mMARCO-fr index (8.8M passages) on disk when using ColBERTv2's residual compression mechanism.
|
123 |
|
124 |
## Training
|
125 |
|
|
|
137 |
with 32GB of memory for 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set
|
138 |
to 128, and the maximum sequence lengths for questions and passages were fixed to 32 and 256 tokens, respectively.
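
The Training section refers to an in-batch sampled softmax cross-entropy loss. A minimal sketch of one common formulation is given below, assuming a square score matrix in which entry `[i][j]` is the relevance score of query `i` against the positive passage of query `j`, so that the other queries' passages act as negatives. The function name and the toy scores are hypothetical, and the actual training code may differ in its details.

```python
import math

def in_batch_softmax_ce(scores):
    """Mean cross-entropy where the diagonal entry is the target of each row."""
    total = 0.0
    for i, row in enumerate(scores):
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        # Negative log-probability of the query's own positive passage.
        total += log_z - row[i]
    return total / len(scores)

# Toy batch of 2 queries: high diagonal scores mean positives are ranked first.
good_scores = [[5.0, 1.0], [0.5, 4.0]]
uninformative_scores = [[0.0, 0.0], [0.0, 0.0]]

good_loss = in_batch_softmax_ce(good_scores)
uniform_loss = in_batch_softmax_ce(uninformative_scores)
print(good_loss < uniform_loss)
```

With uninformative scores the loss equals the log of the batch size, which is the value a model would reach by guessing uniformly among the in-batch passages.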
|
139 |
|
|
|
|
|
140 |
## Citation
|
141 |
|
142 |
```bibtex
|
143 |
+
@online{louis2024decouvrir,
|
144 |
+
author = {Antoine Louis},
|
145 |
+
title = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
|
146 |
+
publisher = {Hugging Face},
|
147 |
+
month = {mar},
|
148 |
+
year = {2024},
|
149 |
+
url = {https://huggingface.co/spaces/antoinelouis/decouvrir},
|
150 |
}
|
151 |
```
|