Commit 78311e3 by antoinelouis
1 parent: 02acde6

Update README.md

README.md CHANGED
@@ -10,6 +10,36 @@ tags:
|
|
10 |
- passage-reranking
|
11 |
library_name: sentence-transformers
|
12 |
base_model: dbmdz/electra-base-french-europeana-cased-discriminator
|
13 |
---
|
14 |
|
15 |
# crossencoder-electra-base-french-mmarcoFR
|
@@ -75,17 +105,10 @@ print(scores)
|
|
75 |
|
76 |
## Evaluation
|
77 |
|
78 |
-
|
79 |
-
|
80 |
-
|
81 |
-
|
82 |
-
| | model | Vocab. | #Param. | Size | RP | MRR@10 | R@10(↑) | R@20 | R@50 | R@100 |
|
83 |
-
|---:|:-----------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|-------:|---------:|---------:|-------:|-------:|--------:|
|
84 |
-
| 1 | [crossencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) | fr | 110M | 443MB | 35.65 | 50.44 | 82.95 | 91.50 | 96.80 | 98.80 |
|
85 |
-
| 2 | [crossencoder-mMiniLMv2-L12-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) | fr,99+ | 118M | 471MB | 34.37 | 51.01 | 82.23 | 90.60 | 96.45 | 98.40 |
|
86 |
-
| 3 | [crossencoder-distilcamembert-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR) | fr | 68M | 272MB | 27.28 | 43.71 | 80.30 | 89.10 | 95.55 | 98.60 |
|
87 |
-
| 4 | **crossencoder-electra-base-french-mmarcoFR** | fr | 110M | 443MB | 28.32 | 45.28 | 79.22 | 87.15 | 93.15 | 95.75 |
|
88 |
-
| 5 | [crossencoder-mMiniLMv2-L6-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L6-mmarcoFR) | fr,99+ | 107M | 428MB | 33.92 | 49.33 | 79.00 | 88.35 | 94.80 | 98.20 |
|
89 |
|
90 |
***
|
91 |
|
@@ -94,28 +117,29 @@ cross-encoder models fine-tuned on the same dataset. We report the R-precision (
|
|
94 |
#### Data
|
95 |
|
96 |
We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
|
97 |
-
that contains 8.8M passages and 539K training queries. We
|
98 |
-
[
|
99 |
-
|
|
|
100 |
|
101 |
#### Implementation
|
102 |
|
103 |
The model is initialized from the [dbmdz/electra-base-french-europeana-cased-discriminator](https://huggingface.co/dbmdz/electra-base-french-europeana-cased-discriminator) checkpoint and optimized via the binary cross-entropy loss
|
104 |
-
(as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one
|
105 |
-
with a batch size of
|
106 |
-
|
107 |
|
108 |
***
|
109 |
|
110 |
## Citation
|
111 |
|
112 |
```bibtex
|
113 |
-
@online{
|
114 |
-
|
115 |
-
|
116 |
-
|
117 |
-
|
118 |
-
|
119 |
-
|
120 |
}
|
121 |
```
|
|
|
10 |
- passage-reranking
|
11 |
library_name: sentence-transformers
|
12 |
base_model: dbmdz/electra-base-french-europeana-cased-discriminator
|
+ model-index:
+ - name: crossencoder-electra-base-french-mmarcoFR
+   results:
+   - task:
+       type: text-classification
+       name: Passage Reranking
+     dataset:
+       type: unicamp-dl/mmarco
+       name: mMARCO-fr
+       config: french
+       split: validation
+     metrics:
+     - type: recall_at_500
+       name: Recall@500
+       value: 0.0
+     - type: recall_at_100
+       name: Recall@100
+       value: 0.0
+     - type: recall_at_10
+       name: Recall@10
+       value: 0.0
+     - type: map_at_10
+       name: MAP@10
+       value: 0.0
+     - type: ndcg_at_10
+       name: nDCG@10
+       value: 0.0
+     - type: mrr_at_10
+       name: MRR@10
+       value: 0.0
|
43 |
---
|
44 |
|
45 |
# crossencoder-electra-base-french-mmarcoFR
|
|
|
105 |
|
106 |
## Evaluation
|
107 |
|
108 |
+
The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for which
|
109 |
+
a set of 1,000 passages containing the positive(s) and [ColBERTv2 hard negatives](https://huggingface.co/datasets/antoinelouis/msmarco-dev-small-negatives) needs
|
110 |
+
to be reranked. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs
|
111 |
+
(R@k). To see how it compares to other neural retrievers in French, check out the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.
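
As a rough illustration of two of the metrics reported above, the following minimal sketch computes MRR@10 and recall@k on toy rankings. The function names, query and passage IDs, and toy data are hypothetical and are not taken from the model's evaluation code; each ranked list is assumed to contain unique passage IDs.

```python
from typing import Dict, List, Set, Tuple


def reciprocal_rank_at_k(ranked_ids: List[str], positives: Set[str], k: int = 10) -> float:
    """Return 1/rank of the first relevant passage in the top k, or 0 if none is found."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in positives:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: List[str], positives: Set[str], k: int) -> float:
    """Return the fraction of relevant passages that appear in the top k."""
    if not positives:
        return 0.0
    hits = sum(1 for pid in ranked_ids[:k] if pid in positives)
    return hits / len(positives)


# Toy reranked lists (hypothetical): query id -> (ranked passage ids, relevant passage ids).
toy_runs: Dict[str, Tuple[List[str], Set[str]]] = {
    "q1": (["p3", "p1", "p7"], {"p1"}),
    "q2": (["p9", "p4", "p2"], {"p2", "p5"}),
}

# Average the per-query values to get the corpus-level metrics.
mrr_at_10 = sum(reciprocal_rank_at_k(r, pos) for r, pos in toy_runs.values()) / len(toy_runs)
recall_at_2 = sum(recall_at_k(r, pos, 2) for r, pos in toy_runs.values()) / len(toy_runs)

print(f"MRR@10 = {mrr_at_10:.4f}, R@2 = {recall_at_2:.4f}")
# prints: MRR@10 = 0.4167, R@2 = 0.5000
```

In the real setting each query would carry a reranked list of 1,000 candidates rather than three.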
|
112 |
|
113 |
***
|
114 |
|
|
|
117 |
#### Data
|
118 |
|
119 |
We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
|
120 |
+
that contains 8.8M passages and 539K training queries. We do not use the BM25 negatives provided by the official dataset but instead sample harder negatives mined from
|
121 |
+
12 distinct dense retrievers, using the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives#msmarco-hard-negativesjsonlgz)
|
122 |
+
distillation dataset. In total, we sample 2.6M training triplets of the form (query, passage, relevance) with a positive-to-negative ratio of 1 (i.e., 50% of the pairs are
|
123 |
+
relevant and 50% are irrelevant).
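
One plausible way to obtain such a balanced set is sketched below: each positive (query, passage) pair is matched with one randomly drawn hard negative for the same query. This is only an assumption about how a 1:1 ratio could be produced, not the author's actual pipeline; `build_balanced_pairs` and the input dictionaries are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def build_balanced_pairs(
    positives: Dict[str, List[str]],
    hard_negatives: Dict[str, List[str]],
    seed: int = 0,
) -> List[Tuple[str, str, int]]:
    """Emit (query, passage, relevance) samples with one negative per positive."""
    rng = random.Random(seed)
    samples: List[Tuple[str, str, int]] = []
    for query, pos_list in positives.items():
        negs = hard_negatives.get(query, [])
        if not negs:
            # Skip queries without mined negatives so the 1:1 ratio is preserved.
            continue
        for pos in pos_list:
            samples.append((query, pos, 1))
            samples.append((query, rng.choice(negs), 0))
    rng.shuffle(samples)
    return samples


# Toy inputs (hypothetical): query text -> positive / hard-negative passages.
toy_pairs = build_balanced_pairs(
    positives={"capitale de la France": ["Paris est la capitale de la France."]},
    hard_negatives={"capitale de la France": ["Lyon est une ville française.",
                                              "Bruxelles est la capitale de la Belgique."]},
)
print(len(toy_pairs), sum(label for _, _, label in toy_pairs))
```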
|
124 |
|
125 |
#### Implementation
|
126 |
|
127 |
The model is initialized from the [dbmdz/electra-base-french-europeana-cased-discriminator](https://huggingface.co/dbmdz/electra-base-french-europeana-cased-discriminator) checkpoint and optimized via the binary cross-entropy loss
|
128 |
+
(as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 80GB NVIDIA H100 GPU for 20k steps using the AdamW optimizer
|
129 |
+
with a batch size of 128 and a constant learning rate of 2e-5. We set the maximum sequence length of the concatenated question-passage pairs to 256 tokens.
|
130 |
+
We use the sigmoid function to get scores between 0 and 1.
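
To make the training objective and the scoring step concrete, here is a minimal pure-Python sketch of the sigmoid and of the binary cross-entropy computed from a raw logit. It illustrates the standard formulas only; the function names are hypothetical, and actual training would rely on a deep learning framework rather than this code.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw relevance logit to a score in [0, 1] (numerically stable form)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def bce_with_logits(logit: float, label: float) -> float:
    """Binary cross-entropy between sigmoid(logit) and a 0/1 label.

    Uses the stable identity max(x, 0) - x * y + log(1 + exp(-|x|)).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# A confident, correct prediction yields a small loss; a confident mistake a large one.
for logit, label in [(4.0, 1.0), (4.0, 0.0), (0.0, 1.0)]:
    print(f"logit={logit:+.1f} label={label:.0f} "
          f"score={sigmoid(logit):.4f} loss={bce_with_logits(logit, label):.4f}")
```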
|
131 |
|
132 |
***
|
133 |
|
134 |
## Citation
|
135 |
|
136 |
```bibtex
|
+ @online{louis2024decouvrir,
+   author    = {Antoine Louis},
+   title     = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
+   publisher = {Hugging Face},
+   month     = {mar},
+   year      = {2024},
+   url       = {https://huggingface.co/spaces/antoinelouis/decouvrir},
|
144 |
}
|
145 |
```
|