---
pipeline_tag: sentence-similarity
language: fr
license: apache-2.0
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- sentence-similarity
library_name: sentence-transformers
---
# crossencoder-electra-base-french-europeana-cased-discriminator-mmarcoFR

This is a [sentence-transformers](https://www.SBERT.net) model trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.

It performs cross-attention between the question and the passage of a pair and outputs a relevance score between 0 and 1. The model can be used for tasks like clustering or [semantic search](https://www.sbert.net/examples/applications/retrieve_rerank/README.html): given a query, pair it with each candidate passage (e.g., retrieved with BM25 or a bi-encoder), score every pair, then sort the passages in decreasing order of relevance according to the model's predictions.
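The re-ranking step described above can be sketched as follows. This is a minimal illustration, not the author's code: `rerank` and `toy_overlap_score` are hypothetical names, and the toy word-overlap scorer only stands in for a real scoring function so that the snippet runs without downloading a model. In practice, `score_fn` could be the `predict` method of a loaded `CrossEncoder`.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score each (query, passage) pair and sort passages by decreasing score."""
    pairs = [(query, passage) for passage in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked]


def toy_overlap_score(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical stand-in scorer: fraction of query words found in the passage."""
    scores = []
    for query, passage in pairs:
        query_words = set(query.lower().split())
        passage_words = set(passage.lower().split())
        scores.append(len(query_words & passage_words) / max(len(query_words), 1))
    return scores


# Candidate passages, e.g. as returned by a first-stage retriever.
candidates = [
    "La France a pour capitale Paris.",
    "Le chat dort sur le canapé.",
    "La capitale de la Belgique est Bruxelles.",
]
ranking = rerank("capitale France", candidates, toy_overlap_score)
for passage, score in ranking:
    print(f"{score:.2f}  {passage}")
```

Passing the scorer as an argument keeps the sorting logic independent of the model, so the same function works with any callable that returns one score per pair.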

## Usage
***

#### Sentence-Transformers

Using this model is straightforward once you have [sentence-transformers](https://www.SBERT.net) installed:

```bash
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import CrossEncoder

# Each pair is (query, candidate passage).
pairs = [('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')]

model = CrossEncoder('antoinelouis/crossencoder-electra-base-french-europeana-cased-discriminator-mmarcoFR')
scores = model.predict(pairs)  # one relevance score per pair
print(scores)
```

#### 🤗 Transformers

Without [sentence-transformers](https://www.SBERT.net), you can use the model as follows:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = 'antoinelouis/crossencoder-electra-base-french-europeana-cased-discriminator-mmarcoFR'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Each pair is (query, candidate passage).
pairs = [('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')]
features = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')

model.eval()
with torch.no_grad():
    scores = model(**features).logits  # raw logits, one row per pair
print(scores)
```
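The snippet above prints raw logits, which are unbounded. To obtain scores in the 0 to 1 range mentioned earlier, a logistic sigmoid can be applied. The sketch below assumes the model emits a single logit per pair, as the binary cross-entropy training objective suggests. `sigmoid` and `logits_to_scores` are hypothetical helpers written with the standard library only. On a tensor, `torch.sigmoid(scores)` performs the same mapping.

```python
import math
from typing import Iterable, List


def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid for a single value."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logits_to_scores(logits: Iterable[float]) -> List[float]:
    """Map raw logits to relevance scores strictly between 0 and 1."""
    return [sigmoid(x) for x in logits]


# Hypothetical logits for three (query, passage) pairs, not real model outputs.
example_logits = [2.0, 0.0, -3.0]
example_scores = logits_to_scores(example_logits)
print([round(s, 3) for s in example_scores])
```

Because the sigmoid is monotonic, the ranking of passages is the same whether it is computed from logits or from the resulting scores.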

## Evaluation
***

We evaluated the model on 500 queries randomly sampled from the mMARCO-fr train set and excluded from training. Each of these queries has at least one relevant passage and up to 200 irrelevant passages.

| r-precision | mrr@10 | recall@10 | recall@20 | recall@50 | recall@100 |
|------------:|-------:|----------:|----------:|----------:|-----------:|
| 28.32 | 45.28 | 79.22 | 87.15 | 93.15 | 95.75 |
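For readers unfamiliar with these metrics, here is a minimal per-query sketch using the usual information-retrieval definitions. The exact evaluation script is not given here, so the function names, the toy passage ids, and the assumption that the reported numbers are per-query values averaged over the 500 queries and expressed as percentages are all illustrative.

```python
from typing import List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top-k results."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)


def mrr_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant passage in the top-k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def r_precision(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant passages."""
    r = len(relevant_ids)
    hits = sum(1 for pid in ranked_ids[:r] if pid in relevant_ids)
    return hits / r


# Toy ranking for a single query (hypothetical passage ids).
ranked = ["p7", "p2", "p1", "p9", "p4"]
relevant = {"p2", "p1", "p4"}

print("recall@2   :", recall_at_k(ranked, relevant, k=2))
print("recall@5   :", recall_at_k(ranked, relevant, k=5))
print("mrr@10     :", mrr_at_k(ranked, relevant, k=10))
print("r-precision:", r_precision(ranked, relevant))
```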

Below, we compare its results with other cross-encoder models fine-tuned on the same dataset, sorted by decreasing recall@10:

| | model | r-precision | mrr@10 | recall@10 (↑) | recall@20 | recall@50 | recall@100 |
|---:|:---|---:|---:|---:|---:|---:|---:|
| 1 | [crossencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) | 35.65 | 50.44 | 82.95 | 91.5 | 96.8 | 98.8 |
| 2 | [crossencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR) | 34.37 | 51.01 | 82.23 | 90.6 | 96.45 | 98.4 |
| 3 | [crossencoder-mmarcoFR-mMiniLMv2-L12-H384-v1-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mmarcoFR-mMiniLMv2-L12-H384-v1-mmarcoFR) | 34.22 | 49.2 | 81.7 | 90.9 | 97.1 | 98.9 |
| 4 | [crossencoder-mpnet-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mpnet-base-mmarcoFR) | 29.68 | 46.13 | 80.45 | 87.9 | 93.15 | 96.6 |
| 5 | [crossencoder-distilcamembert-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-base-mmarcoFR) | 27.28 | 43.71 | 80.3 | 89.1 | 95.55 | 98.6 |
| 6 | [crossencoder-roberta-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-roberta-base-mmarcoFR) | 33.33 | 48.87 | 79.33 | 86.75 | 94.15 | 97.6 |
| 7 | **crossencoder-electra-base-french-europeana-cased-discriminator-mmarcoFR** | 28.32 | 45.28 | 79.22 | 87.15 | 93.15 | 95.75 |
| 8 | [crossencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR) | 33.92 | 49.33 | 79 | 88.35 | 94.8 | 98.2 |
| 9 | [crossencoder-msmarco-electra-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-electra-base-mmarcoFR) | 25.52 | 42.46 | 78.73 | 88.85 | 96.55 | 98.85 |
| 10 | [crossencoder-bert-base-uncased-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-bert-base-uncased-mmarcoFR) | 30.48 | 45.79 | 78.35 | 89.45 | 94.15 | 97.45 |
| 11 | [crossencoder-msmarco-MiniLM-L-12-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-12-v2-mmarcoFR) | 29.07 | 44.41 | 77.83 | 88.1 | 95.55 | 99 |
| 12 | [crossencoder-msmarco-MiniLM-L-6-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-6-v2-mmarcoFR) | 32.92 | 47.56 | 77.27 | 88.15 | 94.85 | 98.15 |
| 13 | [crossencoder-msmarco-MiniLM-L-4-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-4-v2-mmarcoFR) | 30.98 | 46.22 | 76.35 | 85.8 | 94.35 | 97.55 |
| 14 | [crossencoder-MiniLM-L6-H384-uncased-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L6-H384-uncased-mmarcoFR) | 29.23 | 45.12 | 76.08 | 83.7 | 92.65 | 97.45 |
| 15 | [crossencoder-electra-base-discriminator-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-base-discriminator-mmarcoFR) | 28.48 | 43.58 | 75.63 | 86.15 | 93.25 | 96.6 |
| 16 | [crossencoder-electra-small-discriminator-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-small-discriminator-mmarcoFR) | 31.83 | 45.97 | 75.13 | 84.95 | 94.55 | 98.15 |
| 17 | [crossencoder-distilroberta-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilroberta-base-mmarcoFR) | 28.22 | 42.85 | 74.13 | 84.08 | 94.2 | 98.5 |
| 18 | [crossencoder-msmarco-TinyBERT-L-6-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-TinyBERT-L-6-mmarcoFR) | 28.23 | 42.7 | 73.63 | 85.65 | 92.65 | 98.35 |
| 19 | [crossencoder-msmarco-TinyBERT-L-4-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-TinyBERT-L-4-mmarcoFR) | 28.6 | 43.19 | 72.17 | 81.95 | 92.8 | 97.4 |
| 20 | [crossencoder-msmarco-MiniLM-L-2-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-2-v2-mmarcoFR) | 30.82 | 44.3 | 72.03 | 82.65 | 93.35 | 98.1 |
| 21 | [crossencoder-distilbert-base-uncased-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilbert-base-uncased-mmarcoFR) | 25.47 | 40.11 | 71.37 | 85.6 | 93.85 | 97.95 |
| 22 | [crossencoder-msmarco-TinyBERT-L-2-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-TinyBERT-L-2-v2-mmarcoFR) | 31.08 | 43.88 | 71.3 | 81.43 | 92.6 | 98.1 |

## Training
***

#### Background

We used the [dbmdz/electra-base-french-europeana-cased-discriminator](https://huggingface.co/dbmdz/electra-base-french-europeana-cased-discriminator) model and fine-tuned it with a binary cross-entropy loss function on 1M question-passage pairs in French, with one relevant pair in every four (i.e., 25% of the pairs are relevant and 75% are irrelevant).
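As a reminder of what the binary cross-entropy objective measures, the sketch below computes it by hand on a toy batch. It is illustrative only: the function name, the probabilities, and the clipping constant `eps` are assumptions, and actual training would typically rely on a framework loss applied to logits rather than this pure-Python version.

```python
import math
from typing import Sequence


def binary_cross_entropy(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    eps = 1e-12  # clipping constant to avoid log(0); an arbitrary choice
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Hypothetical batch with one relevant pair in four, mirroring the 25%/75% mix.
labels = [1, 0, 0, 0]
confident = [0.9, 0.1, 0.2, 0.1]  # scores that mostly agree with the labels
uncertain = [0.5, 0.5, 0.5, 0.5]  # scores that carry no information

loss_confident = binary_cross_entropy(confident, labels)
loss_uncertain = binary_cross_entropy(uncertain, labels)
print(f"confident: {loss_confident:.4f}")
print(f"uncertain: {loss_uncertain:.4f}")
```

The uninformative predictions yield a loss of ln 2 per pair, while predictions that agree with the labels drive the loss toward zero.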

#### Hyperparameters

We trained the model on a single Tesla V100 GPU with 32 GB of memory for 10 epochs (i.e., 312.4k steps) using a batch size of 32. We used the AdamW optimizer with an initial learning rate of 2e-05, a weight decay of 0.01, learning rate warmup over the first 500 steps, and linear decay of the learning rate afterwards. The sequence length was limited to 512 tokens.
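The learning rate schedule described above can be sketched as a simple function of the step count. This is a hedged illustration: `linear_schedule_lr` is a hypothetical helper, `total_steps` uses the rounded 312.4k figure, and decaying all the way to zero at the final step is an assumption (it matches the common behaviour of linear schedules with warmup, but the exact end value is not stated here).

```python
def linear_schedule_lr(
    step: int,
    base_lr: float = 2e-5,
    warmup_steps: int = 500,
    total_steps: int = 312_400,
) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: go linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 156_450, 312_400):
    print(f"step {step:>7}: lr = {linear_schedule_lr(step):.2e}")
```

The step count itself is consistent with the other figures: roughly 1M pairs at a batch size of 32 gives about 31k steps per epoch, hence about 312k steps over 10 epochs.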

#### Data

We used the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset to fine-tune our model. mMARCO is a multilingual, machine-translated version of the MS MARCO dataset, a popular large-scale information retrieval (IR) dataset.

## Citation
***

```bibtex
@online{louis2023,
   author    = {Antoine Louis},
   title     = {crossencoder-electra-base-french-europeana-cased-discriminator-mmarcoFR: A Cross-Encoder Model Trained on 1M sentence pairs in French},
   publisher = {Hugging Face},
   month     = {september},
   year      = {2023},
   url       = {https://huggingface.co/antoinelouis/crossencoder-electra-base-french-europeana-cased-discriminator-mmarcoFR},
}
```