|
--- |
|
pipeline_tag: text-classification |
|
language: fr |
|
license: mit |
|
datasets: |
|
- unicamp-dl/mmarco |
|
metrics: |
|
- recall |
|
tags: |
|
- passage-reranking |
|
library_name: sentence-transformers |
|
base_model: camembert-base |
|
--- |
|
|
|
# crossencoder-distilcamembert-mmarcoFR |
|
|
|
This is a cross-encoder model for French. It performs cross-attention between a question-passage pair and outputs a relevance score. |
|
The model should be used as a reranker for semantic search: given a query and a set of potentially relevant passages retrieved by an efficient first-stage |
|
retrieval system (e.g., BM25 or a fine-tuned dense single-vector bi-encoder), encode each query-passage pair and sort the passages in a decreasing order of |
|
relevance according to the model's predicted scores. |
|
|
|
## Usage |
|
|
|
Here are some examples for using the model with [Sentence-Transformers](#using-sentence-transformers), [FlagEmbedding](#using-flagembedding), or [Huggingface Transformers](#using-huggingface-transformers). |
|
|
|
#### Using Sentence-Transformers |
|
|
|
Start by installing the [library](https://www.SBERT.net): `pip install -U sentence-transformers`. Then, you can use the model like this: |
|
|
|
```python |
|
from sentence_transformers import CrossEncoder |
|
|
|
pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2') , ('Question', 'Paragraphe 3')] |
|
|
|
model = CrossEncoder('antoinelouis/crossencoder-distilcamembert-mmarcoFR') |
|
scores = model.predict(pairs) |
|
print(scores) |
|
``` |
|
|
|
#### Using FlagEmbedding |
|
|
|
Start by installing the [library](https://github.com/FlagOpen/FlagEmbedding/): `pip install -U FlagEmbedding`. Then, you can use the model like this: |
|
|
|
```python |
|
from FlagEmbedding import FlagReranker |
|
|
|
pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2') , ('Question', 'Paragraphe 3')] |
|
|
|
reranker = FlagReranker('antoinelouis/crossencoder-distilcamembert-mmarcoFR') |
|
scores = reranker.compute_score(pairs) |
|
print(scores) |
|
``` |
|
|
|
#### Using HuggingFace Transformers |
|
|
|
Start by installing the [library](https://huggingface.co/docs/transformers): `pip install -U transformers`. Then, you can use the model like this: |
|
|
|
```python |
|
import torch |
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification |
|
|
|
pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2') , ('Question', 'Paragraphe 3')] |
|
|
|
tokenizer = AutoTokenizer.from_pretrained('antoinelouis/crossencoder-distilcamembert-mmarcoFR') |
|
model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/crossencoder-distilcamembert-mmarcoFR') |
|
model.eval() |
|
|
|
with torch.no_grad(): |
|
inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512) |
|
scores = model(**inputs, return_dict=True).logits.view(-1, ).float() |
|
print(scores) |
|
``` |
|
|
|
*** |
|
|
|
## Evaluation |
|
|
|
We evaluate the model on 500 random training queries from [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/) (which were excluded from training) by reranking |
|
subsets of candidate passages comprising of at least one relevant and up to 200 BM25 negative passages for each query. Below, we compare the model performance with other |
|
cross-encoder models fine-tuned on the same dataset. We report the R-precision (RP), mean reciprocal rank (MRR), and recall at various cut-offs (R@k). |
|
|
|
| | model | Vocab. | #Param. | Size | RP | MRR@10 | R@10(↑) | R@20 | R@50 | R@100 | |
|
|---:|:-----------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|-------:|---------:|---------:|-------:|-------:|--------:| |
|
| 1 | [crossencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) | fr | 110M | 443MB | 35.65 | 50.44 | 82.95 | 91.50 | 96.80 | 98.80 | |
|
| 2 | [crossencoder-mMiniLMv2-L12-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) | fr,99+ | 118M | 471MB | 34.37 | 51.01 | 82.23 | 90.60 | 96.45 | 98.40 | |
|
| 3 | [crossencoder-mpnet-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mpnet-base-mmarcoFR) | en | 109M | 438MB | 29.68 | 46.13 | 80.45 | 87.90 | 93.15 | 96.60 | |
|
| 4 | **crossencoder-distilcamembert-mmarcoFR** | fr | 68M | 272MB | 27.28 | 43.71 | 80.30 | 89.10 | 95.55 | 98.60 | |
|
| 5 | [crossencoder-electra-base-french-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-base-french-mmarcoFR) | fr | 110M | 443MB | 28.32 | 45.28 | 79.22 | 87.15 | 93.15 | 95.75 | |
|
| 6 | [crossencoder-mMiniLMv2-L6-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L6-mmarcoFR) | fr,99+ | 107M | 428MB | 33.92 | 49.33 | 79.00 | 88.35 | 94.80 | 98.20 | |
|
|
|
*** |
|
|
|
## Training |
|
|
|
#### Data |
|
|
|
We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO |
|
that contains 8.8M passages and 539K training queries. We sample 1M question-passage pairs from the official ~39.8M |
|
[training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset) with a positive-to-negative ratio of 4 (i.e., 25% of the pairs are |
|
relevant and 75% are irrelevant). |
|
|
|
#### Implementation |
|
|
|
The model is initialized from the [cmarkea/distilcamembert-base](https://huggingface.co/cmarkea/distilcamembert-base) checkpoint and optimized via the binary cross-entropy loss |
|
(as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 32GB NVIDIA V100 GPU for 10 epochs (i.e., 312.4k steps) using the AdamW optimizer |
|
with a batch size of 32, a peak learning rate of 2e-5 with warm up along the first 500 steps and linear scheduling. We set the maximum sequence length of the |
|
concatenated question-passage pairs to 512 tokens. We use the sigmoid function to get scores between 0 and 1. |
|
|
|
*** |
|
|
|
## Citation |
|
|
|
```bibtex |
|
@online{louis2023, |
|
author = 'Antoine Louis', |
|
title = 'crossencoder-distilcamembert-mmarcoFR: A Cross-Encoder Model Trained on 1M sentence pairs in French', |
|
publisher = 'Hugging Face', |
|
month = 'september', |
|
year = '2023', |
|
url = 'https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR', |
|
} |
|
``` |