antoinelouis/crossencoder-distilcamembert-mmarcoFR

@@ -1,69 +1,83 @@
 ---
-pipeline_tag: sentence-similarity
 language: fr
-license: apache-2.0
 datasets:
 - unicamp-dl/mmarco
 metrics:
 - recall
 tags:
-- sentence-similarity
 library_name: sentence-transformers
 ---
-# crossencoder-distilcamembert-mmarcoFR
-This is a [sentence-transformers](https://www.SBERT.net) model trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
-It performs cross-attention between a question-passage pair and outputs a relevance score between 0 and 1. The model can be used for tasks like clustering or [semantic search](https://www.sbert.net/examples/applications/retrieve_rerank/README.html): given a query, encode the latter with some candidate passages -- e.g., retrieved with BM25 or a biencoder -- then sort the passages in a decreasing order of relevance according to the model's predictions.
 ## Usage
-***
-#### Sentence-Transformers
-Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
-```bash
-pip install -U sentence-transformers
-```
-Then you can use the model like this:
 ```python
 from sentence_transformers import CrossEncoder
-pairs = [('Query', 'Paragraph1'), ('Query', 'Paragraph2') , ('Query', 'Paragraph3')]
 model = CrossEncoder('antoinelouis/crossencoder-distilcamembert-mmarcoFR')
 scores = model.predict(pairs)
 print(scores)
 ```
-#### 🤗 Transformers
-Without [sentence-transformers](https://www.SBERT.net), you can use the model as follows:
 ```python
-from transformers import AutoTokenizer, AutoModelForSequenceClassification
-import torch
-model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/crossencoder-distilcamembert-mmarcoFR')
-tokenizer = AutoTokenizer.from_pretrained('antoinelouis/crossencoder-distilcamembert-mmarcoFR')
-pairs = [('Query', 'Paragraph1'), ('Query', 'Paragraph2') , ('Query', 'Paragraph3')]
-features = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')
 model.eval()
 with torch.no_grad():
-    scores = model(**features).logits
 print(scores)
 ```
-## Evaluation
 ***
-We evaluated the model on 500 random queries from the mMARCO-fr train set (which were excluded from training). Each of these queries has at least one relevant and up to 200 irrelevant passages.
-Below, we compare the model performance with other cross-encoder models fine-tuned on the same dataset. We report the R-precision (RP), mean reciprocal rank (MRR), and recall at various cut-offs (R@k).
 |    | model                                                                                                                        | Vocab. | #Param. |  Size |     RP |   MRR@10 |  R@10(↑) |   R@20 |   R@50 |   R@100 |
 |---:|:-----------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|-------:|---------:|---------:|-------:|-------:|--------:|
@@ -74,23 +88,27 @@ Below, we compare the model performance with other cross-encoder models fine-tun
 |  5 | [crossencoder-electra-base-french-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-base-french-mmarcoFR)   |     fr |    110M | 443MB |  28.32 |    45.28 |    79.22 |  87.15 |  93.15 |   95.75 |
 |  6 | [crossencoder-mMiniLMv2-L6-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L6-mmarcoFR)                 | fr,99+ |    107M | 428MB |  33.92 |    49.33 |    79.00 |  88.35 |  94.80 |   98.20 |
-## Training
 ***
-#### Background
-We used the [cmarkea/distilcamembert-base](https://huggingface.co/cmarkea/distilcamembert-base) model and fine-tuned it with a binary cross-entropy loss function on 1M question-passage pairs in French with a positive-to-negative ratio of 4 (i.e., 25% of the pairs are relevant and 75% are irrelevant).
-#### Hyperparameters
-We trained the model on a single Tesla V100 GPU with 32GBs of memory during 10 epochs (i.e., 312.4k steps) using a batch size of 32. We used the adamw optimizer with an initial learning rate of 2e-05, weight decay of 0.01, learning rate warmup over the first 500 steps, and linear decay of the learning rate. The sequence length was limited to 512 tokens.
-#### Data
-We used the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset to fine-tune our model. mMARCO is a multi-lingual machine-translated version of the MS MARCO dataset, a popular large-scale IR dataset.
 ## Citation
-***
 ```bibtex
 @online{louis2023,

 ---
+pipeline_tag: text-classification
 language: fr
+license: mit
 datasets:
 - unicamp-dl/mmarco
 metrics:
 - recall
 tags:
+- passage-reranking
 library_name: sentence-transformers
+base_model: cmarkea/distilcamembert-base
 ---
+# crossencoder-distilcamembert-mmarcoFR
+This is a cross-encoder model for French. It performs cross-attention between a question-passage pair and outputs a relevance score.
+The model should be used as a reranker for semantic search: given a query and a set of potentially relevant passages retrieved by an efficient first-stage
+retrieval system (e.g., BM25 or a fine-tuned dense single-vector bi-encoder), encode each query-passage pair and sort the passages in decreasing order of
+relevance according to the model's predicted scores.
 ## Usage
+Here are some examples of using the model with [Sentence-Transformers](#using-sentence-transformers), [FlagEmbedding](#using-flagembedding), or [HuggingFace Transformers](#using-huggingface-transformers).
+#### Using Sentence-Transformers
+Start by installing the [library](https://www.SBERT.net): `pip install -U sentence-transformers`. Then, you can use the model like this:
 ```python
 from sentence_transformers import CrossEncoder
+pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]
 model = CrossEncoder('antoinelouis/crossencoder-distilcamembert-mmarcoFR')
 scores = model.predict(pairs)
 print(scores)
 ```
+#### Using FlagEmbedding
+Start by installing the [library](https://github.com/FlagOpen/FlagEmbedding/): `pip install -U FlagEmbedding`. Then, you can use the model like this:
 ```python
+from FlagEmbedding import FlagReranker
+pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]
+reranker = FlagReranker('antoinelouis/crossencoder-distilcamembert-mmarcoFR')
+scores = reranker.compute_score(pairs)
+print(scores)
+```
+#### Using HuggingFace Transformers
+Start by installing the [library](https://huggingface.co/docs/transformers): `pip install -U transformers`. Then, you can use the model like this:
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]
+tokenizer = AutoTokenizer.from_pretrained('antoinelouis/crossencoder-distilcamembert-mmarcoFR')
+model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/crossencoder-distilcamembert-mmarcoFR')
 model.eval()
 with torch.no_grad():
+    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
+    scores = model(**inputs, return_dict=True).logits.view(-1).float()
 print(scores)
 ```
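The scores returned by any of the snippets above can then be turned into a ranking. The sketch below is a minimal, library-free illustration: `sigmoid` and `rerank` are hypothetical helper names, and the logits are made-up values rather than real model outputs. It maps raw logits (such as those returned by the Transformers snippet) to the 0-1 range with the sigmoid mentioned in the Training section, then sorts the passages in decreasing order of score.

```python
import math

def sigmoid(logit: float) -> float:
    """Map a raw logit to a score in (0, 1), avoiding overflow for large |logit|."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)

def rerank(passages, logits):
    """Return (passage, score) tuples sorted by decreasing relevance score."""
    scored = [(passage, sigmoid(logit)) for passage, logit in zip(passages, logits)]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Illustrative values only: these logits are invented, not produced by the model.
passages = ['Paragraphe 1', 'Paragraphe 2', 'Paragraphe 3']
logits = [-1.2, 3.4, 0.3]

ranking = rerank(passages, logits)
for passage, score in ranking:
    print(f'{score:.3f}  {passage}')
```

Because the sigmoid is monotonic, sorting by raw logits gives the same order; the sigmoid only matters when the scores need to be read as values between 0 and 1.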
 ***
+## Evaluation
+We evaluate the model on 500 random queries from the [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/) training set (held out from training) by reranking
+subsets of candidate passages comprising at least one relevant and up to 200 BM25 negative passages for each query. Below, we compare the model performance with other
+cross-encoder models fine-tuned on the same dataset. We report the R-precision (RP), mean reciprocal rank (MRR), and recall at various cut-offs (R@k).
 |    | model                                                                                                                        | Vocab. | #Param. |  Size |     RP |   MRR@10 |  R@10(↑) |   R@20 |   R@50 |   R@100 |
 |---:|:-----------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|-------:|---------:|---------:|-------:|-------:|--------:|
 |  5 | [crossencoder-electra-base-french-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-base-french-mmarcoFR)   |     fr |    110M | 443MB |  28.32 |    45.28 |    79.22 |  87.15 |  93.15 |   95.75 |
 |  6 | [crossencoder-mMiniLMv2-L6-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L6-mmarcoFR)                 | fr,99+ |    107M | 428MB |  33.92 |    49.33 |    79.00 |  88.35 |  94.80 |   98.20 |
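For readers unfamiliar with the metrics in the table, the sketch below shows how they are commonly computed for a single query, assuming the standard definitions. The function names and the toy ranking are illustrative, and this is not the evaluation script used to produce the numbers above. Per-query values would then be averaged over all evaluation queries.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant passage in the top k (0 if none)."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def r_precision(ranked_ids, relevant_ids):
    """Precision at cut-off R, where R is the number of relevant passages."""
    r = len(relevant_ids)
    hits = sum(1 for pid in ranked_ids[:r] if pid in relevant_ids)
    return hits / r

# Toy example: passage ids in the order produced by a reranker (assumed unique).
ranked = ['p3', 'p1', 'p7', 'p2', 'p9']
relevant = {'p1', 'p2'}

print('R@2   =', recall_at_k(ranked, relevant, 2))
print('R@4   =', recall_at_k(ranked, relevant, 4))
print('MRR@10 =', mrr_at_k(ranked, relevant, 10))
print('RP    =', r_precision(ranked, relevant))
```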
 ***
+## Training
+#### Data
+We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
+that contains 8.8M passages and 539K training queries. We sample 1M question-passage pairs from the official ~39.8M
+[training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset), keeping one relevant pair for every three irrelevant ones (i.e., 25% of the pairs are
+relevant and 75% are irrelevant).
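One way to obtain such a 25%/75% mix from (query, positive, negative) triples is sketched below. This is an assumption for illustration only: the exact sampling procedure is not described here, and `build_pairs`, its parameters, and the toy triples are hypothetical.

```python
def build_pairs(triples, max_pairs, positive_every=4):
    """Turn (query, positive, negative) triples into (query, passage, label) pairs.

    Each triple contributes exactly one pair: every `positive_every`-th pair
    uses the positive passage (label 1), the others use the negative (label 0).
    """
    pairs = []
    for index, (query, positive, negative) in enumerate(triples):
        if len(pairs) >= max_pairs:
            break
        if index % positive_every == 0:
            pairs.append((query, positive, 1))
        else:
            pairs.append((query, negative, 0))
    return pairs

# Toy triples standing in for the real training triples.
triples = [(f'question {i}', f'pos {i}', f'neg {i}') for i in range(8)]
pairs = build_pairs(triples, max_pairs=8)

n_pos = sum(label for _, _, label in pairs)
print(f'{n_pos} relevant, {len(pairs) - n_pos} irrelevant')
```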
+#### Implementation
+The model is initialized from the [cmarkea/distilcamembert-base](https://huggingface.co/cmarkea/distilcamembert-base) checkpoint and optimized with a binary cross-entropy loss
+(as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 32GB NVIDIA V100 GPU for 10 epochs (i.e., 312.4k steps) using the AdamW optimizer
+with a batch size of 32 and a peak learning rate of 2e-5, with warmup over the first 500 steps followed by linear decay. We set the maximum sequence length of the
+concatenated question-passage pairs to 512 tokens. We use the sigmoid function to get scores between 0 and 1.
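The learning-rate schedule described above can be sketched as a plain function of the step index. This assumes that the warmup rises linearly from 0 to the peak and that the decay reaches 0 on the last step (the behaviour of the common linear-with-warmup schedule); these endpoints are not stated here, and `learning_rate_at` is a hypothetical name.

```python
PEAK_LR = 2e-5         # peak learning rate reported above
WARMUP_STEPS = 500     # warmup length reported above
TOTAL_STEPS = 312_400  # 10 epochs, as reported above

def learning_rate_at(step: int) -> float:
    """Learning rate at a given optimizer step (assumed linear warmup, then linear decay to 0)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)

for step in (0, 250, 500, 156_450, TOTAL_STEPS):
    print(step, learning_rate_at(step))
```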
+***
 ## Citation
 ```bibtex
 @online{louis2023,