antoinelouis committed
Commit
118908f
1 Parent(s): 81255f4

Create README.md

Files changed (1)
  1. README.md +136 -0
README.md ADDED
@@ -0,0 +1,136 @@
---
pipeline_tag: text-classification
language: fr
license: mit
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- passage-reranking
library_name: sentence-transformers
base_model: almanach/camemberta-base
model-index:
- name: crossencoder-camemberta-base-mmarcoFR
  results:
  - task:
      type: text-classification
      name: Passage Reranking
    dataset:
      type: unicamp-dl/mmarco
      name: mMARCO-fr
      config: french
      split: validation
    metrics:
    - type: recall_at_100
      name: Recall@100
      value: 86.43
    - type: recall_at_10
      name: Recall@10
      value: 61.23
    - type: mrr_at_10
      name: MRR@10
      value: 35.25
---

# crossencoder-camemberta-base-mmarcoFR

This is a cross-encoder model for French. It performs cross-attention over a question-passage pair and outputs a relevance score.
The model should be used as a reranker for semantic search: given a query and a set of potentially relevant passages retrieved by an efficient first-stage
retrieval system (e.g., BM25 or a fine-tuned dense single-vector bi-encoder), encode each query-passage pair and sort the passages in decreasing order of
relevance according to the model's predicted scores.

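
The reranking step described above can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: `rerank` and `toy_overlap_scorer` are hypothetical names, and the toy scorer is only a stand-in (a word-overlap heuristic) so that the example runs without downloading the model.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_pairs: Callable[[List[Tuple[str, str]]], Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and sort passages by decreasing score."""
    pairs = [(query, passage) for passage in passages]
    scores = score_pairs(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked]


def toy_overlap_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical stand-in scorer: fraction of query words found in the passage."""
    scores = []
    for query, passage in pairs:
        query_words = set(query.lower().split())
        passage_words = set(passage.lower().split())
        scores.append(len(query_words & passage_words) / max(len(query_words), 1))
    return scores


# Candidates as they might come back from a first-stage retriever (made-up data).
candidates = [
    "La capitale de la France est Paris.",
    "Le camembert est un fromage normand.",
    "La capitale de la Belgique est Bruxelles.",
]
ranking = rerank("capitale de la France", candidates, toy_overlap_scorer)
for passage, score in ranking:
    print(f"{score:.2f}  {passage}")
```

With the actual model, the `predict` method of a Sentence-Transformers `CrossEncoder` instance could presumably be passed as `score_pairs` in place of the toy scorer, since it takes a list of pairs and returns one score per pair.
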
## Usage

Here are some examples of how to use the model with [Sentence-Transformers](#using-sentence-transformers), [FlagEmbedding](#using-flagembedding), or [Hugging Face Transformers](#using-huggingface-transformers).

#### Using Sentence-Transformers

Start by installing the [library](https://www.SBERT.net): `pip install -U sentence-transformers`. Then, you can use the model like this:

```python
from sentence_transformers import CrossEncoder

# Each pair is (query, candidate passage).
pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]

model = CrossEncoder('antoinelouis/crossencoder-camemberta-base-mmarcoFR')
scores = model.predict(pairs)  # one relevance score per pair
print(scores)
```

#### Using FlagEmbedding

Start by installing the [library](https://github.com/FlagOpen/FlagEmbedding/): `pip install -U FlagEmbedding`. Then, you can use the model like this:

```python
from FlagEmbedding import FlagReranker

# Each pair is (query, candidate passage).
pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]

reranker = FlagReranker('antoinelouis/crossencoder-camemberta-base-mmarcoFR')
scores = reranker.compute_score(pairs)
print(scores)
```

#### Using HuggingFace Transformers

Start by installing the [library](https://huggingface.co/docs/transformers): `pip install -U transformers`. Then, you can use the model like this:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Each pair is (query, candidate passage).
pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]

tokenizer = AutoTokenizer.from_pretrained('antoinelouis/crossencoder-camemberta-base-mmarcoFR')
model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/crossencoder-camemberta-base-mmarcoFR')
model.eval()

with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    # Raw logits, one per pair; torch.sigmoid(scores) maps them to the 0-1 range.
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
    print(scores)
```

***

## Evaluation

The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries. For each query,
a set of 1,000 passages containing the positive(s) and [ColBERTv2 hard negatives](https://huggingface.co/datasets/antoinelouis/msmarco-dev-small-negatives) must
be reranked. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k): as listed in the metadata above, the model reaches an MRR@10 of 35.25,
an R@10 of 61.23, and an R@100 of 86.43. To see how it compares to other neural retrievers in French, check out
the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.

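
For readers unfamiliar with these metrics, here is a small sketch of how MRR@k and R@k are commonly computed from ranked passage lists. It is not the evaluation script used for this model, whose exact definitions are not given here; the function names and the toy rankings are made up for illustration. The functions return fractions between 0 and 1, whereas the scores above appear to be expressed as percentages.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]], positives: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean over queries of 1/rank of the first relevant passage in the top k (0 if none)."""
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        for rank, passage_id in enumerate(ranked_ids[:k], start=1):
            if passage_id in positives[query_id]:
                total += 1.0 / rank
                break  # only the first relevant passage counts
    return total / len(rankings)


def recall_at_k(rankings: Dict[str, List[str]], positives: Dict[str, Set[str]], k: int) -> float:
    """Mean over queries of the fraction of relevant passages found in the top k."""
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        relevant = positives[query_id]
        hits = len(relevant.intersection(ranked_ids[:k]))
        total += hits / len(relevant)
    return total / len(rankings)


# Toy example: two queries with reranked passage ids (hypothetical data).
rankings = {"q1": ["p3", "p1", "p7"], "q2": ["p2", "p9", "p4"]}
positives = {"q1": {"p1"}, "q2": {"p4", "p8"}}

print(f"MRR@10: {mrr_at_k(rankings, positives, k=10):.4f}")
print(f"R@2:    {recall_at_k(rankings, positives, k=2):.4f}")
print(f"R@3:    {recall_at_k(rankings, positives, k=3):.4f}")
```
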
***

## Training

#### Data

We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
that contains 8.8M passages and 539K training queries. We do not use the BM25 negatives provided by the official dataset but instead sample harder negatives mined from
12 distinct dense retrievers, using the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives#msmarco-hard-negativesjsonlgz)
distillation dataset. In the end, we sample 2.6M training triplets of the form (query, passage, relevance) with a positive-to-negative ratio of 1 (i.e., 50% of the triplets are
relevant and 50% are irrelevant).

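
As a rough illustration of a 1:1 positive-to-negative sampling scheme, the sketch below pairs every positive passage with one randomly drawn hard negative for the same query. The actual sampling procedure is not described beyond the ratio, so the function name, the input dictionaries, and the one-negative-per-positive strategy are all assumptions.

```python
import random
from typing import Dict, List, Tuple


def build_balanced_samples(
    positives: Dict[str, List[str]],
    hard_negatives: Dict[str, List[str]],
    seed: int = 42,
) -> List[Tuple[str, str, int]]:
    """Return shuffled (query, passage, relevance) triplets with as many 0s as 1s."""
    rng = random.Random(seed)
    samples: List[Tuple[str, str, int]] = []
    for query, relevant_passages in positives.items():
        candidates = hard_negatives.get(query, [])
        if not candidates:
            continue  # skip queries without mined negatives to keep the ratio at 1
        for passage in relevant_passages:
            samples.append((query, passage, 1))
            samples.append((query, rng.choice(candidates), 0))
    rng.shuffle(samples)
    return samples


# Hypothetical toy inputs; real data would come from mMARCO and the mined negatives.
toy_positives = {
    "question 1": ["passage pertinent A"],
    "question 2": ["passage pertinent B", "passage pertinent C"],
}
toy_negatives = {
    "question 1": ["passage non pertinent X", "passage non pertinent Y"],
    "question 2": ["passage non pertinent Z"],
}
samples = build_balanced_samples(toy_positives, toy_negatives)
print(len(samples), "triplets,", sum(label for _, _, label in samples), "positive")
```
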
#### Implementation

The model is initialized from the [almanach/camemberta-base](https://huggingface.co/almanach/camemberta-base) checkpoint and optimized with a binary cross-entropy loss
(as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 80GB NVIDIA H100 GPU for 20k steps using the AdamW optimizer
with a batch size of 128 and a constant learning rate of 2e-5. We set the maximum sequence length of the concatenated question-passage pairs to 256 tokens.
A sigmoid function is applied to the model's output logit to obtain scores between 0 and 1.

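
To make the objective concrete, here is a plain-Python sketch of the sigmoid and of binary cross-entropy computed from a logit, written in the numerically stable form that deep learning libraries typically use. It illustrates the formulas only and is not the training code; the function names are illustrative. The last lines also work out how many training examples 20k steps at batch size 128 correspond to, which is close to the 2.6M sampled triplets and would be consistent with roughly one pass over the data, although this is not stated explicitly.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit to a score in (0, 1), avoiding overflow for large |logit|."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


def binary_cross_entropy(logit: float, label: int) -> float:
    """Stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# A relevant pair (label 1) scored with a high logit incurs a small loss;
# the same logit on an irrelevant pair (label 0) is penalized heavily.
loss_relevant = binary_cross_entropy(4.0, 1)
loss_irrelevant = binary_cross_entropy(4.0, 0)
print(f"score={sigmoid(4.0):.4f}  loss(y=1)={loss_relevant:.4f}  loss(y=0)={loss_irrelevant:.4f}")

# Training budget: number of examples seen over the whole run.
steps, batch_size = 20_000, 128
examples_seen = steps * batch_size
print(f"examples seen: {examples_seen:,}")
```
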
***

## Citation

```bibtex
@online{louis2024decouvrir,
    author = {Antoine Louis},
    title = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
    publisher = {Hugging Face},
    month = {mar},
    year = {2024},
    url = {https://huggingface.co/spaces/antoinelouis/decouvrir},
}
```