psorianom committed on
Commit
88d608e
1 Parent(s): 633f362

create README.md

  1. README.md +202 -0
README.md ADDED
@@ -0,0 +1,202 @@
+ ---
+ language: fr
+ datasets:
+ - piaf
+ - FQuAD
+ - SQuAD-FR
+ ---
+
+ # dpr-ctx_encoder-fr_qa-camembert
+
+ ## Description
+
+ French [DPR model](https://arxiv.org/abs/2004.04906) that uses [CamemBERT](https://arxiv.org/abs/1911.03894) as its base model, fine-tuned on a combination of three French Q&A datasets.
+ ## Data
+ ### French Q&A
+ We use a combination of three French Q&A datasets:
+
+ 1. [PIAFv1.1](https://www.data.gouv.fr/en/datasets/piaf-le-dataset-francophone-de-questions-reponses/)
+ 2. [FQuADv1.0](https://fquad.illuin.tech/)
+ 3. [SQuAD-FR (SQuAD automatically translated to French)](https://github.com/Alikabbadj/French-SQuAD)
+
+ ### Training
+
+ We use 90,562 randomly selected questions for `train` and 22,391 for `dev`. No question in `train` appears in `dev`. For each question, we have a single `positive_context` (the paragraph that contains the answer to the question) and around 30 `hard_negative_contexts`. Hard negative contexts are found by querying an Elasticsearch instance (BM25 retrieval) and keeping the top-k candidates **that do not contain the answer**.
+
+ The files are available [here](https://drive.google.com/file/d/1W5Jm3sqqWlsWsx2sFpA39Ewn33PaLQ7U/view?usp=sharing).
+
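The hard-negative selection described above can be sketched as follows. This is a minimal illustration, not the script used to build the released files: `bm25_candidates` is a hypothetical stand-in for the ranked passages returned by a BM25 query, the case-insensitive substring test for "contains the answer" is an assumption, and the record keys are assumed to follow the JSON layout used by the official DPR repository.

```python
from typing import Dict, List


def select_hard_negatives(candidates: List[str], answer: str, k: int = 30) -> List[str]:
    """Keep the first k ranked candidates that do not contain the answer."""
    answer_lc = answer.lower()
    negatives = [text for text in candidates if answer_lc not in text.lower()]
    return negatives[:k]


def build_record(question: str, answer: str, positive: str,
                 candidates: List[str], k: int = 30) -> Dict:
    """Assemble one training example (keys assumed from the DPR repo format)."""
    # The positive passage itself is never a negative.
    pool = [c for c in candidates if c != positive]
    negatives = select_hard_negatives(pool, answer, k)
    return {
        "question": question,
        "answers": [answer],
        "positive_ctxs": [{"title": "", "text": positive}],
        "hard_negative_ctxs": [{"title": "", "text": t} for t in negatives],
    }


# Hypothetical BM25 output, best match first.
bm25_candidates = [
    "Paris est la capitale de la France.",
    "Lyon est une grande ville de France.",
    "La capitale de l'Italie est Rome.",
    "Paris accueille de nombreux musées.",
]
record = build_record(
    question="Quelle est la capitale de la France ?",
    answer="Paris",
    positive="Paris est la capitale de la France.",
    candidates=bm25_candidates,
    k=2,
)
print([ctx["text"] for ctx in record["hard_negative_ctxs"]])
```

Filtering on the answer string is what makes these negatives "hard": they rank high for the question lexically but cannot be used to answer it.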
+ ### Evaluation
+
+ We use the FQuADv1.0 and French-SQuAD evaluation sets.
+
+ ## Training Script
+ We use the official [Facebook DPR implementation](https://github.com/facebookresearch/DPR) with a slight modification: by default, the code works with RoBERTa models, but we changed a single line to make it easier to use with CamemBERT. The modified code is available [here](https://github.com/psorianom/DPR).
+
+ ### Hyperparameters
+
+ ```shell
+ python -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py \
+ --max_grad_norm 2.0 \
+ --encoder_model_type fairseq_roberta \
+ --pretrained_file data/camembert-base \
+ --seed 12345 \
+ --sequence_length 256 \
+ --warmup_steps 1237 \
+ --batch_size 16 \
+ --do_lower_case \
+ --train_file ./data/DPR_FR_train.json \
+ --dev_file ./data/DPR_FR_dev.json \
+ --output_dir ./output/ \
+ --learning_rate 2e-05 \
+ --num_train_epochs 35 \
+ --dev_batch_size 16 \
+ --val_av_rank_start_epoch 30 \
+ --pretrained_model_cfg ./data/camembert-base/
+ ```
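To put these flags in context, the short calculation below derives the effective batch size and the share of training spent in warmup. It is a back-of-the-envelope sketch: it assumes that `--batch_size` is applied per GPU process and that the last partial batch of each epoch still counts as a step, neither of which is stated in this README.

```python
import math

# Values taken from the command above and from the Training section.
num_train_questions = 90_562
num_gpus = 8          # --nproc_per_node
batch_size = 16       # --batch_size (assumed to be per GPU)
num_epochs = 35       # --num_train_epochs
warmup_steps = 1237   # --warmup_steps

effective_batch_size = num_gpus * batch_size
steps_per_epoch = math.ceil(num_train_questions / effective_batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps

print(f"effective batch size: {effective_batch_size}")
print(f"steps per epoch:      {steps_per_epoch}")
print(f"total steps:          {total_steps}")
print(f"warmup fraction:      {warmup_fraction:.3f}")
```

Under these assumptions the warmup covers roughly 5% of all optimizer steps.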
+
+ ## Evaluation results
+ We obtain the following results on the FQuAD and SQuAD-FR evaluation (validation) sets. To compute them, we use [haystack's evaluation script](https://github.com/deepset-ai/haystack/blob/db4151bbc026f27c6d709fefef1088cd3f1e18b9/tutorials/Tutorial5_Evaluation.py) (**we report retrieval results only**).
+
+ ### DPR
+
+ #### FQuAD v1.0 Evaluation
+ ```shell
+ For 2764 out of 3184 questions (86.81%), the answer was in the top-20 candidate passages selected by the retriever.
+ Retriever Recall: 0.87
+ Retriever Mean Avg Precision: 0.57
+ ```
+ #### SQuAD-FR Evaluation
+ ```shell
+ For 8945 out of 10018 questions (89.29%), the answer was in the top-20 candidate passages selected by the retriever.
+ Retriever Recall: 0.89
+ Retriever Mean Avg Precision: 0.63
+ ```
+
+ ### BM25
+
+ For reference, BM25 obtains the results shown below. As in the original paper, BM25 consistently outperforms DPR on SQuAD-like datasets.
+
+ #### FQuAD v1.0 Evaluation
+ ```shell
+ For 2966 out of 3184 questions (93.15%), the answer was in the top-20 candidate passages selected by the retriever.
+ Retriever Recall: 0.93
+ Retriever Mean Avg Precision: 0.74
+ ```
+ #### SQuAD-FR Evaluation
+ ```shell
+ For 9353 out of 10018 questions (93.36%), the answer was in the top-20 candidate passages selected by the retriever.
+ Retriever Recall: 0.93
+ Retriever Mean Avg Precision: 0.77
+ ```
+
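For readers unfamiliar with the two retrieval metrics reported above, here is a small sketch of how top-k recall and mean average precision can be computed from per-question relevance flags. It uses one common definition of average precision and toy data; it is not taken from haystack's evaluation script, whose exact formula may differ.

```python
from typing import List


def top_k_recall(hits: List[List[bool]]) -> float:
    """Fraction of questions with at least one relevant passage retrieved."""
    return sum(any(flags) for flags in hits) / len(hits)


def average_precision(flags: List[bool]) -> float:
    """Mean of precision@rank over the ranks that hold a relevant passage."""
    found = 0
    precisions = []
    for rank, is_relevant in enumerate(flags, start=1):
        if is_relevant:
            found += 1
            precisions.append(found / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(hits: List[List[bool]]) -> float:
    return sum(average_precision(flags) for flags in hits) / len(hits)


# Toy data: relevance flags of the ranked passages for three questions.
hits = [
    [True, False, False],   # relevant passage ranked first
    [False, True, False],   # relevant passage ranked second
    [False, False, False],  # no relevant passage retrieved
]
print(round(top_k_recall(hits), 2))
print(round(mean_average_precision(hits), 2))

# The percentage quoted for DPR on FQuAD follows from the raw counts.
print(round(100 * 2764 / 3184, 2))
```

Recall only asks whether a relevant passage appears in the top-k, while mean average precision also rewards ranking it near the top, which is why the two numbers can differ widely for the same retriever.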
+ ## Usage
+
+ The results reported here were obtained with the `haystack` library. To get similar embeddings using only the Hugging Face `transformers` library, you can do the following:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+
+ model_name = "etalab-ia/dpr-ctx_encoder-fr_qa-camembert"
+ query = "Salut, mon chien est-il mignon ?"
+ tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=True)
+ model = AutoModel.from_pretrained(model_name, return_dict=True)
+ model.eval()
+
+ # Tokenize the text and use the pooled output as its embedding.
+ input_ids = tokenizer(query, return_tensors="pt")["input_ids"]
+ with torch.no_grad():
+     embeddings = model(input_ids).pooler_output
+ print(embeddings)
+ ```
+
+ With `haystack` (using `transformers-3.3.1`), we use the model as a retriever (**note that we load it from a local path**):
+ ```python
+ # `document_store` is assumed to be an existing haystack document store.
+ retriever = DensePassageRetriever(
+     document_store=document_store,
+     query_embedding_model="./etalab-ia/dpr-question_encoder-fr_qa-camembert",
+     passage_embedding_model="./etalab-ia/dpr-ctx_encoder-fr_qa-camembert",
+     use_gpu=True,
+     embed_title=False,
+     batch_size=16,
+     use_fast_tokenizers=False,
+ )
+ ```
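Once question and passage embeddings are available, DPR ranks passages by the dot product between the question vector and each passage vector. The sketch below illustrates that scoring step with hypothetical 3-dimensional vectors; real embeddings would come from the question encoder and from this context encoder and have a much higher dimension.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(question_vec: Sequence[float],
                  passage_vecs: Sequence[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs, highest score first."""
    scores = [(i, dot(question_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings, for illustration only.
question_vec = [0.2, 0.9, -0.1]
passage_vecs = [
    [0.1, 0.8, 0.0],
    [-0.5, 0.1, 0.3],
    [0.3, 0.4, -0.2],
]
ranking = rank_passages(question_vec, passage_vecs)
print([index for index, _ in ranking])
```

At scale this exhaustive loop is usually replaced by a vector index, but the score being maximized is the same.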
+ ## Acknowledgements
+
+ This work was performed using HPC resources from GENCI–IDRIS (Grant 2020-AD011011224).
+
+
+ ## Citations
+
+ ### Datasets
+
+ #### PIAF
+ ```
+ @inproceedings{KeraronLBAMSSS20,
+   author    = {Rachel Keraron and
+                Guillaume Lancrenon and
+                Mathilde Bras and
+                Fr{\'{e}}d{\'{e}}ric Allary and
+                Gilles Moyse and
+                Thomas Scialom and
+                Edmundo{-}Pavel Soriano{-}Morales and
+                Jacopo Staiano},
+   title     = {Project {PIAF:} Building a Native French Question-Answering Dataset},
+   booktitle = {{LREC}},
+   pages     = {5481--5490},
+   publisher = {European Language Resources Association},
+   year      = {2020}
+ }
+ ```
+
+ #### FQuAD
+ ```
+ @article{dHoffschmidt2020FQuADFQ,
+   title={FQuAD: French Question Answering Dataset},
+   author={Martin d'Hoffschmidt and Maxime Vidal and Wacim Belblidia and Tom Brendl{\'e} and Quentin Heinrich},
+   journal={ArXiv},
+   year={2020},
+   volume={abs/2002.06071}
+ }
+ ```
+
+ #### SQuAD-FR
+ ```
+ @MISC{kabbadj2018,
+   author = "Kabbadj, Ali",
+   title = "Something new in French Text Mining and Information Extraction (Universal Chatbot): Largest Q&A French training dataset (110 000+) ",
+   editor = "linkedin.com",
+   month = "November",
+   year = "2018",
+   url = "\url{https://www.linkedin.com/pulse/something-new-french-text-mining-information-chatbot-largest-kabbadj/}",
+   note = "[Online; posted 11-November-2018]",
+ }
+ ```
+ ### Models
+
+ #### CamemBERT
+ HF model card: [https://huggingface.co/camembert-base](https://huggingface.co/camembert-base)
+
+ ```
+ @inproceedings{martin2020camembert,
+   title={CamemBERT: a Tasty French Language Model},
+   author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
+   booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
+   year={2020}
+ }
+ ```
+
+ #### DPR
+
+ ```
+ @misc{karpukhin2020dense,
+   title={Dense Passage Retrieval for Open-Domain Question Answering},
+   author={Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih},
+   year={2020},
+   eprint={2004.04906},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```
+
+