---
language: fr
datasets:
- piaf
- FQuAD
- SQuAD-FR
---

# dpr-question_encoder-fr_qa-camembert

## Description

French [DPR model](https://arxiv.org/abs/2004.04906) using [CamemBERT](https://arxiv.org/abs/1911.03894) as base and then fine-tuned on a combination of three French Q&A datasets.

## Data

### French Q&A

We use a combination of three French Q&A datasets:

1. [PIAFv1.1](https://www.data.gouv.fr/en/datasets/piaf-le-dataset-francophone-de-questions-reponses/)
2. [FQuADv1.0](https://fquad.illuin.tech/)
3. [SQuAD-FR (SQuAD automatically translated to French)](https://github.com/Alikabbadj/French-SQuAD)

### Training

We use 90 562 random questions for `train` and 22 391 for `dev`. No question in `train` appears in `dev`. For each question, we have a single `positive_context` (the paragraph where the answer to this question is found) and around 30 `hard_negative_contexts`. Hard negative contexts are found by querying an ES instance (via BM25 retrieval) and keeping the top-k candidates **that do not contain the answer**.

The files are available [here](https://drive.google.com/file/d/1W5Jm3sqqWlsWsx2sFpA39Ewn33PaLQ7U/view?usp=sharing).

### Evaluation

We use the FQuADv1.0 and French-SQuAD evaluation sets.

## Training Script

We use the official [Facebook DPR implementation](https://github.com/facebookresearch/DPR) with a slight modification: by default, the code can work with RoBERTa models, but we changed a single line to make it easier to work with CamemBERT. This modification can be found [over here](https://github.com/psorianom/DPR).
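
As a rough illustration of the hard-negative selection described in the Training subsection above, the sketch below takes passages that are already ranked (for instance by BM25) and keeps the first k that do not contain the answer. It is not the script we used: the function names `select_hard_negatives` and `build_record` are hypothetical, the case-insensitive substring test is an assumption, and the record keys are assumed to follow the JSON training format of the DPR repository, so they should be checked against it.

```python
from typing import Dict, List


def select_hard_negatives(ranked_passages: List[str], answer: str, k: int = 30) -> List[str]:
    """Return the first k ranked passages that do not contain the answer.

    Assumption: "contains the answer" is a case-insensitive substring match.
    """
    answer_lc = answer.lower()
    negatives: List[str] = []
    for passage in ranked_passages:
        if len(negatives) >= k:
            break  # enough hard negatives collected
        if answer_lc in passage.lower():
            continue  # the passage contains the answer, so it is not a negative
        negatives.append(passage)
    return negatives


def build_record(question: str, answer: str, positive: str,
                 ranked_passages: List[str], k: int = 30) -> Dict:
    """Assemble one training record (key names are an assumption, see above)."""
    hard_negatives = select_hard_negatives(ranked_passages, answer, k)
    return {
        "question": question,
        "answers": [answer],
        "positive_ctxs": [{"title": "", "text": positive}],
        "hard_negative_ctxs": [{"title": "", "text": p} for p in hard_negatives],
    }


# Toy ranked list standing in for the output of a BM25 query.
ranked = [
    "Paris est la capitale de la France.",
    "Lyon est une ville française située sur le Rhône.",
    "La France est un pays d'Europe de l'Ouest.",
    "La tour Eiffel se trouve à Paris.",
]
record = build_record(
    question="Quelle est la capitale de la France ?",
    answer="Paris",
    positive=ranked[0],
    ranked_passages=ranked,
    k=2,
)
print(len(record["hard_negative_ctxs"]))
```

Filtering after ranking keeps the negatives lexically close to the question, which is what makes them "hard" for the encoder.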

### Hyperparameters

```shell
python -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py \
  --max_grad_norm 2.0 --encoder_model_type hf_bert --pretrained_file data/bert-base-multilingual-uncased \
  --seed 12345 --sequence_length 256 --warmup_steps 1237 --batch_size 16 --do_lower_case \
  --train_file DPR_FR_train.json \
  --dev_file ./data/100_hard_neg_ctxs/DPR_FR_dev.json \
  --output_dir ./output/bert --learning_rate 2e-05 --num_train_epochs 35 \
  --dev_batch_size 16 --val_av_rank_start_epoch 25 \
  --pretrained_model_cfg ./data/bert-base-multilingual-uncased
```

## Evaluation results

We obtain the following results on the FQuAD and SQuAD-FR evaluation (or validation) sets. To obtain these results, we use [haystack's evaluation script](https://github.com/deepset-ai/haystack/blob/db4151bbc026f27c6d709fefef1088cd3f1e18b9/tutorials/Tutorial5_Evaluation.py) (**we report Retrieval results only**).

### DPR

#### FQuAD v1.0 Evaluation

```shell
For 2764 out of 3184 questions (86.81%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.87
Retriever Mean Avg Precision: 0.57
```

#### SQuAD-FR Evaluation

```shell
For 8945 out of 10018 questions (89.29%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.89
Retriever Mean Avg Precision: 0.63
```

### BM25

For reference, BM25 gets the results shown below. As in the original paper, on SQuAD-like datasets the results of DPR are consistently surpassed by those of BM25.

#### FQuAD v1.0 Evaluation

```shell
For 2966 out of 3184 questions (93.15%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.93
Retriever Mean Avg Precision: 0.74
```

#### SQuAD-FR Evaluation

```shell
For 9353 out of 10018 questions (93.36%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.93
Retriever Mean Avg Precision: 0.77
```

## Usage

The results reported here are obtained with the `haystack` library. To get similar embeddings using only the HF `transformers` library, you can do the following:

```python
from transformers import AutoTokenizer, AutoModel

query = "Salut, mon chien est-il mignon ?"

# Load the tokenizer and the question encoder from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("etalab-ia/dpr-question_encoder-fr_qa-camembert", do_lower_case=True)
input_ids = tokenizer(query, return_tensors='pt')["input_ids"]
model = AutoModel.from_pretrained("etalab-ia/dpr-question_encoder-fr_qa-camembert", return_dict=True)

# The pooled output is used as the question embedding
embeddings = model(input_ids).pooler_output
print(embeddings)
```

And with `haystack`, we use it as a retriever:

```python
# Import `DensePassageRetriever` from haystack first; the import path depends
# on the installed haystack version.
# `document_store` and `dpr_model_tag` must be defined beforehand.
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="etalab-ia/dpr-question_encoder-fr_qa-camembert",
    passage_embedding_model="etalab-ia/dpr-ctx_encoder-fr_qa-camembert",
    model_version=dpr_model_tag,
    infer_tokenizer_classes=True,
)
```

## Acknowledgments

This work was performed using HPC resources from GENCI–IDRIS (Grant 2020-AD011011224).

## Citations

### Datasets

#### PIAF

```
@inproceedings{KeraronLBAMSSS20,
  author    = {Rachel Keraron and
               Guillaume Lancrenon and
               Mathilde Bras and
               Fr{\'{e}}d{\'{e}}ric Allary and
               Gilles Moyse and
               Thomas Scialom and
               Edmundo{-}Pavel Soriano{-}Morales and
               Jacopo Staiano},
  title     = {Project {PIAF:} Building a Native French Question-Answering Dataset},
  booktitle = {{LREC}},
  pages     = {5481--5490},
  publisher = {European Language Resources Association},
  year      = {2020}
}
```

#### FQuAD

```
@article{dHoffschmidt2020FQuADFQ,
  title={FQuAD: French Question Answering Dataset},
  author={Martin d'Hoffschmidt and Maxime Vidal and Wacim Belblidia and Tom Brendl{\'e} and Quentin Heinrich},
  journal={ArXiv},
  year={2020},
  volume={abs/2002.06071}
}
```

#### SQuAD-FR

```
@MISC{kabbadj2018,
  author = "Kabbadj, Ali",
  title = "Something new in French Text Mining and Information Extraction (Universal Chatbot): Largest Q&A French training dataset (110 000+) ",
  editor = "linkedin.com",
  month = "November",
  year = "2018",
  url = "\url{https://www.linkedin.com/pulse/something-new-french-text-mining-information-chatbot-largest-kabbadj/}",
  note = "[Online; posted 11-November-2018]",
}
```

### Models

#### CamemBERT

HF model card: [https://huggingface.co/camembert-base](https://huggingface.co/camembert-base)

```
@inproceedings{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}
```

#### DPR

```
@misc{karpukhin2020dense,
  title={Dense Passage Retrieval for Open-Domain Question Answering},
  author={Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih},
  year={2020},
  eprint={2004.04906},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```