---
language: fr
datasets:
- etalab-ia/piaf
- fquad
- lincoln/newsquadfr
- pragnakalp/squad_v2_french_translated
widget:
- text: Combien de personnes utilisent le français tous les jours ?
context: >-
Le français est une langue indo-européenne de la famille des langues romanes
dont les locuteurs sont appelés francophones. Elle est parfois surnommée la
langue de Molière. Le français est parlé, en 2023, sur tous les continents
par environ 321 millions de personnes : 235 millions l'emploient
quotidiennement et 90 millions en sont des locuteurs natifs. En 2018, 80
millions d'élèves et étudiants s'instruisent en français dans le monde.
Selon l'Organisation internationale de la francophonie (OIF), il pourrait y
avoir 700 millions de francophones sur Terre en 2050.
license: cc-by-4.0
metrics:
- f1
- exact_match
library_name: transformers
pipeline_tag: question-answering
co2_eq_emissions: 200
---
# Model Card for QAmembert-large
## Model Description
We present **QAmemBERT-large**, a [CamemBERT large](https://huggingface.co/camembert/camembert-large) model fine-tuned for the question-answering task in French on four French Q&A datasets. These datasets contain contexts and questions whose answers lie inside the context (SQuAD 1.0 format) as well as contexts and questions whose answers do not (SQuAD 2.0 format).
All these datasets were concatenated into a single dataset that we called [frenchQA](https://huggingface.co/datasets/CATIE-AQ/frenchQA).
This represents a total of **221,348 context/question/answer triplets used to fine-tune this model and 6,376 to test it**.
Our methodology is described in a blog post available in [English](https://blog.vaniila.ai/en/QA_en/) or [French](https://blog.vaniila.ai/QA/).
## Datasets
| Dataset | Format | Train split | Dev split | Test split |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| [piaf](https://www.data.gouv.fr/en/datasets/piaf-le-dataset-francophone-de-questions-reponses/)| SQuAD 1.0 | 9 224 Q & A | X | X |
| piaf_v2| SQuAD 2.0 | 9 224 Q & A | X | X |
| [fquad](https://fquad.illuin.tech/)| SQuAD 1.0 | 20 731 Q & A | 3 188 Q & A (not used in training because it serves as a test dataset) | 2 189 Q & A (not used in our work because not freely available)|
| fquad_v2 | SQuAD 2.0 | 20 731 Q & A | 3 188 Q & A (not used in training because it serves as a test dataset) | X |
| [lincoln/newsquadfr](https://huggingface.co/datasets/lincoln/newsquadfr) | SQuAD 1.0 | 1 650 Q & A | 455 Q & A (not used in our work) | X |
| lincoln/newsquadfr_v2 | SQuAD 2.0 | 1 650 Q & A | 455 Q & A (not used in our work) | X |
| [pragnakalp/squad_v2_french_translated](https://huggingface.co/datasets/pragnakalp/squad_v2_french_translated)| SQuAD 2.0 | 79 069 Q & A | X | X |
| pragnakalp/squad_v2_french_translated_v2| SQuAD 2.0 | 79 069 Q & A | X | X |
## Evaluation results
The evaluation was carried out using the [**evaluate**](https://pypi.org/project/evaluate/) Python package.
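For reference, the SQuAD 1.0 scores below are token-level F1 and exact match. A simplified sketch of these two metrics (omitting the official normalization of punctuation and articles that the `evaluate` implementation applies):

```python
# Simplified SQuAD 1.0 metrics: exact match and token-overlap F1.
# Note: the official script also strips punctuation and articles.
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def f1_score(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(f1_score("235 millions de personnes", "235 millions"), 3))  # 0.667
```

A prediction with extra tokens is thus partially credited by F1 while scoring 0 on exact match.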
### FQuaD 1.0 (validation)
The metric used is SQuAD 1.0.
| Model | Exact_match | F1-score |
| ----------- | ----------- | ----------- |
| [etalab-ia/camembert-base-squadFR-fquad-piaf](https://huggingface.co/etalab-ia/camembert-base-squadFR-fquad-piaf) | 53.60 | 78.09 |
| QAmembert (previous version) | 54.26 | 77.87 |
| [QAmembert (version on HF)](https://huggingface.co/CATIE-AQ/QAmembert) | 53.98 | 78.00 |
| QAmembert-large | **55.95** | **81.05** |
### qwant/squad_fr (validation)
The metric used is SQuAD 1.0.
| Model | Exact_match | F1-score |
| ----------- | ----------- | ----------- |
| [etalab-ia/camembert-base-squadFR-fquad-piaf](https://huggingface.co/etalab-ia/camembert-base-squadFR-fquad-piaf) | 60.17 | 78.27 |
| QAmembert (previous version) | 60.40 | 77.27 |
| [QAmembert (version on HF)](https://huggingface.co/CATIE-AQ/QAmembert) | 60.95 | 77.30 |
| QAmembert-large | **65.58** | **81.74** |
### frenchQA
This dataset includes questions whose answer is not in the context. The metric used is SQuAD 2.0.
| Model | Exact_match | F1-score | Answer_f1 | NoAnswer_f1 |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| [etalab-ia/camembert-base-squadFR-fquad-piaf](https://huggingface.co/etalab-ia/camembert-base-squadFR-fquad-piaf) | n/a | n/a | n/a | n/a |
| QAmembert (previous version) | 60.28 | 71.29 | 75.92 | 66.65 |
| [QAmembert (version on HF)](https://huggingface.co/CATIE-AQ/QAmembert) | **77.14** | 86.88 | 75.66 | 98.11 |
| QAmembert-large | **77.14** | **88.74** | **78.83** | **98.65** |
## Usage
### Example with answer in the context
```python
from transformers import pipeline
qa = pipeline('question-answering', model='CATIE-AQ/QAmembert-large', tokenizer='CATIE-AQ/QAmembert-large')
result = qa({
'question': "Combien de personnes utilisent le français tous les jours ?",
'context': "Le français est une langue indo-européenne de la famille des langues romanes dont les locuteurs sont appelés francophones. Elle est parfois surnommée la langue de Molière. Le français est parlé, en 2023, sur tous les continents par environ 321 millions de personnes : 235 millions l'emploient quotidiennement et 90 millions en sont des locuteurs natifs. En 2018, 80 millions d'élèves et étudiants s'instruisent en français dans le monde. Selon l'Organisation internationale de la francophonie (OIF), il pourrait y avoir 700 millions de francophones sur Terre en 2050."
})
if result['score'] < 0.01:
    print("La réponse n'est pas dans le contexte fourni.")
else:
    print(result['answer'])
```
```python
235 millions
```
```python
# details
result
{'score': 0.9876325726509094,
'start': 268,
'end': 281,
'answer': ' 235 millions'}
```
### Example with answer not in the context
```python
from transformers import pipeline
qa = pipeline('question-answering', model='CATIE-AQ/QAmembert-large', tokenizer='CATIE-AQ/QAmembert-large')
result = qa({
'question': "Quel est le meilleur vin du monde ?",
    'context': "La tour Eiffel est une tour de fer puddlé de 330 m de hauteur (avec antennes) située à Paris, à l’extrémité nord-ouest du parc du Champ-de-Mars en bordure de la Seine dans le 7e arrondissement. Son adresse officielle est 5, avenue Anatole-France. Construite en deux ans par Gustave Eiffel et ses collaborateurs pour l'Exposition universelle de Paris de 1889, célébrant le centenaire de la Révolution française, et initialement nommée « tour de 300 mètres », elle est devenue le symbole de la capitale française et un site touristique de premier plan : il s’agit du quatrième site culturel français payant le plus visité en 2016, avec 5,9 millions de visiteurs. Depuis son ouverture au public, elle a accueilli plus de 300 millions de visiteurs."
})
if result['score'] < 0.01:
    print("La réponse n'est pas dans le contexte fourni.")
else:
    print(result['answer'])
```
```python
La réponse n'est pas dans le contexte fourni.
```
```python
# details
result
{'score': 1.1262776822285048e-10,
'start': 735,
'end': 746,
'answer': 'visiteurs.'}
```
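The 0.01 score threshold used in both examples is a heuristic; the post-processing can be factored into a small helper, shown here on the two result dicts printed above (no model download needed):

```python
# Sketch of the score-threshold post-processing used in the examples above.
# The two result dicts are copied verbatim from the pipeline outputs shown
# in this card; the 0.01 threshold is the heuristic used in the examples.
def format_answer(result: dict, threshold: float = 0.01) -> str:
    """Return the extracted answer, or a fallback message when the score is too low."""
    if result["score"] < threshold:
        return "La réponse n'est pas dans le contexte fourni."
    return result["answer"].strip()

answered = {"score": 0.9876325726509094, "start": 268, "end": 281, "answer": " 235 millions"}
unanswered = {"score": 1.1262776822285048e-10, "start": 735, "end": 746, "answer": "visiteurs."}

print(format_answer(answered))    # 235 millions
print(format_answer(unanswered))  # La réponse n'est pas dans le contexte fourni.
```

As an alternative to a manual threshold, the `transformers` question-answering pipeline also accepts `handle_impossible_answer=True`, which returns an empty answer string when the model prefers the "no answer" option.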
## Environmental Impact
*Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware type, runtime, cloud provider, and compute region were used to estimate the carbon impact.*
- **Hardware Type:** A100 PCIe 40/80GB
- **Hours used:** 11h and 12min
- **Cloud Provider:** Private Infrastructure
- **Carbon Efficiency (kg/kWh):** 0.076 kg (estimated from [electricitymaps](https://app.electricitymaps.com/zone/FR); we use the average carbon intensity in France for March 2023, as data for the training days themselves are not available)
- **Carbon Emitted** *(Power consumption x Time x Carbon produced based on location of power grid)*: 0.20 kg eq. CO2
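As a back-of-the-envelope check of the figure above, the formula can be evaluated directly. The 0.25 kW average draw is an assumption (the TDP of an A100 PCIe is around 250 W); the hours and carbon intensity are taken from this card:

```python
# Back-of-the-envelope check of the reported 0.20 kg CO2 eq.
power_kw = 0.25        # assumed average draw of one A100 PCIe (~250 W TDP)
hours = 11 + 12 / 60   # 11 h 12 min of training (from this card)
intensity = 0.076      # kg CO2 eq per kWh (France, March 2023, from this card)

emissions_kg = power_kw * hours * intensity
print(round(emissions_kg, 2))  # 0.21, consistent with the reported 0.20 kg
```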
## Citations
### QAmemBERT
```
@misc {qamembert2023,
author = { {ALBAR, Boris and BEDU, Pierre and BOURDOIS, Loïck} },
organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },
title = { QAmembert (Revision 9685bc3) },
year = 2023,
url = { https://huggingface.co/CATIE-AQ/QAmembert-large },
doi = { 10.57967/hf/0821 },
publisher = { Hugging Face }
}
```
### PIAF
```
@inproceedings{KeraronLBAMSSS20,
author = {Rachel Keraron and
Guillaume Lancrenon and
Mathilde Bras and
Fr{\'{e}}d{\'{e}}ric Allary and
Gilles Moyse and
Thomas Scialom and
Edmundo{-}Pavel Soriano{-}Morales and
Jacopo Staiano},
title = {Project {PIAF:} Building a Native French Question-Answering Dataset},
booktitle = {{LREC}},
pages = {5481--5490},
publisher = {European Language Resources Association},
year = {2020}
}
```
### FQuAD
```
@article{dHoffschmidt2020FQuADFQ,
title={FQuAD: French Question Answering Dataset},
  author={Martin d'Hoffschmidt and Maxime Vidal and Wacim Belblidia and Tom Brendlé and Quentin Heinrich},
journal={ArXiv},
year={2020},
volume={abs/2002.06071}
}
```
### lincoln/newsquadfr
```
Hugging Face repository: https://huggingface.co/datasets/lincoln/newsquadfr
```
### pragnakalp/squad_v2_french_translated
```
Hugging Face repository: https://huggingface.co/datasets/pragnakalp/squad_v2_french_translated
```
### CamemBERT
```
@inproceedings{martin2020camembert,
title={CamemBERT: a Tasty French Language Model},
author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year={2020}
}
```
## License
[cc-by-4.0](https://creativecommons.org/licenses/by/4.0/deed.en)