File size: 7,528 Bytes
e48f7c4 d176ee5 faa57aa 30b5481 faa57aa 30b5481 faa57aa 30b5481 faa57aa 30b5481 faa57aa 30b5481 faa57aa e5724d2 dd06887 e5724d2 dd06887 e5724d2 e48f7c4 37954a2 e48f7c4 c8fc841 e48f7c4 3e82502 e48f7c4 f2ef822 e48f7c4 e786812 e48f7c4 2eefecf e48f7c4 c4d1d73 e48f7c4 2646972 005bca6 e48f7c4 ea1eeed e48f7c4 005bca6 e48f7c4 2646972 e48f7c4 2646972 e48f7c4 b46f080 6d69e11 e48f7c4 005bca6 e48f7c4 005bca6 e48f7c4 2646972 e48f7c4 2eefecf 4ea17ec e48f7c4 be072ad e48f7c4 005bca6 03bc643 005bca6 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 |
---
language:
- fr
model-index:
- name: asi/gpt-fr-cased-base
results:
- task:
type: text-generation
name: Wikitext-fr
dataset:
type: wikitext_fr
name: Wikitext-fr
metrics:
- type: perplexity
value: 109.2
name: Perplexity
- task:
type: text-classification
name: FLUE
dataset:
type: flue
name: CLS-Books
split: CLS
metrics:
- type: accuracy
value: 88.3
name: Accuracy
- task:
type: text-classification
name: FLUE
dataset:
type: flue
name: CLS-Dvd
split: CLS
metrics:
- type: accuracy
value: 86.9
name: Accuracy
- task:
type: text-classification
name: FLUE
dataset:
type: flue
name: CLS-Music
split: CLS
metrics:
- type: accuracy
value: 89.3
name: Accuracy
- task:
type: text-classification
name: FLUE
dataset:
type: flue
name: PAWS-X
split: PAWS-X
metrics:
- type: accuracy
value: 83.3
name: Accuracy
- task:
type: text-classification
name: FLUE
dataset:
type: flue
name: XNLI
split: XNLI
metrics:
- type: accuracy
value: 75.6
name: Accuracy
- task:
type: summarization
name: OrangeSum
dataset:
type: orange_sum
name: OrangeSum-Abstract
split: abstract
metrics:
- name: ROUGE-1
type: rouge
value: 17.5
- name: ROUGE-2
type: rouge
value: 3.1
- name: ROUGE-L
type: rouge
value: 12.1
- task:
type: summarization
name: OrangeSum
dataset:
type: orange_sum
name: OrangeSum-Title
split: title
metrics:
- name: ROUGE-1
type: rouge
value: 13.9
- name: ROUGE-2
type: rouge
value: 2.3
- name: ROUGE-L
type: rouge
value: 9.7
tags:
- tf
- pytorch
- gpt2
- text-generation
license: apache-2.0
thumbnail: https://raw.githubusercontent.com/AntoineSimoulin/gpt-fr/main/imgs/logo.png
---
<img src="https://raw.githubusercontent.com/AntoineSimoulin/gpt-fr/main/imgs/logo.png" width="200">
## Model description
**GPT-fr** 🇫🇷 is a GPT model for French developped by [Quantmetry](https://www.quantmetry.com/) and the [Laboratoire de Linguistique Formelle (LLF)](http://www.llf.cnrs.fr/en). We train the model on a very large and heterogeneous French corpus. We release the weights for the following configurations:
| Model name | Number of layers | Attention Heads | Embedding Dimension | Total Parameters |
| :------: | :---: | :---: | :---: | :---: |
| `gpt-fr-cased-small` | 12 | 12 | 768 | 124 M |
| `gpt-fr-cased-base` | 24 | 14 | 1,792 | 1,017 B |
## Intended uses & limitations
The model can be leveraged for language generation tasks. Besides, many tasks may be formatted such that the output is directly generated in natural language. Such configuration may be used for tasks such as automatic summary or question answering. We do hope our model might be used for both academic and industrial applications.
#### How to use
The model might be used through the astonishing 🤗 `Transformers` librairie:
```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel
# Load pretrained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("asi/gpt-fr-cased-small")
tokenizer = GPT2Tokenizer.from_pretrained("asi/gpt-fr-cased-small")
# Generate a sample of text
model.eval()
input_sentence = "Longtemps je me suis couché de bonne heure."
input_ids = tokenizer.encode(input_sentence, return_tensors='pt')
beam_outputs = model.generate(
input_ids,
max_length=100,
do_sample=True,
top_k=50,
top_p=0.95,
num_return_sequences=1
)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_outputs[0], skip_special_tokens=True))
```
#### Limitations and bias
Large language models tend to replicate the biases found in pre-training datasets, such as gender discrimination or offensive content generation.
To limit exposition to too much explicit material, we carefully choose the sources beforehand. This process — detailed in our paper — aims to limit offensive content generation from the model without performing manual and arbitrary filtering.
However, some societal biases, contained in the data, might be reflected by the model. For example on gender equality, we generated the following sentence sequence "Ma femme/Mon mari vient d'obtenir un nouveau poste. A partir de demain elle/il sera \_\_\_\_\_\_\_" and observed the model generated distinct positions given the subject gender. We used top-k random sampling strategy with k=50 and stopped at the first punctuation element.
The positions generated for the wife is '_femme de ménage de la maison_' while the position for the husband is '_à la tête de la police_'. We do appreciate your feedback to better qualitatively and quantitatively assess such effects.
## Training data
We created a dedicated corpus to train our generative model. Indeed the model uses a fixed-length context size of 1,024 and require long documents to be trained. We aggregated existing corpora: [Wikipedia](https://dumps.wikimedia.org/frwiki/), [OpenSubtitle](http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2016/mono/) ([Tiedemann, 2012](#tiedemann-2012)), [Gutenberg](http://www.gutenberg.org). Corpora are filtered and separated into sentences. Successive sentences are then concatenated within the limit of 1,024 tokens per document.
## Training procedure
We pre-trained the model on a TPU v2-8 using the amazing [Google Colab](https://colab.research.google.com) inter-server.
## Eval results
We packaged **GPT-fr** with a dedicated language model evaluation benchmark.
In line with the [WikiText](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark in English, we collected over 70 million tokens from the set of verified [good](https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Articles_de_qualit%C3%A9) and [featured](https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Bons_articles) articles on French Wikipedia. The model reaches a zero-shot perplexity of **109.2** on the test set.
### BibTeX entry and citation info
Along with the model hosted by HuggingFace transformers library, we maintain a [git repository](https://github.com/AntoineSimoulin/gpt-fr).
If you use **GPT-fr** for your scientific publications or your industrial applications, please cite the following paper:
```bibtex
@inproceedings{simoulin:hal-03265900,
TITLE = {{Un mod{\`e}le Transformer G{\'e}n{\'e}ratif Pr{\'e}-entrain{\'e} pour le \_\_\_\_\_\_ fran{\c c}ais}},
AUTHOR = {Simoulin, Antoine and Crabb{\'e}, Benoit},
URL = {https://hal.archives-ouvertes.fr/hal-03265900},
BOOKTITLE = {{Traitement Automatique des Langues Naturelles}},
ADDRESS = {Lille, France},
EDITOR = {Denis, Pascal and Grabar, Natalia and Fraisse, Amel and Cardon, R{\'e}mi and Jacquemin, Bernard and Kergosien, Eric and Balvet, Antonio},
PUBLISHER = {{ATALA}},
PAGES = {246-255},
YEAR = {2021},
KEYWORDS = {fran{\c c}ais. ; GPT ; G{\'e}n{\'e}ratif ; Transformer ; Pr{\'e}-entra{\^i}n{\'e}},
PDF = {https://hal.archives-ouvertes.fr/hal-03265900/file/7.pdf},
HAL_ID = {hal-03265900},
HAL_VERSION = {v1},
}
```
### References
><div name="tiedemann-2012">Jörg Tiedemann: Parallel Data, Tools and Interfaces in OPUS. LREC 2012: 2214-2218</div> |