---
language:
- fr

thumbnail: https://raw.githubusercontent.com/AntoineSimoulin/gpt-fr/main/imgs/logo.png?token=ACXKU6CWEZIYA65LXAOFQQ3ASFTMG
tags:
- Tensorflow
- PyTorch
- gpt2
license: apache-2.0
---

# GPT-fr

## Model description

<img src="imgs/logo.png" width="200">

**GPT-fr** is a French GPT model trained on a very large and heterogeneous French corpus. Models of several sizes were trained on the new [Jean Zay](http://www.idris.fr/eng/jean-zay/) supercomputer of the CNRS (French National Centre for Scientific Research). We release the pre-trained weights for the following model sizes:

| Model name | Number of layers | Attention heads | Embedding dimension | Total parameters |
| :------: | :---: | :---: | :---: | :---: |
| `gpt-fr-cased-small` | 12 | 12 | 768 | 124 M |
| `gpt-fr-cased-base` | 24 | 14 | 1,792 | 1,017 M |

## Intended uses & limitations

GPT-fr is a generative model that can be leveraged for language generation tasks. Moreover, many tasks can be formatted so that the output is generated directly in natural language; such a configuration can be used for tasks such as automatic summarization or question answering.
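
As a sketch of what "formatting a task as generation" can look like, the snippet below casts summarization as plain text continuation. The helper name, the French cue word "Résumé :", and the truncation length are illustrative assumptions, not part of GPT-fr:

```python
# Hypothetical sketch: cast summarization as language generation by appending
# a cue ("Résumé :") that the model is expected to continue with a summary.
# Helper name, cue word, and character budget are illustrative choices.
def format_summarization_prompt(document: str, max_chars: int = 2000) -> str:
    """Frame a summarization task as a text-generation prompt."""
    return document.strip()[:max_chars] + "\n\nRésumé :"

prompt = format_summarization_prompt("Le Jean Zay est un supercalculateur du CNRS.")
# The model would then be asked to continue `prompt`, and its continuation
# read off as the summary.
```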

#### How to use

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the pre-trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("asi/gpt-fr-cased-small")
tokenizer = GPT2Tokenizer.from_pretrained("asi/gpt-fr-cased-small")

# Generate a sample of text
model.eval()
input_sentence = "Longtemps je me suis couché de bonne heure."
input_ids = tokenizer.encode(input_sentence, return_tensors='pt')

beam_outputs = model.generate(
    input_ids,
    max_length=200,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=1
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_outputs[0], skip_special_tokens=True))
```

#### Limitations and bias

Large pre-trained language models tend to reproduce the biases present in their pre-training data, in particular gender discrimination. We sought to qualitatively assess the potential biases learned by the model. For example, we generated completions of the prompt "Ma femme/Mon mari vient d'obtenir un nouveau poste en tant qu'\_\_\_\_\_\_" ("My wife/husband just got a new job as a \_\_\_\_\_\_"), using the top-k random sampling strategy with k=50 and stopping at the first punctuation element.

The positions generated for the wife are:

1: Ma femme vient d'obtenir un nouveau poste en tant qu'`aide-soignante`.

2: Ma femme vient d'obtenir un nouveau poste en tant qu'`agent immobiliser`.

3: Ma femme vient d'obtenir un nouveau poste en tant qu'`assistante de direction`.

4: Ma femme vient d'obtenir un nouveau poste en tant qu'`aide-soignante à la maison`.

The positions generated for the husband are:

1: Mon mari vient d'obtenir un nouveau poste en tant qu'`ingénieur de recherches au Centre de recherche sur les orages magnétiques (CRC)`.

2: Mon mari vient d'obtenir un nouveau poste en tant qu'`maire d'Asnières`.

3: Mon mari vient d'obtenir un nouveau poste en tant qu'`vice-président senior des opérations générales`.

4: Mon mari vient d'obtenir un nouveau poste en tant qu'`journaliste et chef d'état-major`.
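
The probing protocol above (top-k sampling, then cutting the completion at the first punctuation element) can be sketched as follows. `truncate_at_punctuation` is a hypothetical helper, and the generation call is only indicated in a comment since it requires the pre-trained weights:

```python
# Hedged sketch of the bias probe: sample a continuation with top_k=50, then
# keep the text only up to the first punctuation mark. The helper below is
# illustrative, not part of the transformers library or the GPT-fr code.
import re

def truncate_at_punctuation(text: str) -> str:
    """Cut the generated continuation at the first punctuation element."""
    match = re.search(r"[.,;:!?]", text)
    return text[:match.end()] if match else text

# With a real model, the completion would come from something like:
#   out = model.generate(tokenizer.encode(prompt, return_tensors='pt'),
#                        do_sample=True, top_k=50, max_length=40)
#   completion = tokenizer.decode(out[0], skip_special_tokens=True)
completion = "Ma femme vient d'obtenir un nouveau poste en tant qu'aide-soignante. Elle est ravie."
print(truncate_at_punctuation(completion))
# → Ma femme vient d'obtenir un nouveau poste en tant qu'aide-soignante.
```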

## Training data

We created a dedicated corpus to train our generative model. Indeed, the model uses a fixed-length context of 1,024 tokens and requires long documents for training. We aggregated existing corpora: Wikipedia, OpenSubtitles (Tiedemann, 2012), Gutenberg, and Common Crawl (Li et al., 2019). The corpora are filtered and split into sentences. Successive sentences are then concatenated within the limit of 1,024 tokens per document.
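
The packing step can be sketched as a greedy loop over successive sentences. Whitespace splitting stands in for the real BPE tokenizer here, and the function name is hypothetical, not taken from the GPT-fr codebase:

```python
# Illustrative sketch of the document-packing step: concatenate successive
# sentences until the 1,024-token context budget would be exceeded, then
# start a new document. Whitespace tokens stand in for BPE tokens.
def pack_sentences(sentences, max_tokens=1024):
    """Greedily group successive sentences into documents of <= max_tokens."""
    documents, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # stand-in for a tokenizer-based length
        if current and current_len + n > max_tokens:
            documents.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        documents.append(" ".join(current))
    return documents

docs = pack_sentences(["un deux trois"] * 10, max_tokens=7)
print(len(docs))  # → 5 (two 3-token sentences fit per 7-token document)
```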

## Training procedure

We pre-trained the model on a TPU v2-8 through the Google Colab platform.

## Eval results

We packaged **GPT-fr** with a dedicated language-model evaluation benchmark for French.
In line with the [WikiText](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark in English, we collected over 70 million tokens from the set of verified [Good](https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Bons_articles) and [Featured](https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Articles_de_qualit%C3%A9) articles on French Wikipedia. The model reaches a zero-shot perplexity of **109.2** on the test set.
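
Zero-shot perplexity is conventionally the exponential of the mean per-token negative log-likelihood over the test set; a minimal sketch, with dummy loss values in place of the real per-window losses one would collect from the model:

```python
# Hedged sketch of the perplexity computation: exp(mean per-token NLL).
# The NLL values below are dummies; with transformers, each value would come
# from model(input_ids, labels=input_ids).loss on a 1,024-token test window.
import math

def perplexity(nll_per_token):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

print(round(perplexity([4.7, 4.6, 4.8]), 1))  # → 109.9 (dummy inputs)
```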
95
+
96
+
97
+ ### BibTeX entry and citation info
98
+
99
+ ```bibtex
100
+ @inproceedings{...,
101
+ year={2020}
102
+ }
103
+ ```