julien-c (HF staff) committed 17998e3 (parent: cadbd46)

Migrate model card from transformers-repo


Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/antoiloui/belgpt2/README.md

Files changed (1): README.md (added, +53 −0)
---
language: "fr"
---

# BelGPT-2

**BelGPT-2** (*Belgian GPT-2* 🇧🇪) is a "small" GPT-2 model pre-trained on a very large and heterogeneous French corpus (around 60 GB). See [antoiloui/gpt2-french](https://github.com/antoiloui/gpt2-french) for more information about the pre-trained model, the data, the code to use the model, and the code to pre-train it.

## Using BelGPT-2 for Text Generation in French

You can use BelGPT-2 with the [🤗 transformers](https://github.com/huggingface/transformers) library as follows:

```python
import random

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the pre-trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("antoiloui/belgpt2")
tokenizer = GPT2Tokenizer.from_pretrained("antoiloui/belgpt2")

# Generate a sample of text, starting from a randomly chosen BOS token
model.eval()
output = model.generate(
    bos_token_id=random.randint(1, 50000),
    do_sample=True,
    top_k=50,
    max_length=100,
    top_p=0.95,
    num_return_sequences=1,
)

# Decode the generated token ids back into text
decoded_output = []
for sample in output:
    decoded_output.append(tokenizer.decode(sample, skip_special_tokens=True))
print(decoded_output)
```
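
For quick experiments, the same checkpoint can also be wrapped in the generic transformers `pipeline` API. This is a minimal sketch rather than part of the original card, and the French prompt is an arbitrary example:

```python
from transformers import pipeline

# Build a text-generation pipeline around the same checkpoint
generator = pipeline("text-generation", model="antoiloui/belgpt2")

# Sample a continuation for a prompt, reusing the sampling
# settings from the example above
print(generator(
    "Hier, j'ai visité Bruxelles et",  # hypothetical example prompt
    max_length=50,
    do_sample=True,
    top_k=50,
    top_p=0.95,
))
```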

## Data

Below is the list of all French corpora used to pre-train the model:

| Dataset | `$corpus_name` | Raw size | Cleaned size |
| :------ | :------------- | :------: | :----------: |
| CommonCrawl | `common_crawl` | 200.2 GB | 40.4 GB |
| NewsCrawl | `news_crawl` | 10.4 GB | 9.8 GB |
| Wikipedia | `wiki` | 19.4 GB | 4.1 GB |
| Wikisource | `wikisource` | 4.6 GB | 2.3 GB |
| Project Gutenberg | `gutenberg` | 1.3 GB | 1.1 GB |
| EuroParl | `europarl` | 289.9 MB | 278.7 MB |
| NewsCommentary | `news_commentary` | 61.4 MB | 58.1 MB |
| **Total** | | **236.3 GB** | **57.9 GB** |