yhavinga committed on
Commit
50a124d
·
1 Parent(s): e0dfc71

Update README.md

  1. README.md +35 -12
README.md CHANGED
@@ -14,11 +14,25 @@ datasets:
14
  ---
15
  # GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱
16
 
17
- A GPT2 medium sized model (345M parameters) trained from scratch on Dutch, with perplexity 15.2 on cleaned Dutch mC4.
18
 
19
  ## Tokenizer
20
 
21
- * Tokenizer trained from scratch for Dutch on mC4 nl cleaned with scripts from the Huggingface
22
  Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
23
 
24
  ## Dataset
@@ -34,22 +48,31 @@ which is the original mC4, except
34
  * Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
35
  "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
36
 
37
- ## Training details
38
 
39
- * Trained for 320K of 520K steps (61%, 20B tokens)
40
- * Block size: 512
41
- * Optimizer: adam, lr 8e-4, beta1 0.9, beta2 0.98
42
- * Warmup steps: 5000
43
- * Weight decay: 0.01
44
 
45
  ## Acknowledgements
46
 
47
  This project would not have been possible without compute generously provided by Google through the
48
  [TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem was also
49
- instrumental in many, if not all parts of the training. The following repositories were helpful in setting up the TPU-VM,
50
- and getting an idea what sensible hyper-parameters are for training gpt2 from scratch.
51
 
52
- * [t5-flax-gcp repository](https://github.com/gsarti/t5-flax-gcp)
 
53
  * [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian)
54
  * [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-indonesian)
55
- * [language model training examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
 
 
14
  ---
15
  # GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱
16
 
17
+ A GPT2 medium-sized model (345M parameters) trained from scratch on Dutch, with perplexity 15.1 on cleaned Dutch mC4.
18
+
19
+ ## How To Use
20
+
21
+ You can use this GPT2 model directly with a pipeline for text generation.
22
+
23
+ ```python
+ from transformers import pipeline, GPT2Tokenizer, GPT2LMHeadModel
+
+ MODEL_DIR = 'yhavinga/gpt2-medium-dutch'
+ # Load the tokenizer and the model weights from the Hugging Face Hub.
+ tokenizer = GPT2Tokenizer.from_pretrained(MODEL_DIR)
+ model = GPT2LMHeadModel.from_pretrained(MODEL_DIR)
+ generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
+
+ # Sample a continuation; max_length counts the prompt tokens as well.
+ generated_text = generator('Wat is de zin van het leven?', max_length=100, do_sample=True, top_k=40, top_p=0.95, repetition_penalty=2.0)
+ ```
32
 
33
  ## Tokenizer
34
 
35
+ * BPE tokenizer trained from scratch for Dutch on the cleaned Dutch mC4 dataset, using scripts from the Hugging Face
36
  Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
37
 
38
  ## Dataset
 
48
  * Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
49
  "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
50
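
The phrase-based removal rule above can be sketched as a simple substring filter. This is a minimal illustration, not the actual cleaning script: the names `BANNED_PHRASES` and `keep_document` are hypothetical, and matching on lower-cased text is an assumption.

```python
# Phrases copied from the list above. Case-insensitive substring matching
# is an assumption; the real cleaning script may match differently.
BANNED_PHRASES = (
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
)


def keep_document(text: str) -> bool:
    """Return True if the document contains none of the banned phrases."""
    lowered = text.lower()
    return not any(phrase in lowered for phrase in BANNED_PHRASES)


docs = [
    "Amsterdam is de hoofdstad van Nederland.",
    "This site uses cookies to improve your experience.",
    "Lees ons Privacy Policy voor meer informatie.",
]
kept = [doc for doc in docs if keep_document(doc)]
print(kept)
```

Only the first example document passes the filter; the other two contain "uses cookies" and "privacy policy" respectively.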
 
51
+ ## Models
52
+
53
+ TL;DR: [yhavinga/gpt2-medium-dutch](https://huggingface.co/yhavinga/gpt2-medium-dutch) is the best model.
54
+
55
+ * `yhavinga/gpt-neo-125M-dutch` was trained on a fraction of C4 containing only Wikipedia and news sites.
56
+ * Models with `a`/`b` in the steps column were trained up to step `a` of a total of `b` steps.
57
+
58
+ | | model | params | train seq len | ppl | loss | batch size | epochs | steps | optim | lr | duration | config |
+ |-----------------------------------------------------------------------------------|---------|--------|---------------|------|------|------------|--------|-----------------|-----------|--------|----------|-----------|
+ | [yhavinga/gpt-neo-125M-dutch](https://huggingface.co/yhavinga/gpt-neo-125M-dutch) | gpt neo | 125M | 512 | 19.9 | 2.99 | 128 | 8 | 558608 | adamw | 2.4e-3 | 1d 12h | news+wiki |
+ | [yhavinga/gpt2-medium-dutch](https://huggingface.co/yhavinga/gpt2-medium-dutch) | gpt2 | 345M | 512 | 15.1 | 2.71 | 128 | 4 | 320000/520502 | adafactor | 8e-4 | 7d 2h | full |
+ | [yhavinga/gpt2-large-dutch](https://huggingface.co/yhavinga/gpt2-large-dutch) | gpt2 | 762M | 512 | 15.1 | 2.72 | 32 | 1 | 1100000/2082009 | adafactor | 3.3e-5 | 8d 15h | large |
+ | [yhavinga/gpt-neo-1.3B-dutch](https://huggingface.co/yhavinga/gpt-neo-1.3B-dutch) | gpt neo | 1.3B | 512 | 16.0 | 2.77 | 16 | 1 | 960000/3049896 | adafactor | 5e-4 | 7d 11h | full |
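
The `ppl` and `loss` columns in the table above appear to be related by `ppl = exp(loss)`. The sketch below checks this, assuming the reported loss is the mean per-token cross-entropy in nats; that assumption is not stated in the README, but it agrees with the reported numbers to within rounding. The helper name `perplexity` is hypothetical.

```python
import math

# (model, reported loss, reported perplexity), copied from the table above.
reported = [
    ("yhavinga/gpt-neo-125M-dutch", 2.99, 19.9),
    ("yhavinga/gpt2-medium-dutch", 2.71, 15.1),
    ("yhavinga/gpt2-large-dutch", 2.72, 15.1),
    ("yhavinga/gpt-neo-1.3B-dutch", 2.77, 16.0),
]


def perplexity(loss: float) -> float:
    """Perplexity of a mean per-token cross-entropy loss given in nats (assumed)."""
    return math.exp(loss)


for name, loss, ppl in reported:
    print(f"{name}: exp({loss}) = {perplexity(loss):.2f} (reported ppl {ppl})")
```

Small differences between the computed and reported values are expected, because the table rounds the loss to two decimals.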
64

65
 
66
  ## Acknowledgements
67
 
68
  This project would not have been possible without compute generously provided by Google through the
69
  [TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem was also
70
+ instrumental in most, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM,
71
+ and training the models:
72
 
73
+ * [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
74
+ * [HuggingFace Flax MLM examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
75
  * [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian)
76
  * [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-indonesian)
77
+
78
+ Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)