Update README.md
README.md
---

# GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱

A GPT2 medium-sized model (345M parameters) trained from scratch on Dutch, with perplexity 15.1 on cleaned Dutch mC4.

## How To Use

You can use this GPT2 model directly with a pipeline for text generation.

```python
from transformers import pipeline, GPT2Tokenizer, GPT2LMHeadModel

MODEL_DIR = 'yhavinga/gpt2-medium-dutch'
tokenizer = GPT2Tokenizer.from_pretrained(MODEL_DIR)
model = GPT2LMHeadModel.from_pretrained(MODEL_DIR)
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

generated_text = generator('Wat is de zin van het leven?', max_length=100, do_sample=True,
                           top_k=40, top_p=0.95, repetition_penalty=2.0)
```
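
If you prefer to call the model directly instead of going through a pipeline, the sketch below reuses the `tokenizer` and `model` loaded above and passes the same prompt and sampling parameters to `model.generate`; the `pad_token_id` argument is only there to silence the warning GPT2 otherwise emits because it has no padding token.

```python
# Minimal sketch: generate text without the pipeline wrapper,
# reusing the tokenizer and model loaded in the example above.
input_ids = tokenizer.encode('Wat is de zin van het leven?', return_tensors='pt')
output_ids = model.generate(input_ids, max_length=100, do_sample=True,
                            top_k=40, top_p=0.95, repetition_penalty=2.0,
                            pad_token_id=tokenizer.eos_token_id)  # GPT2 has no pad token
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```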

## Tokenizer

* BPE tokenizer trained from scratch for Dutch on mC4 nl cleaned, with scripts from the HuggingFace
  Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling); a rough training sketch follows below.
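
For illustration, a tokenizer like this can be trained with the `tokenizers` library, roughly following those Flax examples. This is a sketch only: the public mC4 `nl` split is used here as stand-in training text instead of the cleaned variant, and the vocabulary size of 50257 is an assumption, not a confirmed detail of this model.

```python
from itertools import islice

from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

# Stream a slice of the public Dutch mC4 split as stand-in training text.
dataset = load_dataset('mc4', 'nl', split='train', streaming=True)
texts = (example['text'] for example in islice(dataset, 100_000))

# Byte-level BPE, GPT2-style; vocab size 50257 is assumed here.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(texts, vocab_size=50257, min_frequency=2,
                              special_tokens=['<|endoftext|>'])
tokenizer.save('dutch-gpt2-tokenizer.json')
```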

## Dataset

[...]

* Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
  "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
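
The removal rule above amounts to a substring blocklist. A minimal sketch of such a filter is shown below; it is illustrative only, and the actual cleaning scripts may differ in details such as case handling.

```python
# Phrases taken from the cleaning rule described above.
BAD_PHRASES = ["javascript", "lorum ipsum", "terms of use", "privacy policy",
               "cookie policy", "uses cookies", "use of cookies", "use cookies",
               "elementen ontbreken", "deze printversie"]

def keep_document(text: str) -> bool:
    """Return False if the document contains any of the blocklisted phrases."""
    lower = text.lower()
    return not any(phrase in lower for phrase in BAD_PHRASES)
```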

## Models

TL;DR: [yhavinga/gpt2-medium-dutch](https://huggingface.co/yhavinga/gpt2-medium-dutch) is the best model.

* `yhavinga/gpt-neo-125M-dutch` is trained on a fraction of C4 containing only Wikipedia and news sites.
* The models with `a`/`b` in the steps column have been trained to step `a` of a total of `b` steps.

| model                                                                             | type    | params | train seq len | ppl  | loss | batch size | epochs | steps           | optim     | lr     | duration | config    |
|-----------------------------------------------------------------------------------|---------|--------|---------------|------|------|------------|--------|-----------------|-----------|--------|----------|-----------|
| [yhavinga/gpt-neo-125M-dutch](https://huggingface.co/yhavinga/gpt-neo-125M-dutch) | gpt neo | 125M   | 512           | 19.9 | 2.99 | 128        | 8      | 558608          | adamw     | 2.4e-3 | 1d 12h   | news+wiki |
| [yhavinga/gpt2-medium-dutch](https://huggingface.co/yhavinga/gpt2-medium-dutch)   | gpt2    | 345M   | 512           | 15.1 | 2.71 | 128        | 4      | 320000/520502   | adafactor | 8e-4   | 7d 2h    | full      |
| [yhavinga/gpt2-large-dutch](https://huggingface.co/yhavinga/gpt2-large-dutch)     | gpt2    | 762M   | 512           | 15.1 | 2.72 | 32         | 1      | 1100000/2082009 | adafactor | 3.3e-5 | 8d 15h   | large     |
| [yhavinga/gpt-neo-1.3B-dutch](https://huggingface.co/yhavinga/gpt-neo-1.3B-dutch) | gpt neo | 1.3B   | 512           | 16.0 | 2.77 | 16         | 1      | 960000/3049896  | adafactor | 5e-4   | 7d 11h   | full      |
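
The ppl and loss columns are two views of the same measurement: assuming the loss is the mean per-token cross-entropy reported by the training script, perplexity is simply its exponential, and the rounded losses above reproduce the ppl column to within about 0.1.

```python
import math

# Perplexity = exp(mean cross-entropy loss); losses below are the rounded table values.
for name, loss in [('gpt-neo-125M-dutch', 2.99), ('gpt2-medium-dutch', 2.71),
                   ('gpt2-large-dutch', 2.72), ('gpt-neo-1.3B-dutch', 2.77)]:
    print(f'{name}: exp({loss}) = {math.exp(loss):.2f}')
```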
## Acknowledgements

This project would not have been possible without compute generously provided by Google through the
[TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem was also
instrumental in most, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM
and training the models:

* [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
* [HuggingFace Flax MLM examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
* [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian)
* [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-indonesian)

Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)