A large German GPT2.

Also check out GerPT2, a small version of this model.

See the GPT2 model card for considerations on limitations and bias. See the GPT2 documentation for details on GPT2.

Comparison to dbmdz/german-gpt2

I evaluated both GerPT2-large and the other German GPT2, dbmdz/german-gpt2, on the CC-100 dataset and on the German Wikipedia:

Model               CC-100 (PPL)   Wikipedia (PPL)
dbmdz/german-gpt2   49.47          62.92
GerPT2              24.78          35.33
GerPT2-large        16.08          23.26

See the script in the GerPT2 GitHub repository for the code.
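
For readers unfamiliar with the metric, here is a minimal sketch of how perplexity (PPL) is commonly defined: the exponential of the mean negative log-likelihood per token. The function name `perplexity` and the toy input are illustrative assumptions. This is not the evaluation script from the repository, which may differ in details such as how the text is windowed or normalized.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy input (made up): a model that assigns probability 1/4 to each of four tokens.
uniform = [math.log(0.25)] * 4
ppl = perplexity(uniform)
print(ppl)  # close to 4: the model is as uncertain as a uniform choice among 4 tokens
```

Lower is better: a perplexity of 16 roughly means the model is, on average, as uncertain as if it were choosing uniformly among 16 tokens.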


from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("benjamin/gerpt2-large")
model = AutoModelForCausalLM.from_pretrained("benjamin/gerpt2-large")

prompt = "<your prompt>"

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe(prompt)[0]["generated_text"])

Also, two tricks might improve the generated text:

import torch

output = model.generate(
    # during training an EOS token was used to mark the beginning of each text
    # so it can help to insert it at the start
    torch.tensor([tokenizer.eos_token_id] + tokenizer.encode(prompt)).unsqueeze(0),
    # try setting bad_words_ids=[[0]] to disallow generating an EOS token, without this the model is
    # prone to ending generation early because a significant number of texts from the training corpus
    # is quite short
    bad_words_ids=[[0]],
    max_length=100,  # illustrative value, adjust as needed
)[0]
print(tokenizer.decode(output))

Training details

GerPT2-large was trained on the entire German data (67GB) from the CC-100 corpus, with weights initialized from the English GPT2 model. The training setup was:

  • a batch size of 256
  • a OneCycle learning rate schedule with a maximum of 5e-3
  • AdamW with a weight decay of 0.01
  • 2 epochs

Training took roughly 12 days on 8 TPUv3 cores.
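
To illustrate the OneCycle learning rate schedule mentioned above, here is a minimal pure-Python sketch. Only the maximum of 5e-3 comes from this card. The warm-up fraction, the cosine shape and the two divisor values are assumptions borrowed from the defaults of PyTorch's `OneCycleLR`, and `one_cycle_lr` is a hypothetical helper, not the schedule code used for the actual run.

```python
import math


def one_cycle_lr(step, total_steps, max_lr=5e-3, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` for a simplified one-cycle schedule (assumed shape)."""
    initial_lr = max_lr / div_factor        # starting learning rate (assumption)
    min_lr = initial_lr / final_div_factor  # final learning rate (assumption)
    warmup_steps = max(1, int(pct_start * total_steps))

    def cos_anneal(start, end, pct):
        # Cosine interpolation from `start` (pct=0) to `end` (pct=1).
        return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))

    if step < warmup_steps:
        # Phase 1: ramp up from initial_lr to max_lr.
        return cos_anneal(initial_lr, max_lr, step / warmup_steps)
    # Phase 2: anneal from max_lr down to min_lr.
    decay_steps = max(1, total_steps - 1 - warmup_steps)
    pct = min(1.0, (step - warmup_steps) / decay_steps)
    return cos_anneal(max_lr, min_lr, pct)


# Hypothetical run length of 1000 steps, chosen only for illustration.
schedule = [one_cycle_lr(s, 1000) for s in range(1000)]
```

The learning rate rises to the maximum during the first part of training and then decays to a very small value, which is the general idea behind one-cycle schedules.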

To train GerPT2-large, follow these steps. Scripts are located in the GitHub repository:

  1. Download and unzip training data from
  2. Train a tokenizer using prepare/ As training data for the tokenizer I used a random subset of 5% of the CC-100 data.
  3. (Optional) Generate a German input embedding matrix with prepare/ This uses a neat trick to semantically map tokens from the English tokenizer to tokens from the German tokenizer using aligned word embeddings. For example:
ĠMinde -> Ġleast
Ġjed -> Ġwhatsoever
flughafen -> Air
vermittlung -> employment
teilung -> ignment
ĠInterpretation -> Ġinterpretation
Ġimport -> Ġimported
hansa -> irl
genehmigungen -> exempt
ĠAuflist -> Ġlists
Ġverschwunden -> Ġdisappeared
ĠFlyers -> ĠFlyers
Kanal -> Channel
Ġlehr -> Ġteachers
Ġnahelie -> Ġconvenient
gener -> Generally
mitarbeiter -> staff

This helped a lot in a trial run I did, although I wasn't able to do a full comparison due to budget and time constraints. To use this WTE matrix, pass it to the training script via wte_path. Credit to this blog post for the idea of initializing GPT2 from English weights.
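
The token-mapping idea can be sketched as follows. This is a toy illustration under stated assumptions, not the script from the repository: the helper names (`cosine`, `map_tokens`, `init_german_wte`), the nearest-neighbour rule based on cosine similarity, and all vectors are made up for the example. The assumption is that each German token is paired with its closest English token in an aligned embedding space and then starts out with that English token's GPT2 input embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_tokens(german_vecs, english_vecs):
    """Pair each German token with the most similar English token."""
    return {
        de_tok: max(english_vecs, key=lambda en_tok: cosine(de_vec, english_vecs[en_tok]))
        for de_tok, de_vec in german_vecs.items()
    }


def init_german_wte(mapping, english_wte):
    """Copy the English input-embedding row of the mapped token for each German token."""
    return {de_tok: list(english_wte[en_tok]) for de_tok, en_tok in mapping.items()}


# Made-up 2-d "aligned" word vectors and 3-d "GPT2" embeddings, for illustration only.
german_vecs = {"Kanal": [0.9, 0.1], "mitarbeiter": [0.1, 0.95]}
english_vecs = {"Channel": [1.0, 0.0], "staff": [0.0, 1.0]}
english_wte = {"Channel": [0.5, -0.2, 0.1], "staff": [-0.3, 0.4, 0.7]}

mapping = map_tokens(german_vecs, english_vecs)
german_wte = init_german_wte(mapping, english_wte)
print(mapping)
```

With real data the vectors would come from aligned German and English word embeddings, and the resulting matrix would have one row per token of the German tokenizer.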

  4. Tokenize the corpus using prepare/ This generates files for train and validation tokens in JSON Lines format.
  5. Run the training script! shows how this was executed for the full run with config configs/tpu_large.json.
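
As a rough illustration of the JSON Lines format mentioned in the tokenization step, the sketch below writes and reads token sequences with one JSON value per line. The exact record layout produced by the repository's script is not documented here, so the "one list of token ids per line" layout, the helper names and the sample ids are all assumptions.

```python
import io
import json


def write_jsonl(token_sequences, fh):
    """Write one JSON-encoded list of token ids per line."""
    for ids in token_sequences:
        fh.write(json.dumps(ids) + "\n")


def read_jsonl(fh):
    """Yield one decoded record per non-empty line."""
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory buffer standing in for a train or validation file; ids are made up.
buffer = io.StringIO()
write_jsonl([[0, 15, 27], [0, 8]], buffer)
buffer.seek(0)
round_trip = list(read_jsonl(buffer))
print(round_trip)
```

Because every line is an independent record, files like this can be streamed during training without loading the whole corpus into memory.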


GerPT2-large is licensed under the MIT License.


Thanks to Hugging Face for awesome tools and infrastructure. Huge thanks to Artus Krohn-Grimberghe at LYTiQ for making this possible by sponsoring the resources used for training.
