A small German GPT2.
Also check out GerPT2-large, a large version of this model.
Comparison to dbmdz/german-gpt2
|CC-100 (PPL)||Wikipedia (PPL)|
See the script evaluate.py in the GerPT2 GitHub repository for the code.
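
For readers unfamiliar with the PPL columns: perplexity is conventionally the exponential of the mean negative log-likelihood per token, and lower is better. The snippet below is a minimal sketch of that standard definition only. The function name and the toy log-probabilities are illustrative assumptions; the actual evaluation lives in evaluate.py and may differ in details such as windowing or how tokens are weighted.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy example with made-up values: a model that assigns probability 0.25
# to every token is as uncertain as a uniform choice among 4 tokens.
log_probs = [math.log(0.25)] * 8
print(perplexity(log_probs))  # approximately 4
```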
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("benjamin/gerpt2")
model = AutoModelForCausalLM.from_pretrained("benjamin/gerpt2")

prompt = "<your prompt>"

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
# the pipeline returns a list with one dict per generated sequence
print(pipe(prompt)[0]["generated_text"])
Also, two tricks might improve the generated text:
import torch

max_length = 100  # example value; pick whatever length you need

output = model.generate(
    # during training an EOS token was used to mark the beginning of each text
    # so it can help to insert it at the start
    torch.tensor(
        [tokenizer.eos_token_id] + tokenizer.encode(prompt)
    ).unsqueeze(0),
    do_sample=True,
    # try setting bad_words_ids to the EOS token id to disallow generating an EOS token;
    # without this the model is prone to ending generation early because a significant
    # number of texts from the training corpus are quite short
    bad_words_ids=[[tokenizer.eos_token_id]],
    max_length=max_length,
)[0]  # generate() returns a batch; take the single sequence
print(tokenizer.decode(output))
GerPT2 was trained with:
- a batch size of 256
- a OneCycle learning rate schedule with a maximum of 5e-3
- AdamW with a weight decay of 0.01
- 7 epochs
Training took roughly 6 days on 8 TPUv3 cores.
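
To give a feel for what a OneCycle learning rate with a maximum of 5e-3 looks like, here is a minimal pure-Python sketch of a one-cycle schedule with cosine ramps. Only max_lr=5e-3 comes from the settings listed here. The function name, the warm-up fraction and the two divisor values are assumptions (they mirror the defaults of PyTorch's OneCycleLR), and total_steps is a made-up number, so this is not necessarily the exact curve used for the run.

```python
import math


def one_cycle_lr(step, total_steps, max_lr=5e-3, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` (0-based) for a one-cycle schedule.

    max_lr is taken from the training settings; every other default
    is an assumption made for illustration.
    """
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_steps = max(1, int(pct_start * total_steps))

    def cosine(start, end, fraction):
        # Moves from `start` (fraction=0) to `end` (fraction=1) along a half cosine.
        return end + (start - end) * (1 + math.cos(math.pi * fraction)) / 2

    if step < warmup_steps:
        # Warm-up phase: ramp from initial_lr up to max_lr.
        return cosine(initial_lr, max_lr, step / warmup_steps)
    # Annealing phase: decay from max_lr down to min_lr.
    decay_steps = max(1, total_steps - warmup_steps - 1)
    return cosine(max_lr, min_lr, (step - warmup_steps) / decay_steps)


total_steps = 1000  # hypothetical; the real number of steps is not given here
for step in (0, 150, 300, 650, 999):
    print(step, one_cycle_lr(step, total_steps))
```

The learning rate starts low, peaks at 5e-3 once warm-up ends, and then decays to a value far below where it started. In a training loop the returned value would be assigned to the optimizer's learning rate at every step.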
To train GerPT2, follow these steps. The scripts are located in the GitHub repository:
- Download and unzip training data from http://data.statmt.org/cc-100/.
- Train a tokenizer using prepare/train_tokenizer.py. As training data for the tokenizer I used a random subset of 5% of the CC-100 data.
- (optionally) generate a German input embedding matrix with prepare/generate_aligned_wte.py. This uses a neat trick to semantically map tokens from the English tokenizer to tokens from the German tokenizer using aligned word embeddings. E.g.:
ĠMinde -> Ġleast
Ġjed -> Ġwhatsoever
flughafen -> Air
vermittlung -> employment
teilung -> ignment
ĠInterpretation -> Ġinterpretation
Ġimport -> Ġimported
hansa -> irl
genehmigungen -> exempt
ĠAuflist -> Ġlists
Ġverschwunden -> Ġdisappeared
ĠFlyers -> ĠFlyers
Kanal -> Channel
Ġlehr -> Ġteachers
Ġnahelie -> Ġconvenient
gener -> Generally
mitarbeiter -> staff
This helped a lot in a trial run I did, although I wasn't able to do a full comparison due to budget and time constraints. To use this WTE matrix, pass it to the training script via wte_path. Credit to this blogpost for the idea of initializing GPT2 from English weights.
- Tokenize the corpus using prepare/tokenize_text.py. This generates files for train and validation tokens in JSON Lines format.
- Run the training script
run.sh shows how this was executed for the full run with config
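
The optional embedding step (prepare/generate_aligned_wte.py) can be pictured as a nearest-neighbor lookup in a shared vector space. The sketch below is an illustration under stated assumptions, not the script itself: each German token is matched to the English token whose aligned word vector is most similar, and the German token's input embedding is then initialized from that English token's row. All function names, the toy vectors and the toy embedding rows are hypothetical; only the two token pairs are taken from the mapping examples listed for that step.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def map_tokens(german_vectors, english_vectors):
    """Map each German token to its most similar English token."""
    mapping = {}
    for de_token, de_vec in german_vectors.items():
        mapping[de_token] = max(
            english_vectors,
            key=lambda en_token: cosine_similarity(de_vec, english_vectors[en_token]),
        )
    return mapping


def build_german_wte(mapping, english_wte):
    """Initialize each German token's embedding row from its mapped English token."""
    return {de: list(english_wte[en]) for de, en in mapping.items()}


# Toy aligned word vectors (made up for illustration, not real data).
german_vectors = {"Kanal": [0.9, 0.1], "mitarbeiter": [0.1, 0.8]}
english_vectors = {"Channel": [1.0, 0.0], "staff": [0.0, 1.0]}
# Toy stand-in for rows of the English GPT2 input embedding matrix.
english_wte = {"Channel": [0.5, -0.2, 0.3], "staff": [-0.1, 0.4, 0.7]}

mapping = map_tokens(german_vectors, english_vectors)
german_wte = build_german_wte(mapping, english_wte)
print(mapping)
print(german_wte)
```

A real vocabulary has tens of thousands of tokens, so a vectorized similarity search would replace the Python loop, but the idea stays the same.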
GerPT2 is licensed under the MIT License.