Greek (el) GPT2 model
By the Hellenic Army Academy (SSE) and the Technical University of Crete (TUC)
- language: el
- licence: apache-2.0
- dataset: ~23.4 GB of Greek corpora
- model: GPT2 (12-layer, 768-hidden, 12-heads, 117M parameters. OpenAI GPT-2 English model, finetuned for the Greek language)
- pre-processing: tokenization + BPE segmentation
- metrics: perplexity
Model description
A text generation (autoregressive) model, using Huggingface transformers and fastai based on the English GPT-2.
Finetuned with gradual layer unfreezing. This is a more efficient and sustainable alternative compared to training from scratch, especially for low-resource languages.
Based on the work of Thomas Dehaene (ML6) for the creation of a Dutch GPT2: https://colab.research.google.com/drive/1Y31tjMkB8TqKKFlZ5OJ9fcMp3p8suvs4?usp=sharing
How to use
from transformers import pipeline
model = "lighteternal/gpt2-finetuned-greek"
generator = pipeline(
'text-generation',
device=0,
model=f'{model}',
tokenizer=f'{model}')
text = "Μια φορά κι έναν καιρό"
print("\
".join([x.get("generated_text") for x in generator(
text,
max_length=len(text.split(" "))+15,
do_sample=True,
top_k=50,
repetition_penalty = 1.2,
add_special_tokens=False,
num_return_sequences=5,
temperature=0.95,
top_p=0.95)]))
Training data
We used a 23.4GB sample from a consolidated Greek corpus from CC100, Wikimatrix, Tatoeba, Books, SETIMES and GlobalVoices containing long senquences. This is a better version of our GPT-2 small model (https://huggingface.co/lighteternal/gpt2-finetuned-greek-small)
Metrics
Metric | Value |
---|---|
Train Loss | 3.67 |
Validation Loss | 3.83 |
Perplexity | 39.12 |
Acknowledgement
The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number:50, 2nd call)
Based on the work of Thomas Dehaene (ML6): https://blog.ml6.eu/dutch-gpt2-autoregressive-language-modelling-on-a-budget-cff3942dd020
- Downloads last month
- 221