---
language:
  - el
tags:
  - pytorch
  - causal-lm
widget:
  - text: Μια φορά κι έναν καιρό
license: apache-2.0
---

Greek (el) GPT2 model - small

By the Hellenic Army Academy (SSE) and the Technical University of Crete (TUC)

  • language: el
  • license: apache-2.0
  • dataset: ~5GB of Greek corpora
  • model: GPT2 (12-layer, 768-hidden, 12-heads, 117M parameters); the OpenAI GPT-2 English model, finetuned for the Greek language
  • pre-processing: tokenization + BPE segmentation (see the tokenizer sketch after this list)
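To make the pre-processing step concrete, here is a minimal sketch (assuming the tokenizer published with this repository) that loads the model's BPE tokenizer and shows how a Greek prompt is segmented into sub-word units:

from transformers import AutoTokenizer

# Load the BPE tokenizer that ships with the model repository.
tokenizer = AutoTokenizer.from_pretrained("lighteternal/gpt2-finetuned-greek-small")

text = "Μια φορά κι έναν καιρό"  # "Once upon a time"

# Inspect the BPE segmentation and the corresponding token ids.
print(tokenizer.tokenize(text))
print(tokenizer.encode(text))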

Model description

A text-generation (autoregressive) model built with Hugging Face Transformers and fastai, based on the English GPT-2 (small) and finetuned with gradual layer unfreezing. This is a more efficient and sustainable alternative to training from scratch, especially for low-resource languages. It is based on the work of Thomas Dehaene (ML6) on creating a Dutch GPT2: https://colab.research.google.com/drive/1Y31tjMkB8TqKKFlZ5OJ9fcMp3p8suvs4?usp=sharing
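For intuition, the snippet below sketches gradual layer unfreezing in plain PyTorch/Transformers. It is only an illustration of the technique: the original training used fastai, and the stage schedule and learning rate here are assumptions, not the actual recipe.

import torch
from transformers import GPT2LMHeadModel

# Start from the English GPT-2 small checkpoint (12 layers, 117M parameters).
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Freeze every parameter first.
for param in model.parameters():
    param.requires_grad = False

# Then unfreeze transformer blocks from the top down, a few more at each stage.
stages = [2, 4, 8, 12]  # top blocks trainable per stage -- illustrative, not the original schedule
for n_unfrozen in stages:
    for block in model.transformer.h[-n_unfrozen:]:
        for param in block.parameters():
            param.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=5e-5)  # fresh optimizer per stage
    # ... run causal-LM finetuning on the Greek corpus for this stage ...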

How to use

from transformers import pipeline

model = "lighteternal/gpt2-finetuned-greek-small"

# device=0 assumes a CUDA GPU; use device=-1 to run on CPU.
generator = pipeline(
    'text-generation',
    device=0,
    model=model,
    tokenizer=model)

text = "Μια φορά κι έναν καιρό"  # "Once upon a time"

# Sample 5 continuations and print them, one per line.
print("\n".join([x.get("generated_text") for x in generator(
    text,
    max_length=len(text.split(" ")) + 15,  # rough cap: max_length counts tokens, not words
    do_sample=True,
    top_k=50,
    repetition_penalty=1.2,
    add_special_tokens=False,
    num_return_sequences=5,
    temperature=0.95,
    top_p=0.95)]))
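A note on the sampling settings shown above: do_sample=True with top_k=50 and top_p=0.95 restricts sampling to the most likely tokens (top-k plus nucleus sampling), temperature=0.95 slightly flattens the distribution, repetition_penalty=1.2 discourages the model from repeating itself, and num_return_sequences=5 returns five independent continuations. These are example values, not tuned recommendations.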
    

Training data

We used a small (~5MB) sample from a consolidated Greek corpus drawn from CC100, WikiMatrix, Tatoeba, Books, SETIMES, and GlobalVoices.

BibTeX entry and citation info

Based on the work of Thomas Dehaene (ML6): https://blog.ml6.eu/dutch-gpt2-autoregressive-language-modelling-on-a-budget-cff3942dd020