lighteternal committed
Commit 496bab2
1 Parent(s): 504b253

Initial commit
GPT2el.png ADDED
README.md ADDED
@@ -0,0 +1,66 @@
---
language:
- el
tags:
- pytorch
- causal-lm
widget:
- text: "Μια φορά κι έναν καιρό"
license: apache-2.0
---

## Greek (el) GPT2 model - small
## By the Hellenic Army Academy (SSE) and the Technical University of Crete (TUC)

* language: el
* license: apache-2.0
* dataset: ~5GB of Greek corpora
* model: GPT2 (12-layer, 768-hidden, 12-heads, 117M parameters; the OpenAI GPT-2 English model, finetuned for the Greek language)
* pre-processing: tokenization + BPE segmentation

### Model description

A text generation (autoregressive) model, built with Huggingface transformers and fastai, based on the English GPT-2 (small).
It was finetuned with gradual layer unfreezing, which is a more efficient and sustainable alternative to training from scratch, especially for low-resource languages; a minimal sketch of the unfreezing idea is given below.
Based on the work of Thomas Dehaene (ML6) for the creation of a Dutch GPT-2: https://colab.research.google.com/drive/1Y31tjMkB8TqKKFlZ5OJ9fcMp3p8suvs4?usp=sharing
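The training code itself is not part of this repository; purely as a hypothetical illustration, gradual layer unfreezing with plain transformers/PyTorch could look like the sketch below (the actual fastai-based schedule used for this model may differ):

```python
from transformers import GPT2LMHeadModel

# Hypothetical sketch: start from the English GPT-2 small checkpoint and
# make transformer blocks trainable gradually between finetuning phases.
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Freeze all weights first.
for param in model.parameters():
    param.requires_grad = False

def unfreeze_last_n_blocks(model, n):
    """Make the last n transformer blocks (and the final layer norm) trainable."""
    for block in model.transformer.h[-n:]:
        for param in block.parameters():
            param.requires_grad = True
    for param in model.transformer.ln_f.parameters():
        param.requires_grad = True

# Progressively unfreeze deeper layers; each step would be followed by a
# finetuning phase on the Greek corpus (training loop omitted here).
for n_unfrozen in (1, 3, 6, 12):
    unfreeze_last_n_blocks(model, n_unfrozen)
    # ... finetune on Greek text for a few epochs ...
```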

### How to use

```python
from transformers import pipeline

model = "lighteternal/gpt2-finetuned-greek-small"

# device=0 expects a CUDA GPU; use device=-1 (or omit the argument) to run on CPU.
generator = pipeline(
    'text-generation',
    device=0,
    model=model,
    tokenizer=model)

text = "Μια φορά κι έναν καιρό"

# Sample 5 continuations; max_length is counted in tokens and is approximated
# here from the prompt's word count plus 15.
print("\n".join([x.get("generated_text") for x in generator(
    text,
    max_length=len(text.split(" ")) + 15,
    do_sample=True,
    top_k=50,
    repetition_penalty=1.2,
    add_special_tokens=False,
    num_return_sequences=5,
    temperature=0.95,
    top_p=0.95)]))
```
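If you prefer direct access to `generate()` instead of the pipeline helper, a roughly equivalent lower-level sketch (same repository name and sampling settings) would be:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "lighteternal/gpt2-finetuned-greek-small"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Encode the prompt and sample 5 continuations of roughly 15 extra tokens.
inputs = tokenizer("Μια φορά κι έναν καιρό", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=inputs["input_ids"].shape[1] + 15,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.95,
        repetition_penalty=1.2,
        num_return_sequences=5,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    )
print("\n".join(tokenizer.decode(o, skip_special_tokens=True) for o in outputs))
```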


## Training data

We used a small (~5MB) sample from a consolidated Greek corpus drawn from CC100, Wikimatrix, Tatoeba, Books, SETIMES and GlobalVoices.



### BibTeX entry and citation info
Based on the work of Thomas Dehaene (ML6): https://blog.ml6.eu/dutch-gpt2-autoregressive-language-modelling-on-a-budget-cff3942dd020
config.json ADDED
@@ -0,0 +1,32 @@
{
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "vocab_size": 50257
}
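As a quick sanity check (not part of the repository), the sizes in this config can be read back with transformers and compared against the figures in the model card:

```python
from transformers import AutoConfig

# Load the config above from the Hub and confirm the 12-layer / 768-hidden /
# 12-head architecture and GPT-2 vocabulary size quoted in the model card.
config = AutoConfig.from_pretrained("lighteternal/gpt2-finetuned-greek-small")
print(config.n_layer, config.n_embd, config.n_head)  # expected: 12 768 12
print(config.vocab_size)                             # expected: 50257
```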
merges.txt ADDED
The diff for this file is too large to render. See raw diff
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:07df6ef06501ec3bdb5cdd8d0cda532f2f4e9336c5246fdec21cbe059157f5bb
size 510404461
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{"bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>", "unk_token": "<|endoftext|>", "pad_token": "<|endoftext|>"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"pad_token": "<|endoftext|>", "special_tokens_map_file": null, "full_tokenizer_file": null}
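The two tokenizer files above simply reuse GPT-2's `<|endoftext|>` for every special token (including padding). A small, hypothetical check after loading the tokenizer:

```python
from transformers import AutoTokenizer

# Confirm that bos/eos/unk/pad all map to <|endoftext|>, then inspect the
# BPE segmentation of the example prompt from the model card.
tokenizer = AutoTokenizer.from_pretrained("lighteternal/gpt2-finetuned-greek-small")
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token, tokenizer.pad_token)
print(tokenizer.tokenize("Μια φορά κι έναν καιρό"))
```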
vocab.json ADDED
The diff for this file is too large to render. See raw diff