lighteternal committed
Commit 496bab2
1 Parent(s): 504b253

Initial commit
GPT2el.png ADDED
README.md ADDED
@@ -0,0 +1,66 @@
---
language:
- el
tags:
- pytorch
- causal-lm
widget:
- text: "Μια φορά κι έναν καιρό"
license: apache-2.0
---

## Greek (el) GPT2 model - small
## By the Hellenic Army Academy (SSE) and the Technical University of Crete (TUC)

* language: el
* license: apache-2.0
* dataset: ~5GB of Greek corpora
* model: GPT2 (12-layer, 768-hidden, 12-heads, 117M parameters; the OpenAI GPT-2 English model, finetuned for the Greek language)
* pre-processing: tokenization + BPE segmentation

### Model description

A text generation (autoregressive) model, built with Huggingface transformers and fastai, based on the English GPT-2 (small).
It was finetuned with gradual layer unfreezing, which is a more efficient and sustainable alternative to training from scratch, especially for low-resource languages; a minimal sketch of the unfreezing idea is given below.
Based on the work of Thomas Dehaene (ML6) for the creation of a Dutch GPT-2: https://colab.research.google.com/drive/1Y31tjMkB8TqKKFlZ5OJ9fcMp3p8suvs4?usp=sharing
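The training code itself is not part of this repository; purely as a hypothetical illustration, gradual layer unfreezing with plain transformers/PyTorch could look like the sketch below (the actual fastai-based schedule used for this model may differ):

```python
from transformers import GPT2LMHeadModel

# Hypothetical sketch: start from the English GPT-2 small checkpoint and
# make transformer blocks trainable gradually between finetuning phases.
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Freeze all weights first.
for param in model.parameters():
    param.requires_grad = False

def unfreeze_last_n_blocks(model, n):
    """Make the last n transformer blocks (and the final layer norm) trainable."""
    for block in model.transformer.h[-n:]:
        for param in block.parameters():
            param.requires_grad = True
    for param in model.transformer.ln_f.parameters():
        param.requires_grad = True

# Progressively unfreeze deeper layers; each step would be followed by a
# finetuning phase on the Greek corpus (training loop omitted here).
for n_unfrozen in (1, 3, 6, 12):
    unfreeze_last_n_blocks(model, n_unfrozen)
    # ... finetune on Greek text for a few epochs ...
```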

### How to use

```python
from transformers import pipeline

model = "lighteternal/gpt2-finetuned-greek-small"

# device=0 expects a CUDA GPU; use device=-1 (or omit the argument) to run on CPU.
generator = pipeline(
    'text-generation',
    device=0,
    model=model,
    tokenizer=model)

text = "Μια φορά κι έναν καιρό"

# Sample 5 continuations; max_length is counted in tokens and is approximated
# here from the prompt's word count plus 15.
print("\n".join([x.get("generated_text") for x in generator(
    text,
    max_length=len(text.split(" ")) + 15,
    do_sample=True,
    top_k=50,
    repetition_penalty=1.2,
    add_special_tokens=False,
    num_return_sequences=5,
    temperature=0.95,
    top_p=0.95)]))
```
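If you prefer direct access to `generate()` instead of the pipeline helper, a roughly equivalent lower-level sketch (same repository name and sampling settings) would be:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "lighteternal/gpt2-finetuned-greek-small"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Encode the prompt and sample 5 continuations of roughly 15 extra tokens.
inputs = tokenizer("Μια φορά κι έναν καιρό", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=inputs["input_ids"].shape[1] + 15,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.95,
        repetition_penalty=1.2,
        num_return_sequences=5,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    )
print("\n".join(tokenizer.decode(o, skip_special_tokens=True) for o in outputs))
```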


## Training data

We used a small (~5MB) sample from a consolidated Greek corpus drawn from CC100, Wikimatrix, Tatoeba, Books, SETIMES and GlobalVoices.



### BibTeX entry and citation info
Based on the work of Thomas Dehaene (ML6): https://blog.ml6.eu/dutch-gpt2-autoregressive-language-modelling-on-a-budget-cff3942dd020
config.json ADDED
@@ -0,0 +1,32 @@
{
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "vocab_size": 50257
}
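As a quick sanity check (not part of the repository), the sizes in this config can be read back with transformers and compared against the figures in the model card:

```python
from transformers import AutoConfig

# Load the config above from the Hub and confirm the 12-layer / 768-hidden /
# 12-head architecture and GPT-2 vocabulary size quoted in the model card.
config = AutoConfig.from_pretrained("lighteternal/gpt2-finetuned-greek-small")
print(config.n_layer, config.n_embd, config.n_head)  # expected: 12 768 12
print(config.vocab_size)                             # expected: 50257
```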
merges.txt ADDED
The diff for this file is too large to render. See raw diff
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:07df6ef06501ec3bdb5cdd8d0cda532f2f4e9336c5246fdec21cbe059157f5bb
size 510404461
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{"bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>", "unk_token": "<|endoftext|>", "pad_token": "<|endoftext|>"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"pad_token": "<|endoftext|>", "special_tokens_map_file": null, "full_tokenizer_file": null}
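The two tokenizer files above simply reuse GPT-2's `<|endoftext|>` for every special token (including padding). A small, hypothetical check after loading the tokenizer:

```python
from transformers import AutoTokenizer

# Confirm that bos/eos/unk/pad all map to <|endoftext|>, then inspect the
# BPE segmentation of the example prompt from the model card.
tokenizer = AutoTokenizer.from_pretrained("lighteternal/gpt2-finetuned-greek-small")
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token, tokenizer.pad_token)
print(tokenizer.tokenize("Μια φορά κι έναν καιρό"))
```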
vocab.json ADDED
The diff for this file is too large to render. See raw diff