---
language:
- tr
thumbnail:
tags:
- gpt2
- turkish
- aiwriter
- finetuned
license: apache-2.0
datasets:
- wikipedia-turkish
- custom-book-corpus
metrics:
- perplexity
- accuracy
widget:
- text: Bir zaman topu olan ama köpeği olmayan bir çocuk vardı. Parkta
  context: ''
- text: 'Uzun uzun sahile doğru baktı. Düşündüklerinden '
  context: ''
- text: Çok uzun zaman önce galaksinin uzak bir köşesinde...
  context: ''
- text: "'Bugün kendimi çok hasta hissediyorum' dedi. Karşısında "
  context: ''
---

# Turkish AI Writer based on GPT2-Small
# Türkçe Yapay Zeka Yazarı

## Model description

This model is an enhanced version of the fine-tuned gpt2-small-turkish model. In addition to the 28-10-2020 Turkish Wikipedia article dump, it was trained on more than 400 classic novels and plays in Turkish (including Dostoevsky, Shakespeare, and Dumas).

The base work follows Pierre Guillou's tutorial, available here:
(https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb)

Note that since Turkish is not as close to English as Portuguese is, the last 3 layers were trained instead of only the last 2.

The code was converted to work with fastai 2.x, and Google Colab was used for training.
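
The actual fine-tuning loop is the fastai one from the notebook linked above; the snippet below is only a minimal plain-PyTorch sketch of the layer-freezing idea (everything frozen except the last 3 transformer blocks and the final layer norm), not the training code that was used.

```python
from transformers import AutoModelWithLMHead

model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-turkish-writer")

# Freeze every parameter first.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only the last 3 transformer blocks (GPT2-small has 12: h.0 ... h.11)
# plus the final layer norm; everything else stays frozen during fine-tuning.
for module in list(model.transformer.h[-3:]) + [model.transformer.ln_f]:
    for param in module.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} / {total:,}")
```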

Current accuracy: 36.3%, perplexity: 44.75

A demo (using CPU inference) is available at: http://www.metayazar.com

Models are available:

* [gpt2-small-tuned-tr](https://huggingface.co/gorkemgoknar/gpt2-small-turkish)
* [gpt2-small-turkish-writer](https://huggingface.co/gorkemgoknar/gpt2-turkish-writer)

## Intended uses & limitations

#### How to use

#### Install
```python
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-turkish-writer")
model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-turkish-writer")

# Set the maximum sequence length to 1024 tokens
tokenizer.model_max_length = 1024

model.eval()  # disable dropout (or leave in train mode to finetune)
```
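
As a convenience (not part of the original notebook, just the standard `transformers` shortcut), the same checkpoint can also be loaded through a text-generation pipeline:

```python
from transformers import pipeline

# One call wraps tokenizer + model loading and generation for this checkpoint
generator = pipeline("text-generation", model="gorkemgoknar/gpt2-turkish-writer")

result = generator("Bu yazıyı bilgisayar yazdı.", max_length=50, do_sample=True, top_k=40)
print(result[0]["generated_text"])
```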

#### Generate 1 word
```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])

# results
print('input text:', text)
print('predicted text:', predicted_text)

# input text:
# predicted text:
```

#### Generate Full Sequence
```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output using Top-k sampling text generation method
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=50,  # maximum number of tokens to generate
                                top_k=40,
                                num_return_sequences=1)

# generated sequence
for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist())))

# >> Generated text
#
```

#### Limitations and bias

The training data used for this model comes from Turkish Wikipedia and books. We know it contains a lot of unfiltered content from the internet, which is far from neutral. Also, since little pre-processing was done on the books, chapter names and page numbers can appear in some outputs. This is a work in progress.

## Training data

* Wikipedia Turkish article dump as of 28-10-2020
* Turkish book dataset of >400 classic novels

## Training procedure

## Eval results

| epoch | train_loss | valid_loss | accuracy | perplexity | time    |
| ----- | ---------- | ---------- | -------- | ---------- | ------- |
| 0     | 4.497828   | 4.549605   | 0.277328 | 94.595070  | 2:09:58 |
| 1     | 4.503929   | 4.519456   | 0.275071 | 91.785645  | 2:04:30 |
| 2     | 3.612716   | 3.921146   | 0.344802 | 50.458256  | 2:03:22 |
| 3     | 3.777645   | 4.072006   | 0.326130 | 58.674530  | 1:56:14 |
| 4     | 2.934462   | 3.801303   | 0.363719 | 44.759476  | 1:58:55 |

Note: the 1cycle training policy was used, and the epochs were run at different times.
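
As the table values confirm, the perplexity column is simply the exponential of the validation (cross-entropy) loss, so the two track each other; for example, for the final epoch:

```python
import math

valid_loss = 3.801303          # final-epoch validation loss from the table above
print(math.exp(valid_loss))    # ~44.76, matching the reported perplexity of 44.759476
```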