Commit d1dc31a (parent 02d36dd) by gorkemgoknar: Update README.md (added model card and details)
---
language:
- tr
thumbnail:
tags:
- gpt2
- turkish
- aiwriter
- finetuned

license: apache-2.0
datasets:
- wikipedia-turkish
- custom-book-corpus
metrics:
- perplexity
- accuracy

widget:
- text: "Bir zaman topu olan ama köpeği olmayan bir çocuk vardı. Parkta"
  context: ""
- text: "Uzun uzun sahile doğru baktı. Düşündüklerinden "
  context: ""
- text: "Çok uzun zaman önce galaksinin uzak bir köşesinde..."
  context: ""
- text: "'Bugün kendimi çok hasta hissediyorum' dedi. Karşısında "
  context: ""
---

# gpt2-turkish-writer

## Model description

This model is an enhanced version of the fine-tuned gpt2-small-turkish model. In addition to the 28-10-2020 Turkish Wikipedia article dump, it was trained on more than 400 classic novels and plays in Turkish (including works by Dostoevsky, Shakespeare, and Dumas).

The base work follows Pierre Guillou's tutorial:
https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb

Note that since Turkish is not as close to English as Portuguese is, the last 3 layers were trained instead of only the last 2.

The code was converted to work with fastai 2.X. Training was done on Google Colab.

Current accuracy: 36.3%, perplexity: 44.75

Available models:

* [gpt2-small-turkish](https://huggingface.co/gorkemgoknar/gpt2-small-turkish)
* [gpt2-turkish-writer](https://huggingface.co/gorkemgoknar/gpt2-turkish-writer)

## Intended uses & limitations

#### How to use

#### Install

```python
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-turkish-writer")
model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-turkish-writer")

# Cap inputs at the model's maximum sequence length of 1024 tokens
tokenizer.model_max_length = 1024

model.eval()  # disable dropout (or leave in train mode to fine-tune)
```

#### Generate 1 word

```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")  # "pt" returns PyTorch tensors (it is not a language code)

# model output
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])

# results
print('input text:', text)
print('predicted text:', predicted_text)

# input text:
# predicted text:
```
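The single-word snippet above is greedy decoding: take the argmax over the vocabulary logits at the last position. A self-contained illustration with toy logits (hypothetical values, only to show the indexing, not output of the actual model):

```python
import torch

# Toy logits: batch of 1, sequence of 3 positions, vocabulary of 5 tokens
logits = torch.tensor([[[0.1, 0.2, 0.3, 0.1, 0.3],
                        [0.5, 0.1, 0.1, 0.2, 0.1],
                        [0.1, 0.1, 0.1, 0.1, 0.6]]])

# Same indexing as in the snippet above: argmax over the last position's logits
predicted_index = torch.argmax(logits[0, -1, :]).item()
print(predicted_index)  # → 4, the token with the highest logit at the last position
```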

#### Generate Full Sequence

```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")  # "pt" returns PyTorch tensors

# model output using the Top-k sampling text generation method
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=50,  # set to the number of tokens you want
                                top_k=40,
                                num_return_sequences=1)

# generated sequence
for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist())))

# >> Generated text
#
```
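For reference, the Top-k step that `generate(..., do_sample=True, top_k=40)` applies at each position can be sketched by hand: keep the 40 largest logits, renormalize, and sample from the survivors. A minimal stand-alone sketch over random logits (not part of the original card; `generate` handles all of this internally):

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int) -> int:
    # Keep the k largest logits; mask the rest to -inf so softmax zeroes them
    topk_values, topk_indices = torch.topk(logits, k)
    masked = torch.full_like(logits, float("-inf"))
    masked[topk_indices] = topk_values
    # Renormalize over the surviving candidates and sample one token id
    probs = torch.softmax(masked, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Illustration with random logits over a GPT-2-sized vocabulary
logits = torch.randn(50257)
token_id = sample_top_k(logits, k=40)
```

The sampled `token_id` is always one of the 40 highest-scoring tokens, which is what keeps Top-k generation from drifting into low-probability continuations.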

#### Limitations and bias

The training data used for this model comes from Turkish Wikipedia and books. We know it contains a lot of unfiltered content from the internet, which is far from neutral. Also, not much pre-processing was done on the books, so chapter names and page numbers appear in some cases. This is a work in progress.

## Training data

* Turkish Wikipedia article dump as of 28-10-2020
* Turkish book dataset of >400 classic novels

## Training procedure

## Eval results

| epoch | train_loss | valid_loss | accuracy | perplexity | time    |
| ----- | ---------- | ---------- | -------- | ---------- | ------- |
| 0     | 4.497828   | 4.549605   | 0.277328 | 94.595070  | 2:09:58 |
| 1     | 4.503929   | 4.519456   | 0.275071 | 91.785645  | 2:04:30 |
| 2     | 3.612716   | 3.921146   | 0.344802 | 50.458256  | 2:03:22 |
| 3     | 3.777645   | 4.072006   | 0.326130 | 58.674530  | 1:56:14 |
| 4     | 2.934462   | 3.801303   | 0.363719 | 44.759476  | 1:58:55 |
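The perplexity column is simply the exponential of the validation loss, which is easy to verify from any row of the table (a quick sanity check, not from the original card):

```python
import math

# Perplexity is exp(valid_loss); check against the epoch-4 row of the table
valid_loss = 3.801303
perplexity = math.exp(valid_loss)
print(perplexity)  # close to the reported 44.759476
```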