gorkemgoknar committed on
Commit
d1dc31a
1 Parent(s): 02d36dd

Update README.md


added model card and details

Files changed (1): README.md (+138, −2)
---
language:
- tr
thumbnail:
tags:
- gpt2
- turkish
- aiwriter
- finetuned
license: apache-2.0
datasets:
- wikipedia-turkish
- custom-book-corpus
metrics:
- perplexity
- accuracy
widget:
- text: "Bir zaman topu olan ama köpeği olmayan bir çocuk vardı. Parkta"
  context: ""
- text: "Uzun uzun sahile doğru baktı. Düşündüklerinden "
  context: ""
- text: "Çok uzun zaman önce galaksinin uzak bir köşesinde..."
  context: ""
- text: "'Bugün kendimi çok hasta hissediyorum' dedi. Karşısında "
  context: ""
---
29
+
30
+ # MyModel
31
+
32
+ ## Model description
33
+
34
+ This model is enhanced version of gpt2-small-turkish finetuned version. In addition to 28-10-2020 Wikipedia Turkish article dump this model is trained with more than 400 classic novels and plays in Turkish (Including Dostoyevski, Shaekspeare, Dumas)
35
+
36
+ Base work has been done on Pierre Guillou tutorial as on this page.
37
+ (https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb)
38
+
39
+ Note that Since Turkish language is not close to English as in Porteguese instead of training last 2 layers, last 3 layers are trained.
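For context, a minimal sketch of what that unfreezing step can look like in fastai 2.x (an illustration under assumptions, not the author's exact notebook: `learn` is assumed to be a `Learner` already wrapping the GPT-2 model, and the epoch count and learning rate are hypothetical):

```python
from fastai.text.all import Learner  # fastai 2.x

def train_last_n_groups(learn: Learner, n: int = 3):
    # freeze_to(-n) keeps only the last n parameter groups trainable;
    # the Portuguese tutorial trained the last 2, this model the last 3.
    learn.freeze_to(-n)
    learn.fit_one_cycle(1, lr_max=1e-3)  # hypothetical schedule / learning rate
```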

The code was converted to work with fastai 2.x, and training was done on Google Colab.

Current accuracy: 36.3%, perplexity: 44.75

Models are available:

* [gpt2-small-tuned-tr](https://huggingface.co/gorkemgoknar/gpt2-small-turkish)
* [gpt2-small-turkish-writer](https://huggingface.co/gorkemgoknar/gpt2-turkish-writer)

## Intended uses & limitations

#### How to use

#### Install

```python
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-turkish-writer")

# AutoModelWithLMHead is deprecated in newer transformers releases;
# AutoModelForCausalLM is the current equivalent for GPT-2.
model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-turkish-writer")

# Cap inputs at GPT-2's maximum sequence length of 1024 tokens
tokenizer.model_max_length = 1024

model.eval()  # disable dropout (or leave in train mode to fine-tune)
```
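For quick experiments, the high-level `pipeline` API is an equivalent way to load the same checkpoint (a convenience sketch, not part of the original card):

```python
from transformers import pipeline

# Loads both tokenizer and model in one call.
generator = pipeline("text-generation", model="gorkemgoknar/gpt2-turkish-writer")
print(generator("Bir zaman topu olan ama köpeği olmayan bir çocuk vardı.", max_length=50))
```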

#### Generate 1 word
```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")  # "pt" requests PyTorch tensors (the framework, not a language code)

# model output: a forward pass returns the loss (given labels) and next-token logits
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]
predicted_index = torch.argmax(logits[0, -1, :]).item()  # greedy choice for the next token
predicted_text = tokenizer.decode([predicted_index])

# results
print('input text:', text)
print('predicted text:', predicted_text)

# input text:
# predicted text:
```
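To see more than the single best candidate, the logits from the block above can be ranked with `torch.topk` (an illustrative extension reusing `logits` and `tokenizer` from the previous block, not in the original card):

```python
# Rank the 5 most likely next tokens instead of taking only the argmax.
top_scores, top_indices = torch.topk(logits[0, -1, :], k=5)
for score, idx in zip(top_scores.tolist(), top_indices.tolist()):
    print(f"{tokenizer.decode([idx])!r}: {score:.2f}")
```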

#### Generate Full Sequence
```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")  # "pt" requests PyTorch tensors

# model output using the top-k sampling text generation method
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,  # GPT-2 has no pad token; reuse the eos token id
                                do_sample=True,
                                max_length=50,       # total length (prompt + generated tokens)
                                top_k=40,
                                num_return_sequences=1)

# generated sequence
for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i + 1, tokenizer.decode(sample_output.tolist())))

# >> Generated text
#
```
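Sampling behaviour can be tuned further; below is a hedged variant combining nucleus (top-p) sampling with a temperature (parameter values are illustrative, not the settings used for the reported results):

```python
# Nucleus sampling variant: sample only from the smallest token set whose
# cumulative probability exceeds top_p, with mildly flattened logits.
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=50,
                                top_p=0.95,       # illustrative value
                                temperature=0.8,  # illustrative value
                                num_return_sequences=1)
```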

#### Limitations and bias

The training data for this model come from Turkish Wikipedia and books. We know they contain a lot of unfiltered content from the internet, which is far from neutral. Also, little pre-processing was done on the books, so chapter names and page numbers appear in some cases. This is a work in progress.

## Training data

* Wikipedia Turkish article dump as of 28-10-2020
* Turkish book dataset of >400 classic novels

## Training procedure

Fine-tuned with fastai 2.x on Google Colab, training the last 3 layer groups of the gpt2-small-turkish base model (see "Model description" above).

## Eval results

| epoch | train_loss | valid_loss | accuracy | perplexity | time    |
| ----- | ---------- | ---------- | -------- | ---------- | ------- |
| 0     | 4.497828   | 4.549605   | 0.277328 | 94.595070  | 2:09:58 |
| 1     | 4.503929   | 4.519456   | 0.275071 | 91.785645  | 2:04:30 |
| 2     | 3.612716   | 3.921146   | 0.344802 | 50.458256  | 2:03:22 |
| 3     | 3.777645   | 4.072006   | 0.326130 | 58.674530  | 1:56:14 |
| 4     | 2.934462   | 3.801303   | 0.363719 | 44.759476  | 1:58:55 |
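Perplexity here is the exponential of the validation loss; a quick sanity check against the final epoch (added for illustration, not from the original card):

```python
import math

valid_loss = 3.801303
print(math.exp(valid_loss))  # ~44.7595, matching the reported perplexity
```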