gorkemgoknar committed on
Commit
3e96cae
1 Parent(s): cfe96f6

Update README.md

Files changed (1): README.md +139 -11
README.md CHANGED
@@ -1,19 +1,147 @@
- gpt2-turkish-wiki
-
- Current version is demo only with some trained wikipedia text in Turkish.
-
- Using modified https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2_FAST.ipynb
-
- Inference is not so good at the moment.
-
- Epoch train_loss valid_loss accuracy perplexity time
- 0 4.373726 5.398773 0.264228 221.134857 02:56
- 1 4.264910 5.344171 0.267870 209.384140 02:54
-
- TODO: Total turkish wikipedia text is 3 GB xml file
-
- 1 epoch training on full wikipedia turkish gave some good results, will update here when have full model
- epoch train_loss valid_loss accuracy perplexity time
- 0 3.948997 4.001249 0.330571 54.666405 2:41:54
+ ---
+ language:
+ - tr
+ thumbnail:
+ tags:
+ - gpt2
+ - turkish
+
+ license: apache-2.0
+ datasets:
+ - wikipedia-turkish
+ metrics:
+ - perplexity
+ - accuracy
+
+ widget:
+ - text: "Bu yazıyı bir bilgisayar yazdı. Yazarken"
+   context: ""
+ - text: "İnternete kolay erişim sayesinde dünya daha da küçüldü. Bunun sonucunda"
+   context: ""
+
+ ---
+
+ # gpt2-small-turkish
+
+ ## Model description
+
+ This is a GPT2-Small English based model, fine-tuned and further trained on Turkish Wikipedia articles as of 28-10-2020.
+
+ The work is based on Pierre Guillou's tutorial, available at this page:
+ (https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb)
+
+ The code was converted to work with Fastai 2.X.
+
+ Google Colab was used for training.
+
+ An additional tutorial and the source code will be published at https://github.com/gorkemgoknar at a later stage.
+
+ Current accuracy: 28.9%, perplexity: 86.71
+
+ Models are available:
+
+ * [gpt2-small-tuned-tr](https://huggingface.co/gorkemgoknar/gpt2-small-turkish)
+
+ ## Intended uses & limitations
+
+ #### How to use
+
+ #### Install
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelWithLMHead
+ import torch
+
+ tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-small-turkish")
+ model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-small-turkish")
+
+ # Set the maximum sequence length to 1024
+ tokenizer.model_max_length = 1024
+
+ model.eval()  # disable dropout (or leave in train mode to fine-tune)
+
+ ```
+
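+ As an alternative to loading the tokenizer and model separately, the high-level `pipeline` API can also be used. This is not part of the original card; it is a minimal sketch assuming the standard `text-generation` pipeline, with one of the widget prompts above as input:
+
+ ```python
+ from transformers import pipeline
+
+ # build a text-generation pipeline around the fine-tuned Turkish GPT-2 checkpoint
+ generator = pipeline("text-generation", model="gorkemgoknar/gpt2-small-turkish")
+
+ # sample up to 50 tokens from a Turkish prompt (output varies between runs)
+ result = generator("Bu yazıyı bir bilgisayar yazdı. Yazarken", max_length=50, do_sample=True, top_k=40)
+ print(result[0]["generated_text"])
+ ```
+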
+ #### Generate 1 word
+ ```python
+ # input sequence
+ text = "Bu yazıyı bilgisayar yazdı."
+ inputs = tokenizer(text, return_tensors="pt")
+
+ # model output: forward pass, then take the most likely next token
+ outputs = model(**inputs, labels=inputs["input_ids"])
+ loss, logits = outputs[:2]
+ predicted_index = torch.argmax(logits[0, -1, :]).item()
+ predicted_text = tokenizer.decode([predicted_index])
+
+ # results
+ print('input text:', text)
+ print('predicted text:', predicted_text)
+
+ # input text: Bu yazıyı bilgisayar yazdı.
+ # predicted text: (the single most likely next token for this prompt)
+
+ ```
+
+ #### Generate Full Sequence
+ ```python
+ # input sequence
+ text = "Bu yazıyı bilgisayar yazdı."
+ inputs = tokenizer(text, return_tensors="pt")
+
+ # model output using the top-k sampling text generation method
+ sample_outputs = model.generate(inputs.input_ids,
+                                 pad_token_id=50256,
+                                 do_sample=True,
+                                 max_length=50,  # put the number of tokens you want
+                                 top_k=40,
+                                 num_return_sequences=1)
+
+ # generated sequence
+ for i, sample_output in enumerate(sample_outputs):
+     print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist())))
+
+ # >> Generated text 1
+ # (a sampled Turkish continuation of the prompt; the exact text varies on every run)
+
+ ```
+
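+ Because top-k sampling is stochastic, every run produces a different continuation. Not part of the original card: one simple way to make a run repeatable is to fix PyTorch's random seed before calling `generate()`, reusing the `model` and `inputs` defined in the block above:
+
+ ```python
+ import torch
+
+ torch.manual_seed(42)  # any fixed seed makes the sampled continuation repeatable
+ sample_outputs = model.generate(inputs.input_ids,
+                                 pad_token_id=50256,
+                                 do_sample=True,
+                                 max_length=50,
+                                 top_k=40,
+                                 num_return_sequences=1)
+ ```
+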
+ #### Limitations and bias
+
+ The training data used for this model comes from Turkish Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral.
+
+
+ ## Training data
+
+ Turkish Wikipedia article dump as of 28-10-2020.
+
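+ The card does not describe how the dump was preprocessed. Purely as an illustrative sketch (the directory layout, file names, and the assumption that the XML dump has already been converted to plain-text article files with an external extraction tool are all hypothetical, not the author's documented pipeline), the extracted articles could be shuffled and split into training and validation text files like this:
+
+ ```python
+ import glob
+ import random
+
+ # plain-text article files produced by an earlier dump-extraction step (hypothetical paths)
+ article_files = glob.glob("trwiki_extracted/**/*.txt", recursive=True)
+ random.seed(42)
+ random.shuffle(article_files)
+
+ # hold out 10% of the articles for validation
+ split = int(0.9 * len(article_files))
+ train_files, valid_files = article_files[:split], article_files[split:]
+
+ def concat(files, out_path):
+     # write the raw text of every article into one file, articles separated by a blank line
+     with open(out_path, "w", encoding="utf-8") as out:
+         for path in files:
+             with open(path, encoding="utf-8") as f:
+                 out.write(f.read().strip() + "\n\n")
+
+ concat(train_files, "tr_wiki_train.txt")
+ concat(valid_files, "tr_wiki_valid.txt")
+ ```
+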
+ ## Training procedure
+
+
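+ The original training code is not included in this card: the model was fine-tuned with Fastai 2.X on Google Colab, following Pierre Guillou's tutorial linked above, which also adapts the tokenizer and embeddings to the target language and gradually unfreezes layer groups. Purely as an illustrative sketch of an equivalent causal-language-modeling run (not the author's fastai code; it reuses the English GPT-2 tokenizer, the hypothetical `tr_wiki_*.txt` files from the previous section, and placeholder hyperparameters), the same kind of fine-tuning could be set up with the Hugging Face `Trainer`:
+
+ ```python
+ from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
+                           DataCollatorForLanguageModeling, Trainer, TrainingArguments)
+
+ # start from the English GPT-2 small checkpoint and continue training on Turkish text
+ tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
+ model = GPT2LMHeadModel.from_pretrained("gpt2")
+
+ # fixed-size blocks of the Turkish Wikipedia text (TextDataset is the simple legacy helper)
+ train_dataset = TextDataset(tokenizer=tokenizer, file_path="tr_wiki_train.txt", block_size=1024)
+ valid_dataset = TextDataset(tokenizer=tokenizer, file_path="tr_wiki_valid.txt", block_size=1024)
+
+ # causal LM: labels are the inputs themselves, no masking
+ data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
+
+ training_args = TrainingArguments(
+     output_dir="gpt2-small-turkish",
+     num_train_epochs=3,              # the eval table below reports three epochs
+     per_device_train_batch_size=4,   # placeholder; pick what fits the GPU
+ )
+
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     data_collator=data_collator,
+     train_dataset=train_dataset,
+     eval_dataset=valid_dataset,
+ )
+
+ trainer.train()
+ ```
+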
+ ## Eval results
+
+ | epoch | train_loss | valid_loss | accuracy | perplexity | time    | note          |
+ |-------|------------|------------|----------|------------|---------|---------------|
+ | 0     | 6.922922   | 6.653488   | 0.148002 | 775.484253 | 2:26:41 | freeze last 1 |
+ | 1     | 4.799396   | 4.633522   | 0.277028 | 102.875755 | 3:03:38 | freeze last 1 |
+ | 2     | 4.610025   | 4.462641   | 0.289884 | 86.716248  | 2:34:50 | freeze last 2 |
+
+
+
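+ The perplexity above is the exponential of the validation cross-entropy loss. Not part of the original card: as a quick illustration of that relationship, the same quantity can be computed on a single Turkish snippet with the `tokenizer` and `model` from the usage example (a one-snippet number will of course not match the table, which is averaged over the whole validation set):
+
+ ```python
+ import math
+ import torch
+
+ # score a short Turkish snippet with the language-modeling loss (labels = inputs for a causal LM)
+ text = "İnternete kolay erişim sayesinde dünya daha da küçüldü."
+ inputs = tokenizer(text, return_tensors="pt")
+
+ with torch.no_grad():
+     outputs = model(**inputs, labels=inputs["input_ids"])
+
+ loss = outputs[0]  # mean cross-entropy over the tokens of the snippet
+ print("perplexity:", math.exp(loss.item()))
+ ```
+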
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @misc{gorkemgoknar_gpt2_small_turkish,
+   author = {Gorkem Goknar},
+   title = {GPT-2 Small Turkish (gpt2-small-turkish)},
+   year = {2020},
+   howpublished = {\url{https://huggingface.co/gorkemgoknar/gpt2-small-turkish}}
+ }
+ ```