gorkemgoknar committed
Commit • 3e96cae • Parent(s): cfe96f6

Update README.md

README.md CHANGED
---
language:
- tr
thumbnail:
tags:
- gpt2
- turkish

license: apache-2.0
datasets:
- wikipedia-turkish
metrics:
- perplexity
- accuracy

widget:
- text: "Bu yazıyı bir bilgisayar yazdı. Yazarken"
  context: ""
- text: "İnternete kolay erişim sayesinde dünya daha da küçüldü. Bunun sonucunda"
  context: ""

---

# gpt2-small-turkish

## Model description

This is a GPT2-Small English-based model fine-tuned and additionally trained with Wikipedia articles in Turkish as of 28-10-2020.

The work is based on Pierre Guillou's tutorial, available at
https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb

The code has been converted to work with fastai 2.x. Google Colab was used for training.

An additional tutorial and the source code will be published at https://github.com/gorkemgoknar at a later stage.

Current accuracy: 28.9%, perplexity: 86.71

Available models:

* [gpt2-small-tuned-tr](https://huggingface.co/gorkemgoknar/gpt2-small-turkish)

## Intended uses & limitations

#### How to use

#### Install

```python
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-small-turkish")
model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-small-turkish")

# Use GPT-2's maximum sequence length of 1024 tokens
tokenizer.model_max_length = 1024

model.eval()  # disable dropout (or leave in train mode to fine-tune)
```

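Note that newer releases of `transformers` deprecate `AutoModelWithLMHead`; loading the same checkpoint through `AutoModelForCausalLM` should be equivalent (a small sketch, assuming a recent `transformers` version):

```python
from transformers import AutoModelForCausalLM

# Same checkpoint, loaded with the non-deprecated auto class for causal language models
model = AutoModelForCausalLM.from_pretrained("gorkemgoknar/gpt2-small-turkish")
```
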
#### Generate 1 word
```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])

# results
print('input text:', text)
print('predicted text:', predicted_text)

# input text: Bu yazıyı bilgisayar yazdı.
# predicted text: <the most likely next token for this prompt>
```

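Instead of taking only the argmax, the same logits can be ranked to inspect several candidate continuations; a small sketch (the choice of 5 candidates is arbitrary):

```python
# Top-5 candidate next tokens for the same input, highest logit first
top5 = torch.topk(logits[0, -1, :], k=5)
for score, token_id in zip(top5.values, top5.indices):
    print(f"{tokenizer.decode([token_id.item()])!r}: {score.item():.2f}")
```
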
#### Generate Full Sequence
```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output using the top-k sampling text generation method
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=50,  # set to the number of tokens you want
                                top_k=40,
                                num_return_sequences=1)

# generated sequence
for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i + 1, tokenizer.decode(sample_output.tolist())))

# >> Generated text 1
# <a sampled Turkish continuation of the prompt; output differs between runs because do_sample=True>
```

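The same generation can also be run through the high-level `pipeline` API; a minimal sketch, mirroring the sampling parameters above:

```python
from transformers import pipeline

# Text-generation pipeline backed by the same checkpoint
generator = pipeline("text-generation", model="gorkemgoknar/gpt2-small-turkish")

result = generator("Bu yazıyı bilgisayar yazdı.",
                   max_length=50, do_sample=True, top_k=40, num_return_sequences=1)
print(result[0]["generated_text"])
```
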
#### Limitations and bias

The training data used for this model comes from Turkish Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral.

## Training data

Turkish Wikipedia article dump as of 28-10-2020.

## Training procedure

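Only the outline of the procedure is documented above (GPT2-Small starting point, fastai 2.x, Pierre Guillou's tutorial, gradual unfreezing as noted in the eval results below). A minimal sketch of that kind of fine-tuning loop is shown here; the DataLoaders construction, layer splitter, and learning rates are illustrative assumptions, not the exact configuration used.

```python
# Sketch of a fastai 2.x fine-tuning setup for a Hugging Face GPT-2 model,
# in the spirit of the tutorial referenced above. Values here are illustrative.
from fastai.text.all import *
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

pretrained = "gpt2"  # English GPT2-Small as the starting point
tokenizer = GPT2TokenizerFast.from_pretrained(pretrained)
model = GPT2LMHeadModel.from_pretrained(pretrained)

class DropOutput(Callback):
    "Keep only the logits from the transformers output so fastai can compute the loss."
    def after_pred(self): self.learn.pred = self.pred[0]

# dls: fastai DataLoaders built from the tokenized Turkish Wikipedia dump (not shown here).
# A custom splitter is needed on the Learner for gradual unfreezing of layer groups.
# learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(),
#                 cbs=[DropOutput], metrics=[accuracy, Perplexity()])
# learn.freeze()                 # "freeze last 1": train only the last layer group
# learn.fit_one_cycle(1, 2e-3)
# learn.unfreeze()               # then unfreeze further and continue training
# learn.fit_one_cycle(1, 1e-3)
```
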
## Eval results

| epoch | train_loss | valid_loss | accuracy | perplexity | time |
|-------|------------|------------|----------|------------|------|
| 0     | 6.922922   | 6.653488   | 0.148002 | 775.484253 | 2:26:41 (freeze last 1) |
| 1     | 4.799396   | 4.633522   | 0.277028 | 102.875755 | 3:03:38 (freeze last 1) |
| 2     | 4.610025   | 4.462641   | 0.289884 | 86.716248  | 2:34:50 (freeze last 2) |

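The reported perplexity is simply the exponential of the validation loss; for the final epoch, exp(4.462641) ≈ 86.72, matching the table:

```python
import math

valid_loss = 4.462641         # final-epoch validation loss from the table above
print(math.exp(valid_loss))   # ~86.72, i.e. the reported perplexity
```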