--- language: - tr thumbnail: tags: - gpt2 - turkish license: Apache 2.0 datasets: - wikipedia-turkish metrics: - perplexity - accuracy widget: - text: "Bu yazıyı bir bilgisayar yazdı. Yazarken" context: "" - text: "İnternete kolay erişim sayesinde dünya daha da küçüldü. Bunun sonucunda" context: "" --- # MyModel ## Model description This is a GPT2-Small English based model finetuned and additionaly trainied with Wikipedia Articles in Turkish as of 28-10-2020 Work has been done on Pierre Guillou tutorial as on this page. (https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb) Code is converted to work with Fastai 2.X . Using Google Colab for training. Additional tutorial and source will be in https://github.com/gorkemgoknar in later stage. Current accuracy 28.9 % , Perplexity : 86.71 Models are available: * [gpt2-small-tuned-tr] (https://huggingface.co/gorkemgoknar/gpt2-small-turkish) ## Intended uses & limitations #### How to use #### Install ```python from transformers import AutoTokenizer, AutoModelWithLMHead import torch tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-small-turkish") model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-small-turkish") # Get sequence length max of 1024 tokenizer.model_max_length=1024 model.eval() # disable dropout (or leave in train mode to finetune) ``` #### Generate 1 word ```python # input sequence text = "Bu yazıyı bilgisayar yazdı." inputs = tokenizer(text, return_tensors="pt") # model output outputs = model(**inputs, labels=inputs["input_ids"]) loss, logits = outputs[:2] predicted_index = torch.argmax(logits[0, -1, :]).item() predicted_text = tokenizer.decode([predicted_index]) # results print('input text:', text) print('predicted text:', predicted_text) # input text: Quem era Jim Henson? Jim Henson era um # predicted text: homem ``` #### Generate Full Sequence ```python # input sequence text = "Bu yazıyı bilgisayar yazdı." inputs = tokenizer(text, return_tensors="pt") # model output using Top-k sampling text generation method sample_outputs = model.generate(inputs.input_ids, pad_token_id=50256, do_sample=True, max_length=50, # put the token number you want top_k=40, num_return_sequences=1) # generated sequence for i, sample_output in enumerate(sample_outputs): print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist()))) # >> Generated text # Quem era Jim Henson? Jim Henson era um executivo de televisão e diretor de um grande estúdio de cinema mudo chamado Selig, # depois que o diretor de cinema mudo Georges Seuray dirigiu vários filmes para a Columbia e o estúdio. ``` #### Limitations and bias The training data used for this model come from Turkish Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral. ## Training data Wikipedia Turkish article dump as of 28-10-2020 ## Training procedure ## Eval results epoch train_loss valid_loss accuracy perplexity time 0 6.922922 6.653488 0.148002 775.484253 2:26:41 1 4.799396 4.633522 0.277028 102.875755 3:03:38 2 4.610025 4.462641 0.289884 86.716248 2:34:50 ```