🇹🇷 Turkish GPT-2 Model

In this repository I release a GPT-2 model that was trained on various Turkish texts.

The model is meant to be an entry point for fine-tuning on other texts.

Training corpora

I used a Turkish corpus taken from oscar-corpus.

The Hugging Face Tokenizers library made it possible to create a byte-level BPE vocabulary.

With it, I created a 52K byte-level BPE vocab from the training corpus.
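
To make the term "byte-level BPE" concrete, here is a toy sketch in plain Python. It is not the Tokenizers implementation and not the code used for this model: the function names (`learn_byte_level_bpe`, `apply_merge`, `encode`) are made up for illustration, and unlike real tokenizers it does not pre-split text on whitespace, so merges may cross word boundaries. It only shows the core idea: start from raw UTF-8 bytes and repeatedly merge the most frequent adjacent pair.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace each non-overlapping occurrence of `pair` in `seq` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_byte_level_bpe(texts, num_merges):
    """Learn up to `num_merges` merges over UTF-8 bytes (hypothetical toy helper)."""
    # Every text starts as a sequence of single-byte tokens, so any Unicode
    # string (including Turkish letters such as "ş" or "ğ") is representable.
    sequences = [[bytes([b]) for b in text.encode("utf-8")] for text in texts]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        # Merge the most frequent pair into a single new token.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def encode(text, merges):
    """Tokenize `text` by replaying the learned merges in order."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq
```

Because the base alphabet is the 256 possible byte values, text containing characters never seen during training still encodes without an unknown token; it simply falls back to shorter byte sequences.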

After creating the vocab, I trained the Turkish GPT-2 on two 2080 Ti GPUs over the complete training corpus for five epochs.

Logs during training: https://tensorboard.dev/experiment/3AWKv8bBTaqcqZP5frtGkw/#scalars

Model weights

Both PyTorch and TensorFlow compatible weights are available.

Model: redrussianarmy/gpt2-turkish-cased
Downloads: config.json, merges.txt, pytorch_model.bin, special_tokens_map.json, tf_model.h5, tokenizer_config.json, traning_args.bin, vocab.json

Using the model

The model itself can be used in this way:

from transformers import AutoTokenizer, AutoModelForCausalLM

# AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead class
tokenizer = AutoTokenizer.from_pretrained("redrussianarmy/gpt2-turkish-cased")
model = AutoModelForCausalLM.from_pretrained("redrussianarmy/gpt2-turkish-cased")

Here's an example that shows how to use the great Transformers Pipelines for generating text:

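from transformers import pipeline

# Load the model and its tokenizer from the same Hub repository
pipe = pipeline('text-generation', model="redrussianarmy/gpt2-turkish-cased",
                 tokenizer="redrussianarmy/gpt2-turkish-cased")
# Pass generation settings such as max_length when calling the pipeline
text = pipe("Akşamüstü yolda ilerlerken, ", max_length=800)[0]["generated_text"]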
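
If you want several different texts for the same prompt, sampling is one option. The snippet below is a minimal sketch: `sample_continuations` and `main` are hypothetical helper names, and the values for `n` and `max_length` are arbitrary defaults rather than recommended settings. It relies on the text-generation pipeline accepting `max_length`, `do_sample`, and `num_return_sequences` at call time and returning a list of dicts with a `generated_text` key.

```python
def sample_continuations(pipe, prompt, n=3, max_length=100):
    """Return `n` sampled generations for `prompt` from a text-generation pipeline."""
    outputs = pipe(
        prompt,
        max_length=max_length,     # total length in tokens, prompt included
        do_sample=True,            # sample instead of greedy decoding
        num_return_sequences=n,    # one output dict per sampled text
    )
    return [out["generated_text"] for out in outputs]


def main():
    # Imported here so the helper above can be reused without loading the model.
    from transformers import pipeline

    pipe = pipeline("text-generation", model="redrussianarmy/gpt2-turkish-cased")
    for text in sample_continuations(pipe, "Akşamüstü yolda ilerlerken, "):
        print(text)
        print("-" * 40)
```

Calling `main()` downloads the model on first use. Each returned string starts with the prompt, since the pipeline returns the full text by default.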

How to clone the model repo?

git lfs install
git clone https://huggingface.co/redrussianarmy/gpt2-turkish-cased

Contact (Bugs, Feedback, Contribution and more)

For questions about the GPT2-Turkish model, just open an issue here 🤗

