---
language:
- tr
thumbnail:
tags:
- gpt2
- turkish
- aiwriter
- finetuned
license: apache-2.0
datasets:
- wikipedia-turkish
- custom-book-corpus
metrics:
- perplexity
- accuracy
widget:
- text: Bir zaman topu olan ama köpeği olmayan bir çocuk vardı. Parkta
  context: ''
- text: 'Uzun uzun sahile doğru baktı. Düşündüklerinden '
  context: ''
- text: Çok uzun zaman önce galaksinin uzak bir köşesinde...
  context: ''
- text: "'Bugün kendimi çok hasta hissediyorum' dedi. Karşısında "
  context: ''
---
# Turkish AI Writer based on GPT2-Small
# Türkçe Yapay Zeka Yazarı
## Model description
This model is an enhanced version of the fine-tuned gpt2-small-turkish model. In addition to the 28-10-2020 Turkish Wikipedia article dump, this model was trained on more than 400 classic novels and plays in Turkish (including works by Dostoyevsky, Shakespeare, and Dumas).
The base work follows Pierre Guillou's tutorial, available on this page:
(https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb)
Note that since Turkish is not as close to English as Portuguese is, the last 3 layers were trained instead of only the last 2.
The code was converted to work with fastai 2.x.
Google Colab was used for training.
Current accuracy: 36.3%, perplexity: 44.75
Demo (using CPU inference) is available at: http://www.metayazar.com
Models are available:
* [gpt2-small-tuned-tr](https://huggingface.co/gorkemgoknar/gpt2-small-turkish)
* [gpt2-small-turkish-writer](https://huggingface.co/gorkemgoknar/gpt2-turkish-writer)
## Intended uses & limitations
#### How to use
#### Install
```python
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-turkish-writer")
model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-turkish-writer")

# Get sequence length max of 1024
tokenizer.model_max_length = 1024

model.eval()  # disable dropout (or leave in train mode to finetune)
```
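Note: newer versions of the `transformers` library deprecate `AutoModelWithLMHead` in favor of `AutoModelForCausalLM`. If the import above warns or fails on your install, the equivalent loading call is a small sketch like this:

```python
from transformers import AutoModelForCausalLM

# Same checkpoint, loaded through the non-deprecated causal-LM auto class
model = AutoModelForCausalLM.from_pretrained("gorkemgoknar/gpt2-turkish-writer")
```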
#### Generate 1 word
```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])

# results
print('input text:', text)
print('predicted text:', predicted_text)

# input text:
# predicted text:
```
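If you want to see more than the single most likely continuation, a small sketch reusing the `logits` computed above can list the top candidates (the choice of 5 is arbitrary):

```python
# Sketch: show the 5 highest-scoring next tokens for the same input
top5 = torch.topk(logits[0, -1, :], k=5)
for score, token_id in zip(top5.values, top5.indices):
    print(tokenizer.decode([token_id.item()]), float(score))
```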
#### Generate Full Sequence
```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output using Top-k sampling text generation method
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=50,  # put the token number you want
                                top_k=40,
                                num_return_sequences=1)

# generated sequence
for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist())))

# >> Generated text
#
```
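For more varied text you can also switch to nucleus (top-p) sampling; the variant below is only an illustration with arbitrary `top_p`/`temperature` values, not settings used for this model card:

```python
# Nucleus sampling variant of the call above (illustrative parameter values)
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=50,
                                top_p=0.92,       # keep the smallest token set with cumulative prob >= 0.92
                                temperature=0.9,  # soften the distribution slightly before sampling
                                num_return_sequences=3)

for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist())))
```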
#### Limitations and bias
The training data used for this model comes from Turkish Wikipedia and books. We know it contains a lot of unfiltered content from the internet, which is far from neutral. Also, not much pre-processing was done on the books, so chapter names and page numbers appear in some cases. This is a work in progress.
## Training data
* Turkish Wikipedia article dump as of 28-10-2020
* Turkish book dataset of more than 400 classic novels and plays
## Training procedure
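The original training notebook is not part of this card, but the description above (fastai 2.x on Google Colab, training only the last 3 layers, 1cycle policy) maps onto a rough sketch like the following. The freezing step only illustrates the "last 3 layers" idea on the Hugging Face GPT-2 module layout, and `train_loader` is a placeholder for a dataloader over the tokenized corpus, not the tutorial's actual code:

```python
import torch
from transformers import AutoTokenizer, AutoModelWithLMHead

# Start from the base Turkish GPT-2 checkpoint that this writer model builds on
tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-small-turkish")
model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-small-turkish")

# Freeze everything, then unfreeze only the last 3 transformer blocks
# and the final layer norm ("train last 3 layers instead of last 2")
for param in model.parameters():
    param.requires_grad = False
for module in list(model.transformer.h[-3:]) + [model.transformer.ln_f]:
    for param in module.parameters():
        param.requires_grad = True

optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=5e-5)

# The real run used fastai 2.x's fit_one_cycle; a bare PyTorch loop over a
# placeholder `train_loader` of token-id batches would look like this:
model.train()
for batch in train_loader:
    loss = model(batch, labels=batch)[0]  # language-modeling loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```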
## Eval results

| epoch | train_loss | valid_loss | accuracy | perplexity | time    |
| ----- | ---------- | ---------- | -------- | ---------- | ------- |
| 0     | 4.497828   | 4.549605   | 0.277328 | 94.595070  | 2:09:58 |
| 1     | 4.503929   | 4.519456   | 0.275071 | 91.785645  | 2:04:30 |
| 2     | 3.612716   | 3.921146   | 0.344802 | 50.458256  | 2:03:22 |
| 3     | 3.777645   | 4.072006   | 0.326130 | 58.674530  | 1:56:14 |
| 4     | 2.934462   | 3.801303   | 0.363719 | 44.759476  | 1:58:55 |
Note: the 1cycle training policy was used, and epochs were run at different times.