---
language:
- tr
thumbnail:
tags:
- gpt2
- turkish
license: apache-2.0
datasets:
- wikipedia-turkish
metrics:
- perplexity
- accuracy
widget:
- text: Bu yazıyı bir bilgisayar yazdı. Yazarken
context: ''
- text: İnternete kolay erişim sayesinde dünya daha da küçüldü. Bunun sonucunda
context: ''
---
# Turkish GPT2 Model Finetuned
# Türkçe GPT2 Modeli
## Model description
This is a GPT2-small English-based model, fine-tuned and additionally trained on Turkish Wikipedia articles as of 28-10-2020.
Live demo based on this work: https://www.metayazar.com/
A writer model fine-tuned on top of this one: https://huggingface.co/gorkemgoknar/gpt2-turkish-writer
The work follows Pierre Guillou's tutorial, available on this page:
(https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb)
The code was converted to work with fastai 2.x, and training was done on Google Colab.
An additional tutorial and the source code will be published at https://github.com/gorkemgoknar at a later stage.
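The training notebook is not reproduced here, but the overall fastai 2.x setup looks roughly like the sketch below. This is a hypothetical reconstruction following the tutorial's approach, not the exact training code; `dls` stands for a fastai `DataLoaders` assumed to be built over tokenized Turkish Wikipedia text.

```python
# Sketch only: fine-tuning English GPT-2 with fastai 2.x, per the tutorial above.
from fastai.text.all import Callback, CrossEntropyLossFlat, Learner, Perplexity, accuracy
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

class DropOutput(Callback):
    # Hugging Face models return (logits, ...); keep only the logits for fastai
    def after_pred(self):
        self.learn.pred = self.pred[0]

learn = Learner(dls, model,                      # dls: assumed fastai DataLoaders
                loss_func=CrossEntropyLossFlat(),
                cbs=[DropOutput],
                metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1, 2e-3)                     # learning rate is illustrative, not tuned
```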
Current accuracy: 33%, perplexity: 51.88.
Models are available:
* [gpt2-small-turkish](https://huggingface.co/gorkemgoknar/gpt2-small-turkish)
* [gpt2-small-turkish-writer](https://huggingface.co/gorkemgoknar/gpt2-turkish-writer)
## Intended uses & limitations
#### How to use
#### Load the model and tokenizer
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-small-turkish")
model = AutoModelForCausalLM.from_pretrained("gorkemgoknar/gpt2-small-turkish")

# Cap the sequence length at GPT-2's maximum of 1024 tokens
tokenizer.model_max_length = 1024

model.eval()  # disable dropout (or leave in train mode to fine-tune)
```
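For quick experiments, the same model can also be driven through the high-level `pipeline` API (a minimal sketch, not part of the original card):

```python
from transformers import pipeline

# The text-generation pipeline bundles the tokenizer and model in one object
generator = pipeline("text-generation", model="gorkemgoknar/gpt2-small-turkish")
print(generator("Bu yazıyı bir bilgisayar yazdı.", max_length=40)[0]["generated_text"])
```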
#### Generate 1 word
```python
# Input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# Forward pass; passing labels also returns the language-modeling loss
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]

# Greedy pick of the most probable next token
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])

# Results
print('input text:', text)
print('predicted text:', predicted_text)
# input text: Bu yazıyı bilgisayar yazdı.
# predicted text:
```
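Because the forward pass above was given `labels`, it also returns the language-modeling loss, from which perplexity (the metric reported in the eval results below) follows directly; a small sketch:

```python
# Perplexity is the exponential of the mean cross-entropy loss
perplexity = torch.exp(loss)
print('perplexity:', perplexity.item())
```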
#### Generate Full Sequence
```python
# Input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# Model output using the top-k sampling text-generation method
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,   # GPT-2 has no pad token; reuse eos
                                do_sample=True,
                                max_length=50,        # total length in tokens
                                top_k=40,
                                num_return_sequences=1)

# Generated sequence(s)
for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i + 1, tokenizer.decode(sample_output.tolist())))
# >> Generated text 1
#
```
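Top-k sampling restricts each step to the 40 most probable tokens. Nucleus (top-p) sampling is a common alternative supported by the same `generate` call; a sketch with assumed parameter values, not tuned for this model:

```python
# Nucleus sampling: sample from the smallest token set whose cumulative probability >= 0.92
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=50,
                                top_p=0.92,  # assumed value, adjust to taste
                                top_k=0,     # disable the fixed top-k cutoff
                                num_return_sequences=1)
```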
#### Limitations and bias
The training data used for this model comes from Turkish Wikipedia. It nevertheless contains a lot of unfiltered content from the internet, which is far from neutral.
## Training data
Turkish Wikipedia article dump as of 28-10-2020.
## Training procedure
## Eval results
| epoch | train_loss | valid_loss | accuracy | perplexity | time    |
| ----- | ---------- | ---------- | -------- | ---------- | ------- |
| 0     | 4.777015   | 4.621834   | 0.292547 | 101.680367 | 2:42:05 |
| 1     | 4.509412   | 4.403999   | 0.305574 | 81.777267  | 1:09:38 |
| 2     | 4.169529   | 4.120755   | 0.324908 | 61.605747  | 1:07:45 |
| 3     | 4.293973   | 4.177899   | 0.317211 | 65.228653  | 1:07:02 |
| 4     | 4.049848   | 3.949103   | 0.338347 | 51.888783  | 1:05:53 |
Epoch 0 was trained on a Tesla T4, the remaining epochs on a V100.
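The perplexity column is simply the exponential of the validation loss, which can be verified with a one-liner:

```python
import math
# e.g. epoch 4: exp(valid_loss) reproduces the reported perplexity
print(math.exp(3.949103))  # ≈ 51.888783
```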