metadata

license: mit
tags:
  - pytorch
  - sinhala
  - gpt2
model-index:
  - name: sinhala-gpt2
    results: []
widget:
  - text: මහ
  - text: සංවිධ
  - text: දුර්ලභ
  - text: තනිවීලා
  - text: ඔබ
inference:
  parameters:
    do_sample: false
    temperature: 0.2
language:
  - si

sinhala-gpt2

This model is a fine-tuned version of gpt2, trained on a dataset of Sinhala news collected from various sources. Although it is a simple fine-tune, it can generate text that closely resembles real news articles. Here are two samples (some outputs are hilarious, though :D):

"ඔබ විසින් මෙම විරෝධතාව සංවිධානය කර තිබුණේ නැහැ කියලා හිටපු ජනාධිපති මහ"
"දුර්ලභ ගණයේ විශ්වවිද්යාල ප්රතිපාදන කොමිෂන් සභාවේ සභාපති මහාචාර්ය ජී එල්"

⚠️ Since the dataset used for this model is mostly composed of news articles, it is heavily biased towards generating news content. This bias may become apparent during the generation process.

Training procedure

The model was trained for more than 12 hours on Kaggle GPUs.

Usage Details

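from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Load the tokenizer and the fine-tuned model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Ransaka/sinhala-gpt2")
model = AutoModelForCausalLM.from_pretrained("Ransaka/sinhala-gpt2")

# Build a text-generation pipeline from the loaded model and tokenizer
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
generator("දුර")  # example output: දුර ඈත පාසැල් වියේ පසුවූයේ මෙම සිද්ධිය සම්බන්ධයෙන් විමර්ශන සිදුකරන බවයි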

Alternatively, clone the model repository with git:

git lfs install
git clone https://huggingface.co/Ransaka/sinhala-gpt2

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 2e-05
train_batch_size: 16
eval_batch_size: 16
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 3
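
To make the linear schedule concrete, here is a minimal sketch of how the learning rate would decay over the run. It assumes zero warmup steps (warmup is not reported in this card) and takes the total step count from the Training results table (15323 steps per epoch, 3 epochs). The function name `linear_lr` is illustrative only and is not part of any library.

```python
BASE_LR = 2e-05          # learning_rate from the hyperparameter list
STEPS_PER_EPOCH = 15323  # step count at epoch 1.0 in the Training results table
NUM_EPOCHS = 3           # num_epochs from the hyperparameter list
TOTAL_STEPS = STEPS_PER_EPOCH * NUM_EPOCHS
WARMUP_STEPS = 0         # assumption: warmup is not reported in this card


def linear_lr(step: int) -> float:
    """Learning rate after `step` optimizer updates with linear warmup and decay."""
    if step < WARMUP_STEPS:
        # Ramp up linearly from 0 to BASE_LR during warmup
        return BASE_LR * step / max(1, WARMUP_STEPS)
    # Decay linearly from BASE_LR to 0 over the remaining steps
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / max(1, TOTAL_STEPS - WARMUP_STEPS)


for epoch in range(NUM_EPOCHS + 1):
    step = epoch * STEPS_PER_EPOCH
    print(f"epoch {epoch}: step {step:>5}, lr = {linear_lr(step):.2e}")
```

Under these assumptions the rate starts at 2e-05, falls to about two thirds of that after the first epoch, and reaches zero at the final step.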

Training results

Training Loss	Epoch	Step	Validation Loss
2.3015	1.0	15323	2.3498
1.8582	2.0	30646	1.9921
1.5491	3.0	45969	1.9376
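
For readers who prefer perplexity, the validation losses above can be converted with a short calculation. This is a sketch that assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not stated explicitly in this card.

```python
import math

# (epoch, validation loss) pairs copied from the table above
validation_losses = [(1, 2.3498), (2, 1.9921), (3, 1.9376)]


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(loss)


perplexities = {epoch: perplexity(loss) for epoch, loss in validation_losses}
for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: validation perplexity ~ {ppl:.2f}")
```

With that assumption, validation perplexity drops from roughly 10.5 after the first epoch to roughly 6.9 after the third.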

Framework versions

Transformers 4.26.1
PyTorch 1.13.0
Datasets 2.1.0
Tokenizers 0.13.2