---
license: mit
tags:
- pytorch
- gpt2
model-index:
- name: sinhala-gpt2
  results: []
widget:
- text: මහ
- text: සංවිධ
- text: දුර්ලභ
- text: තනිවීලා
- text: ඔබ
inference:
  parameters:
    do_sample: false
    temperature: 0.2
    max_new_tokens: 100
language:
- si
---
# sinhala-gpt2
This model was fine-tuned from the gpt2 architecture on a dataset of Sinhala news gathered from various sources. Although it is quite simple to train, it is still capable of generating news snippets that closely resemble real articles. Take, for example, the following samples (some of them are hilarious, though :D):
- "ඔබ විසින් මෙම විරෝධතාව සංවිධානය කර තිබුණේ නැහැ කියලා හිටපු ජනාධිපති මහ"
- "දුර්ලභ ගණයේ විශ්වවිද්යාල ප්රතිපාදන කොමිෂන් සභාවේ සභාපති මහාචාර්ය ජී එල්"
⚠️ Since the dataset used for this model is mostly composed of news articles, it is heavily biased toward generating news content. This bias may become apparent during the generation process.
## Training procedure
The model was trained for 12+ hours on Kaggle GPUs.
## Usage Details
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("Ransaka/sinhala-gpt2")
model = AutoModelForCausalLM.from_pretrained("Ransaka/sinhala-gpt2")

# Wrap the model and tokenizer in a text-generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

generator("දුර")  # දුර ඈත පාසැල් වියේ පසුවූයේ මෙම සිද්ධිය සම්බන්ධයෙන් විමර්ශන සිදුකරන බවයි
```
Or clone the repository with git:

```bash
git lfs install
git clone https://huggingface.co/Ransaka/sinhala-gpt2
```
## Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3
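For reference, here is a minimal sketch of how these hyperparameters would map onto `TrainingArguments`. The actual training script is not included in this card, so treat this as an illustration rather than the exact setup (the output path and evaluation strategy are assumptions):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="sinhala-gpt2",      # assumed output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",     # Adam betas/epsilon above are the library defaults
    num_train_epochs=3,
    evaluation_strategy="epoch",    # assumption: one eval per epoch, matching the results table
)
```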
## Training results
| Training Loss | Epoch | Step  | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 2.0233        | 1.0   | 15323 | 2.3348          |
| 1.6938        | 2.0   | 30646 | 1.8377          |
| 1.4938        | 3.0   | 45969 | 1.6498          |
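Since the validation loss is a per-token cross-entropy, it converts to perplexity via exp(loss). These derived figures are not reported in the original card:

```python
import math

# Perplexity = exp(cross-entropy loss), computed from the reported validation losses
for epoch, val_loss in [(1, 2.3348), (2, 1.8377), (3, 1.6498)]:
    print(f"epoch {epoch}: perplexity ≈ {math.exp(val_loss):.2f}")
```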
## Framework versions
- Transformers 4.26.1
- PyTorch 1.13.0
- Datasets 2.1.0
- Tokenizers 0.13.2