---
license: apache-2.0
widget:
  - text: अपने अनुप्रयोग को पहुंचनीयता व्यायाम
  - text: जनतंत्र की सफलता केवल इस बात से नहीं हो सकती है कि हर
  - text: अगर इसके बाद भी वे फैसले पर कायम रहते हैं और
  - text: मामले का खुलासा होने के बाद
  - text: My name is Julien and I like to
  - text: My name is Thomas and my main
inference:
  parameters:
    max_length: 200
---

## Model Overview

This is a language generation model that extends GPT-2 to support Hindi in addition to the languages the original model already handles. It was fine-tuned on Hindi Wikipedia articles.
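
A minimal usage sketch with the Transformers text-generation pipeline; the model id `NebulaByte/hindi_gpt2` is assumed from this repository's name, and `max_length=200` mirrors the inference widget setting above.

```python
from transformers import pipeline

# "NebulaByte/hindi_gpt2" is assumed to be the Hub id of this repository.
generator = pipeline("text-generation", model="NebulaByte/hindi_gpt2")

# Continue a Hindi prompt; max_length=200 mirrors the inference widget setting.
outputs = generator("मामले का खुलासा होने के बाद", max_length=200, num_return_sequences=1)
print(outputs[0]["generated_text"])
```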

## Model Architecture and Parameters

The model architecture is based on the GPT-2 framework, specifically using the parameters of the small variant of the original OpenAI GPT-2 model. It employs a byte-level Byte Pair Encoding (BPE) tokenizer.
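
A hedged sketch of how such a setup can be reproduced: GPT-2 small pretrained weights with the token embeddings resized to the merged 53,497-token vocabulary described below. The exact construction used for this model is not documented here, so treat these steps as assumptions.

```python
from transformers import GPT2LMHeadModel

# GPT-2 small: 12 layers, 12 attention heads, 768-dim hidden states.
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Resize the embedding matrix to the merged Hindi + English vocabulary (see Tokenizer section).
model.resize_token_embeddings(53497)
print(model.config.n_layer, model.config.n_head, model.config.n_embd)  # 12 12 768
```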

## Corpus

The training corpus for Hindi GPT2 consists of Wikipedia articles.

## Tokenizer

A new tokenizer was trained on the Hindi Wikipedia corpus, and its vocabulary (5,000 tokens) was merged with the existing GPT-2 tokenizer. Hindi GPT2 uses a byte-level version of Byte Pair Encoding (BPE) to tokenize Hindi text, including Unicode characters. The merged tokenizer has a vocabulary size of 53,497, which allows it to represent the rich vocabulary of Hindi effectively. Input sequences are formed by splitting the text into consecutive blocks of at most 1024 tokens.
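
A hedged sketch of this vocabulary-extension step, assuming the standard `train_new_from_iterator` and `add_tokens` APIs from Transformers; the authors' exact merging procedure may differ, and the tiny corpus below is only a placeholder for Hindi Wikipedia.

```python
from transformers import AutoTokenizer

# Placeholder corpus; in practice this would iterate over Hindi Wikipedia articles.
hindi_texts = [
    "जनतंत्र की सफलता केवल इस बात से नहीं हो सकती है कि हर",
    "मामले का खुलासा होने के बाद",
]

base_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Train a new byte-level BPE vocabulary (target size 5,000) on the Hindi corpus.
hindi_tokenizer = base_tokenizer.train_new_from_iterator(hindi_texts, vocab_size=5000)

# Merge: add the learned Hindi tokens that are missing from the base GPT-2 vocabulary.
new_tokens = [tok for tok in hindi_tokenizer.get_vocab() if tok not in base_tokenizer.get_vocab()]
base_tokenizer.add_tokens(new_tokens)
print(len(base_tokenizer))  # ~53,497 in the released model
```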

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

More information needed

### Training hyperparameters

The following hyperparameters were used during training (a TrainingArguments sketch reproducing these settings follows the list):

- learning_rate: 0.0005
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 256
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 500
- num_epochs: 1
- mixed_precision_training: Native AMP
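
The settings above map directly onto `transformers.TrainingArguments`; the sketch below is an assumed reconstruction (the `output_dir` and the surrounding `Trainer` setup are not part of this card), with the Adam betas and epsilon left at their defaults of (0.9, 0.999) and 1e-08.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="hindi_gpt2",            # assumed output path
    learning_rate=5e-4,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    seed=42,
    gradient_accumulation_steps=4,      # effective total train batch size: 64 * 4 = 256
    lr_scheduler_type="cosine",
    warmup_steps=500,
    num_train_epochs=1,
    fp16=True,                          # native AMP mixed-precision training
)
```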

### Training results

| Step | Training Loss | Validation Loss |
|-----:|--------------:|----------------:|
|  500 | 2.0016        | 1.066703        |
| 1000 | 1.0314        | 0.959653        |
| 1500 | 0.9593        | 0.918827        |
| 2000 | 0.922         | 0.889607        |
| 2500 | 0.8983        | 0.872523        |
| 3000 | 0.8852        | 0.863592        |

### Framework versions

- Transformers 4.30.2
- torch 1.13.1
- Datasets 2.13.1
- Tokenizers 0.13.3