---
license: apache-2.0
widget:
  - text: अपने अनुप्रयोग को पहुंचनीयता व्यायाम
  - text: जनतंत्र की सफलता केवल इस बात से नहीं हो सकती है कि हर
  - text: अगर इसके बाद भी वे फैसले पर कायम रहते हैं और
  - text: मामले का खुलासा होने के बाद
  - text: My name is Julien and I like to
  - text: My name is Thomas and my main
inference:
  parameters:
    max_length: 200
---

## Model Overview

This is a language generation model that extends GPT-2 to support Hindi in addition to the languages the original model already handles. It was fine-tuned on Hindi Wikipedia articles.
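
A minimal usage sketch with the Transformers text-generation pipeline; the model id `NebulaByte/hindi_gpt2` is assumed from this repository's name, and `max_length=200` mirrors the inference widget setting above.

```python
from transformers import pipeline

# "NebulaByte/hindi_gpt2" is assumed to be the Hub id of this repository.
generator = pipeline("text-generation", model="NebulaByte/hindi_gpt2")

# Continue a Hindi prompt; max_length=200 mirrors the inference widget setting.
outputs = generator("मामले का खुलासा होने के बाद", max_length=200, num_return_sequences=1)
print(outputs[0]["generated_text"])
```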

## Model Architecture and Parameters

The model architecture is based on the GPT-2 framework, specifically using the parameters of the small variant of the original OpenAI GPT-2 model. It employs a byte-level Byte Pair Encoding (BPE) tokenizer.
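
A hedged sketch of how such a setup can be reproduced: GPT-2 small pretrained weights with the token embeddings resized to the merged 53,497-token vocabulary described below. The exact construction used for this model is not documented here, so treat these steps as assumptions.

```python
from transformers import GPT2LMHeadModel

# GPT-2 small: 12 layers, 12 attention heads, 768-dim hidden states.
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Resize the embedding matrix to the merged Hindi + English vocabulary (see Tokenizer section).
model.resize_token_embeddings(53497)
print(model.config.n_layer, model.config.n_head, model.config.n_embd)  # 12 12 768
```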

## Corpus

The training corpus for Hindi GPT2 consists of Wikipedia articles.

## Tokenizer

A new tokenizer was trained on the Hindi Wikipedia corpus, and its vocabulary (5,000 tokens) was merged with the existing GPT-2 tokenizer. Hindi GPT2 uses a byte-level version of Byte Pair Encoding (BPE) to tokenize Hindi text, including Unicode characters. The merged tokenizer has a vocabulary size of 53,497, which allows it to represent the rich vocabulary of Hindi effectively. Input sequences are formed by splitting the text into consecutive blocks of at most 1024 tokens.
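
A hedged sketch of this vocabulary-extension step, assuming the standard `train_new_from_iterator` and `add_tokens` APIs from Transformers; the authors' exact merging procedure may differ, and the tiny corpus below is only a placeholder for Hindi Wikipedia.

```python
from transformers import AutoTokenizer

# Placeholder corpus; in practice this would iterate over Hindi Wikipedia articles.
hindi_texts = [
    "जनतंत्र की सफलता केवल इस बात से नहीं हो सकती है कि हर",
    "मामले का खुलासा होने के बाद",
]

base_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Train a new byte-level BPE vocabulary (target size 5,000) on the Hindi corpus.
hindi_tokenizer = base_tokenizer.train_new_from_iterator(hindi_texts, vocab_size=5000)

# Merge: add the learned Hindi tokens that are missing from the base GPT-2 vocabulary.
new_tokens = [tok for tok in hindi_tokenizer.get_vocab() if tok not in base_tokenizer.get_vocab()]
base_tokenizer.add_tokens(new_tokens)
print(len(base_tokenizer))  # ~53,497 in the released model
```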

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

More information needed

### Training hyperparameters

The following hyperparameters were used during training (a TrainingArguments sketch reproducing these settings follows the list):

- learning_rate: 0.0005
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 256
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 500
- num_epochs: 1
- mixed_precision_training: Native AMP
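
The settings above map directly onto `transformers.TrainingArguments`; the sketch below is an assumed reconstruction (the `output_dir` and the surrounding `Trainer` setup are not part of this card), with the Adam betas and epsilon left at their defaults of (0.9, 0.999) and 1e-08.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="hindi_gpt2",            # assumed output path
    learning_rate=5e-4,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    seed=42,
    gradient_accumulation_steps=4,      # effective total train batch size: 64 * 4 = 256
    lr_scheduler_type="cosine",
    warmup_steps=500,
    num_train_epochs=1,
    fp16=True,                          # native AMP mixed-precision training
)
```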

### Training results

| Step | Training Loss | Validation Loss |
|-----:|--------------:|----------------:|
|  500 | 2.0016        | 1.066703        |
| 1000 | 1.0314        | 0.959653        |
| 1500 | 0.9593        | 0.918827        |
| 2000 | 0.922         | 0.889607        |
| 2500 | 0.8983        | 0.872523        |
| 3000 | 0.8852        | 0.863592        |

### Framework versions

- Transformers 4.30.2
- torch 1.13.1
- Datasets 2.13.1
- Tokenizers 0.13.3