Edit model card

distilbert-base-nepali

This model is pre-trained on nepalitext dataset consisting of over 13 million Nepali text sequences using a masked language modeling (MLM) objective. Our approach trains a Sentence Piece Model (SPM) for text tokenization similar to XLM-ROBERTa and trains distilbert model for language modeling. Find more details in this paper.

It achieves the following results on the evaluation set:

mlm probability evaluation loss evaluation perplexity
15% 2.349 10.479
20% 2.605 13.351

Model description

Refer to original distilbert-base-uncased

Intended uses & limitations

This backbone model intends to be fine-tuned on Nepali language focused downstream task such as sequence classification, token classification or question answering. The language model being trained on a data with texts grouped to a block size of 512, it handles text sequence up to 512 tokens and may not perform satisfactorily on shorter sequences.

Usage

This model can be used directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='Sakonii/distilbert-base-nepali')
>>> unmasker("मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।")

[{'score': 0.04128897562623024,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, मौसम, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 2605,
  'token_str': 'मौसम'},
 {'score': 0.04100276157259941,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, प्रकृति, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 2792,
  'token_str': 'प्रकृति'},
 {'score': 0.026525357738137245,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, पानी, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 387,
  'token_str': 'पानी'},
 {'score': 0.02340106852352619,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, जल, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 1313,
  'token_str': 'जल'},
 {'score': 0.02055591531097889,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, वातावरण, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 790,
  'token_str': 'वातावरण'}]

Here is how we can use the model to get the features of a given text in PyTorch:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
model = AutoModelForMaskedLM.from_pretrained('Sakonii/distilbert-base-nepali')

# prepare input
text = "चाहिएको text यता राख्नु होला।"
encoded_input = tokenizer(text, return_tensors='pt')

# forward pass
output = model(**encoded_input)

Training data

This model is trained on nepalitext language modeling dataset which combines the datasets: OSCAR , cc100 and a set of scraped Nepali articles on Wikipedia. As for training the language model, the texts in the training set are grouped to a block of 512 tokens.

Tokenization

A Sentence Piece Model (SPM) is trained on a subset of nepalitext dataset for text tokenization. The tokenizer trained with vocab-size=24576, min-frequency=4, limit-alphabet=1000 and model-max-length=512.

Training procedure

The model is trained with the same configuration as the original distilbert-base-uncased; 512 tokens per instance, 28 instances per batch, and around 35.7K training steps.

Training hyperparameters

The following hyperparameters were used for training of the final epoch: [ Refer to the Training results table below for varying hyperparameters every epoch ]

  • learning_rate: 5e-05
  • train_batch_size: 28
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1
  • mixed_precision_training: Native AMP

Training results

The model is trained for 4 epochs with varying hyperparameters:

Training Loss Epoch MLM Probability Train Batch Size Step Validation Loss Perplexity
3.4477 1.0 15 26 38864 3.3067 27.2949
2.9451 2.0 15 28 35715 2.8238 16.8407
2.866 3.0 20 28 35715 2.7431 15.5351
2.7287 4.0 20 28 35715 2.6053 13.5353
2.6412 5.0 20 28 35715 2.5161 12.3802

Final model evaluated with MLM Probability of 15%:

Training Loss Epoch MLM Probability Train Batch Size Step Validation Loss Perplexity
- - 15 - - 2.3494 10.4791

Framework versions

  • Transformers 4.16.2
  • Pytorch 1.9.1
  • Datasets 1.18.3
  • Tokenizers 0.10.3
Downloads last month
59
Safetensors
Model size
67M params
Tensor type
F32
·

Dataset used to train Sakonii/distilbert-base-nepali