|
--- |
|
license: apache-2.0 |
|
mask_token: "<mask>" |
|
tags: |
|
- generated_from_trainer |
|
model-index: |
|
- name: distilbert-base-nepali |
|
results: [] |
|
widget: |
|
- text: "मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।" |
|
example_title: "Example 1" |
|
- text: "अचेल विद्यालय र कलेजहरूले स्मारिका कत्तिको प्रकाशन गर्छन्, यकिन छैन । केही वर्षपहिलेसम्म गाउँसहरका सानाठूला <mask> संस्थाहरूमा पुग्दा शिक्षक वा कर्मचारीले संस्थाबाट प्रकाशित पत्रिका, स्मारिका र पुस्तक कोसेलीका रूपमा थमाउँथे ।" |
|
example_title: "Example 2" |
|
- text: "जलविद्युत् विकासको ११० वर्षको इतिहास बनाएको नेपालमा हाल सरकारी र निजी क्षेत्रबाट गरी करिब २ हजार मेगावाट <mask> उत्पादन भइरहेको छ ।" |
|
example_title: "Example 3" |
|
--- |
|
|
|
# distilbert-base-nepali |
|
|
|
This model is pre-trained on the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset, which consists of over 13 million Nepali text sequences, using a masked language modeling (MLM) objective. Our approach trains a Sentence Piece Model (SPM) for text tokenization, similar to [XLM-ROBERTa](https://arxiv.org/abs/1911.02116), and trains a [DistilBERT model](https://arxiv.org/abs/1910.01108) for language modeling.
|
|
|
It achieves the following results on the evaluation set: |
|
|
|
| MLM Probability | Evaluation Loss | Evaluation Perplexity |
|----------------:|----------------:|----------------------:|
| 15%             | 2.439           | 11.459                |
| 20%             | 2.605           | 13.351                |
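
Perplexity is typically reported as the exponential of the evaluation (cross-entropy) loss; a minimal check in Python using the 15% row above (small differences from the reported values come from rounding of the loss):

```python
import math

# Perplexity as exp(cross-entropy loss), the usual convention for MLM evaluation.
eval_loss = 2.439
print(math.exp(eval_loss))  # ~11.46
```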
|
|
|
## Model description |
|
|
|
Refer to the original [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) model card for details on the architecture.
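
Since the model is trained with the same configuration as DistilBERT base (see *Training procedure* below), its architecture can be inspected from the published checkpoint; a minimal sketch, where the printed values are expected to follow the standard DistilBERT base setup (6 layers, hidden size 768, 12 attention heads):

```python
from transformers import AutoConfig

# Inspect the architecture of the published checkpoint.
config = AutoConfig.from_pretrained('Sakonii/distilbert-base-nepali')
print(config.n_layers, config.dim, config.n_heads)
```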
|
|
|
## Intended uses & limitations |
|
|
|
This backbone model is intended to be fine-tuned on Nepali-language-focused downstream tasks such as sequence classification, token classification, or question answering.

Because the language model was trained on text grouped into blocks of 512 tokens, it handles text sequences of up to 512 tokens and may not perform satisfactorily on shorter sequences.
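
As a minimal sketch of how the backbone might be loaded for such a downstream task (the sequence classification head and `num_labels=2` are illustrative assumptions, not part of the released model):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the pre-trained backbone with a freshly initialized classification head;
# num_labels=2 is a hypothetical choice for a binary classification task.
tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
model = AutoModelForSequenceClassification.from_pretrained(
    'Sakonii/distilbert-base-nepali', num_labels=2
)
```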
|
|
|
## Usage |
|
|
|
This model can be used directly with a pipeline for masked language modeling: |
|
|
|
```python |
|
>>> from transformers import pipeline |
|
>>> unmasker = pipeline('fill-mask', model='Sakonii/distilbert-base-nepali') |
|
>>> unmasker("मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।") |
|
|
|
[{'score': 0.04128897562623024, |
|
'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, मौसम, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।', |
|
'token': 2605, |
|
'token_str': 'मौसम'}, |
|
{'score': 0.04100276157259941, |
|
'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, प्रकृति, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।', |
|
'token': 2792, |
|
'token_str': 'प्रकृति'}, |
|
{'score': 0.026525357738137245, |
|
'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, पानी, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।', |
|
'token': 387, |
|
'token_str': 'पानी'}, |
|
{'score': 0.02340106852352619, |
|
'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, जल, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।', |
|
'token': 1313, |
|
'token_str': 'जल'}, |
|
{'score': 0.02055591531097889, |
|
'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, वातावरण, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।', |
|
'token': 790, |
|
'token_str': 'वातावरण'}] |
|
``` |
|
|
|
Here is how we can use the model to get the features of a given text in PyTorch: |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForMaskedLM |
|
|
|
tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali') |
|
model = AutoModelForMaskedLM.from_pretrained('Sakonii/distilbert-base-nepali') |
|
|
|
# prepare input |
|
text = "चाहिएको text यता राख्नु होला।" |
|
encoded_input = tokenizer(text, return_tensors='pt') |
|
|
|
# forward pass |
|
output = model(**encoded_input) |
|
``` |
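
The forward pass above returns masked-language-modeling logits; to obtain hidden-state features directly, the base encoder can be loaded with `AutoModel` instead, as in this minimal sketch:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
model = AutoModel.from_pretrained('Sakonii/distilbert-base-nepali')

encoded_input = tokenizer("चाहिएको text यता राख्नु होला।", return_tensors='pt')
output = model(**encoded_input)

# Token-level features: one 768-dimensional vector per input token.
print(output.last_hidden_state.shape)  # (1, sequence_length, 768)
```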
|
|
|
## Training data |
|
|
|
This model is trained on the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) language modeling dataset, which combines the [OSCAR](https://huggingface.co/datasets/oscar) and [cc100](https://huggingface.co/datasets/cc100) datasets with a set of Nepali articles scraped from Wikipedia.
|
For language model training, the texts in the training set are grouped into blocks of 512 tokens.
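
The exact preprocessing script is not reproduced here; the following is a minimal sketch of the standard block-grouping step used in Hugging Face masked language modeling examples, applied to the tokenized dataset with `datasets.Dataset.map`:

```python
from itertools import chain

block_size = 512

def group_texts(examples):
    # Concatenate all tokenized sequences, then split the result into
    # fixed-size blocks of `block_size` tokens, dropping the small remainder.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Hypothetical usage on a tokenized `datasets` dataset:
# lm_dataset = tokenized_dataset.map(group_texts, batched=True)
```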
|
|
|
## Tokenization |
|
|
|
A Sentence Piece Model (SPM) is trained on a subset of the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset for text tokenization. The tokenizer is trained with vocab-size=24576, min-frequency=4, limit-alphabet=1000, and model-max-length=512.
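
The trained tokenizer ships with the model and can be inspected directly; a minimal sketch, where the printed values are expected to reflect the settings above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')

print(tokenizer.vocab_size)        # expected: 24576
print(tokenizer.model_max_length)  # expected: 512

# Subword tokenization of an example sentence.
print(tokenizer.tokenize("चाहिएको text यता राख्नु होला।"))
```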
|
|
|
## Training procedure |
|
|
|
The model is trained with the same configuration as the original [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased); 512 tokens per instance, 28 instances per batch, and around 35.7K training steps. |
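
Assuming no gradient accumulation, this corresponds to roughly 35,715 steps × 28 instances ≈ 1.0 million 512-token training blocks per epoch.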
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used for training the final epoch (refer to the *Training results* table below for the hyperparameters that vary across epochs; a `Trainer` configuration sketch follows the list):
|
- learning_rate: 5e-05 |
|
- train_batch_size: 28 |
|
- eval_batch_size: 8 |
|
- seed: 42 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- num_epochs: 1 |
|
- mixed_precision_training: Native AMP |
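
A minimal sketch of how these hyperparameters might map onto a Hugging Face `Trainer` setup for masked language modeling; the actual training script is not reproduced here, loading the published checkpoint merely stands in for the model being trained, and `train_dataset` / `eval_dataset` are placeholders for the 512-token blocks described above:

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
model = AutoModelForMaskedLM.from_pretrained('Sakonii/distilbert-base-nepali')

# 20% masking probability, matching the final-epoch setting in the Training results table.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.2)

training_args = TrainingArguments(
    output_dir='distilbert-base-nepali',
    learning_rate=5e-5,
    per_device_train_batch_size=28,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type='linear',
    num_train_epochs=1,
    fp16=True,  # Native AMP mixed-precision training
)

# `train_dataset` and `eval_dataset` stand in for the grouped 512-token blocks.
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```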
|
|
|
### Training results |
|
|
|
The model is trained for 4 epochs with varying hyperparameters: |
|
|
|
| Training Loss | Epoch | MLM Probability | Train Batch Size | Step  | Validation Loss | Perplexity |
|:-------------:|:-----:|:---------------:|:----------------:|:-----:|:---------------:|:----------:|
| 3.4477        | 1.0   | 15%             | 26               | 38864 | 3.3067          | 27.2949    |
| 2.9451        | 2.0   | 15%             | 28               | 35715 | 2.8238          | 16.8407    |
| 2.866         | 3.0   | 20%             | 28               | 35715 | 2.7431          | 15.5351    |
| 2.7287        | 4.0   | 20%             | 28               | 35715 | 2.6053          | 13.5353    |
|
|
|
The final model is evaluated with an MLM probability of 15%:
|
|
|
| Training Loss | Epoch | MLM Probability | Train Batch Size | Step | Validation Loss | Perplexity |
|:-------------:|:-----:|:---------------:|:----------------:|:----:|:---------------:|:----------:|
| -             | -     | 15%             | -                | -    | 2.4388          | 11.4589    |
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.16.2 |
|
- Pytorch 1.9.1 |
|
- Datasets 1.18.3 |
|
- Tokenizers 0.10.3 |
|
|