Update README.md

e212baa about 3 years ago

9.22 kB

	---
	license: apache-2.0
	mask_token: "<mask>"
	tags:
	- generated_from_trainer
	model-index:
	- name: distilbert-base-nepali
	results: []
	widget:
	- text: "मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।"
	example_title: "Example 1"
	- text: "अचेल विद्यालय र कलेजहरूले स्मारिका कत्तिको प्रकाशन गर्छन्, यकिन छैन । केही वर्षपहिलेसम्म गाउँसहरका सानाठूला <mask> संस्थाहरूमा पुग्दा शिक्षक वा कर्मचारीले संस्थाबाट प्रकाशित पत्रिका, स्मारिका र पुस्तक कोसेलीका रूपमा थमाउँथे ।"
	example_title: "Example 2"
	- text: "जलविद्युत् विकासको ११० वर्षको इतिहास बनाएको नेपालमा हाल सरकारी र निजी क्षेत्रबाट गरी करिब २ हजार मेगावाट <mask> उत्पादन भइरहेको छ ।"
	example_title: "Example 3"
	---

	# distilbert-base-nepali

	This model is pre-trained on [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset consisting of over 13 million Nepali text sequences using a masked language modeling (MLM) objective. Our approach trains a Sentence Piece Model (SPM) for text tokenization similar to [XLM-ROBERTa](https://arxiv.org/abs/1911.02116) and trains [distilbert model](https://arxiv.org/abs/1910.01108) for language modeling.

	It achieves the following results on the evaluation set:

	mlm probability\|evaluation loss\|evaluation perplexity
	--:\|----:\|-----:\|
	15%\|2.439\|11.459\|
	20%\|2.605\|13.351\|

	## Model description

	Refer to original [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased)

	## Intended uses & limitations

	This backbone model intends to be fine-tuned on Nepali language focused downstream task such as sequence classification, token classification or question answering.
	The language model being trained on a data with texts grouped to a block size of 512, it handles text sequence up to 512 tokens and may not perform satisfactorily on shorter sequences.

	## Usage

	This model can be used directly with a pipeline for masked language modeling:

	```python
	>>> from transformers import pipeline
	>>> unmasker = pipeline('fill-mask', model='Sakonii/distilbert-base-nepali')
	>>> unmasker("मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।")

	[{'score': 0.04128897562623024,
	'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, मौसम, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
	'token': 2605,
	'token_str': 'मौसम'},
	{'score': 0.04100276157259941,
	'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, प्रकृति, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
	'token': 2792,
	'token_str': 'प्रकृति'},
	{'score': 0.026525357738137245,
	'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, पानी, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
	'token': 387,
	'token_str': 'पानी'},
	{'score': 0.02340106852352619,
	'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, जल, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
	'token': 1313,
	'token_str': 'जल'},
	{'score': 0.02055591531097889,
	'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, वातावरण, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
	'token': 790,
	'token_str': 'वातावरण'}]
	```

	Here is how we can use the model to get the features of a given text in PyTorch:

	```python
	from transformers import AutoTokenizer, AutoModelForMaskedLM

	tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
	model = AutoModelForMaskedLM.from_pretrained('Sakonii/distilbert-base-nepali')

	# prepare input
	text = "चाहिएको text यता राख्नु होला।"
	encoded_input = tokenizer(text, return_tensors='pt')

	# forward pass
	output = model(**encoded_input)
	```

	## Training data

	This model is trained on [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) language modeling dataset which combines the datasets: [OSCAR](https://huggingface.co/datasets/oscar) , [cc100](https://huggingface.co/datasets/cc100) and a set of scraped Nepali articles on Wikipedia.
	As for training the language model, the texts in the training set are grouped to a block of 512 tokens.

	## Tokenization

	A Sentence Piece Model (SPM) is trained on a subset of [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset for text tokenization. The tokenizer trained with vocab-size=24576, min-frequency=4, limit-alphabet=1000 and model-max-length=512.

	## Training procedure

	The model is trained with the same configuration as the original [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased); 512 tokens per instance, 28 instances per batch, and around 35.7K training steps.

	### Training hyperparameters

	The following hyperparameters were used for training of the final epoch: [ Refer to the Training results table below for varying hyperparameters every epoch ]
	- learning_rate: 5e-05
	- train_batch_size: 28
	- eval_batch_size: 8
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 1
	- mixed_precision_training: Native AMP

	### Training results

	The model is trained for 4 epochs with varying hyperparameters:

	\| Training Loss \| Epoch \| MLM Probability \| Train Batch Size \| Step \| Validation Loss \| Perplexity \|
	\|:-------------:\|:-----:\|:---------------:\|:----------------:\|:-----:\|:---------------:\|:----------:\|
	\| 3.4477 \| 1.0 \| 15 \| 26 \| 38864 \| 3.3067 \| 27.2949 \|
	\| 2.9451 \| 2.0 \| 15 \| 28 \| 35715 \| 2.8238 \| 16.8407 \|
	\| 2.866 \| 3.0 \| 20 \| 28 \| 35715 \| 2.7431 \| 15.5351 \|
	\| 2.7287 \| 4.0 \| 20 \| 28 \| 35715 \| 2.6053 \| 13.5353 \|

	Final model evaluated with MLM Probability of 15%:

	\| Training Loss \| Epoch \| MLM Probability \| Train Batch Size \| Step \| Validation Loss \| Perplexity \|
	\|:-------------:\|:-----:\|:---------------:\|:----------------:\|:-----:\|:---------------:\|:----------:\|
	\| - \| - \| 15 \| - \| - \| 2.4388 \| 11.4589 \|


	### Framework versions

	- Transformers 4.16.2
	- Pytorch 1.9.1
	- Datasets 1.18.3
	- Tokenizers 0.10.3