Update README.md
## Intended uses & limitations
This backbone model is intended to be fine-tuned on Nepali language focused downstream tasks such as sequence classification, token classification or question answering.

As the language model is trained on texts grouped into blocks of 512 tokens, it handles text sequences of up to 512 tokens and may not perform satisfactorily on shorter sequences.

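As a minimal sketch of that fine-tuning workflow (the repository id and label count below are placeholders, not values from this card), the backbone can be loaded with a task-specific head from `transformers`:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder repository id; substitute the actual checkpoint name.
model_id = "<nepali-distilbert-checkpoint>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Loads the pre-trained backbone and attaches a randomly initialized
# sequence classification head (num_labels=2 is a placeholder).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
```
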
## Usage
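A minimal sketch, assuming the model is loaded for masked language modeling with the standard `transformers` Auto classes; the repository id below is a placeholder:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "<nepali-distilbert-checkpoint>"  # placeholder repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Encode a Nepali sentence containing the mask token and run a forward pass.
text = f"मलाई नेपाली भाषा {tokenizer.mask_token} लाग्छ ।"
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)
```
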
## Training data
This model is trained on the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) language modeling dataset, which combines the [OSCAR](https://huggingface.co/datasets/oscar) and [cc100](https://huggingface.co/datasets/cc100) datasets with a set of scraped Nepali articles from Wikipedia.

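For illustration, the combined dataset can be loaded from the Hub with the `datasets` library (the split name is an assumption):

```python
from datasets import load_dataset

# Load the combined Nepali language modeling dataset.
dataset = load_dataset("Sakonii/nepalitext-language-model-dataset", split="train")
print(dataset)
```
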
For training the language model, the texts in the training set are grouped into blocks of 512 tokens.

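A sketch of this grouping step, along the lines of the standard language-modeling preprocessing in the `transformers` examples (illustrative, not the exact preprocessing code used here):

```python
block_size = 512

def group_texts(examples):
    # Concatenate all tokenized sequences, then split the result into
    # fixed-size blocks of `block_size` tokens, dropping the remainder.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Applied to a tokenized dataset with, e.g.:
# lm_dataset = tokenized_dataset.map(group_texts, batched=True)
```
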
## Tokenization
A SentencePiece Model (SPM) is trained on a subset of the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset for text tokenization. The tokenizer is trained with vocab-size=24576, min-frequency=4, limit-alphabet=1000 and model-max-length=512.

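A rough sketch of training such a tokenizer with the `tokenizers` library (the trainer class, file path and special tokens are assumptions; the original training script may differ):

```python
from tokenizers import SentencePieceBPETokenizer
from transformers import PreTrainedTokenizerFast

# Train a SentencePiece-style tokenizer on a local dump of the dataset subset.
tokenizer = SentencePieceBPETokenizer()
tokenizer.train(
    files=["nepalitext_subset.txt"],  # assumed local text file
    vocab_size=24576,
    min_frequency=4,
    limit_alphabet=1000,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
)
tokenizer.save("tokenizer.json")

# Wrap for use with transformers, setting model-max-length=512.
hf_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json", model_max_length=512)
```
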
## Training procedure
The model is trained with the same configuration as the original [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased); 512 tokens per instance, 28 instances per batch, and around 35.7K training steps.
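
As an illustrative sketch only (values other than the batch size are placeholders, not the card's actual settings), such a run could be configured with the `transformers` Trainer:

```python
from transformers import TrainingArguments

# Illustrative configuration; only the batch size reflects the description above.
training_args = TrainingArguments(
    output_dir="distilbert-base-nepali-mlm",  # placeholder output directory
    per_device_train_batch_size=28,           # 28 instances of 512 tokens per batch
    num_train_epochs=1,                       # placeholder; see the per-epoch table below
    save_steps=5000,                          # placeholder
)
```
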
### Training hyperparameters
The following hyperparameters were used for training the final epoch: [Refer to the *Training results* table below for the varying hyperparameters of each epoch]