Update README.md
## Intended uses & limitations
This backbone model is intended to be fine-tuned on Nepali language focused downstream tasks such as sequence classification, token classification or question answering.

As the language model is trained on texts grouped into blocks of 512 tokens, it handles text sequences of up to 512 tokens and may not perform satisfactorily on shorter sequences.

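As a minimal sketch of that fine-tuning workflow (the repository id and label count below are placeholders, not values from this card), the backbone can be loaded with a task-specific head from `transformers`:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder repository id; substitute the actual checkpoint name.
model_id = "<nepali-distilbert-checkpoint>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Loads the pre-trained backbone and attaches a randomly initialized
# sequence classification head (num_labels=2 is a placeholder).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
```
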
## Usage
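A minimal sketch, assuming the model is loaded for masked language modeling with the standard `transformers` Auto classes; the repository id below is a placeholder:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "<nepali-distilbert-checkpoint>"  # placeholder repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Encode a Nepali sentence containing the mask token and run a forward pass.
text = f"मलाई नेपाली भाषा {tokenizer.mask_token} लाग्छ ।"
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)
```
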
## Training data
This model is trained on the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) language modeling dataset, which combines the [OSCAR](https://huggingface.co/datasets/oscar) and [cc100](https://huggingface.co/datasets/cc100) datasets with a set of scraped Nepali articles from Wikipedia.

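For illustration, the combined dataset can be loaded from the Hub with the `datasets` library (the split name is an assumption):

```python
from datasets import load_dataset

# Load the combined Nepali language modeling dataset.
dataset = load_dataset("Sakonii/nepalitext-language-model-dataset", split="train")
print(dataset)
```
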
For training the language model, the texts in the training set are grouped into blocks of 512 tokens.

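A sketch of this grouping step, along the lines of the standard language-modeling preprocessing in the `transformers` examples (illustrative, not the exact preprocessing code used here):

```python
block_size = 512

def group_texts(examples):
    # Concatenate all tokenized sequences, then split the result into
    # fixed-size blocks of `block_size` tokens, dropping the remainder.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Applied to a tokenized dataset with, e.g.:
# lm_dataset = tokenized_dataset.map(group_texts, batched=True)
```
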
## Tokenization
A SentencePiece Model (SPM) is trained on a subset of the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset for text tokenization. The tokenizer is trained with vocab-size=24576, min-frequency=4, limit-alphabet=1000 and model-max-length=512.

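A rough sketch of training such a tokenizer with the `tokenizers` library (the trainer class, file path and special tokens are assumptions; the original training script may differ):

```python
from tokenizers import SentencePieceBPETokenizer
from transformers import PreTrainedTokenizerFast

# Train a SentencePiece-style tokenizer on a local dump of the dataset subset.
tokenizer = SentencePieceBPETokenizer()
tokenizer.train(
    files=["nepalitext_subset.txt"],  # assumed local text file
    vocab_size=24576,
    min_frequency=4,
    limit_alphabet=1000,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
)
tokenizer.save("tokenizer.json")

# Wrap for use with transformers, setting model-max-length=512.
hf_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json", model_max_length=512)
```
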
## Training procedure
The model is trained with the same configuration as the original [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased); 512 tokens per instance, 28 instances per batch, and around 35.7K training steps.
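
As an illustrative sketch only (values other than the batch size are placeholders, not the card's actual settings), such a run could be configured with the `transformers` Trainer:

```python
from transformers import TrainingArguments

# Illustrative configuration; only the batch size reflects the description above.
training_args = TrainingArguments(
    output_dir="distilbert-base-nepali-mlm",  # placeholder output directory
    per_device_train_batch_size=28,           # 28 instances of 512 tokens per batch
    num_train_epochs=1,                       # placeholder; see the per-epoch table below
    save_steps=5000,                          # placeholder
)
```
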
### Training hyperparameters
The following hyperparameters were used for training the final epoch: [Refer to the *Training results* table below for the varying hyperparameters of each epoch]