Fill-Mask · Transformers · PyTorch · Safetensors · distilbert · generated_from_trainer · Inference Endpoints
Sakonii committed
Commit e212baa
1 Parent(s): 7f74a44

Update README.md

Files changed (1): README.md +9 -1
README.md CHANGED
@@ -32,7 +32,8 @@ Refer to original [distilbert-base-uncased](https://huggingface.co/distilbert-ba

## Intended uses & limitations

- This backbone model intends to be fine-tuned on Nepali language focused downstream task such as sequence classification, token classification or question answering.
+ This backbone model is intended to be fine-tuned on Nepali-language-focused downstream tasks such as sequence classification, token classification, or question answering.
+ As the language model is trained on data with texts grouped into blocks of 512 tokens, it handles text sequences of up to 512 tokens and may not perform satisfactorily on shorter sequences.

## Usage

@@ -84,9 +85,16 @@ output = model(**encoded_input)
## Training data

This model is trained on the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) language modeling dataset, which combines the datasets [OSCAR](https://huggingface.co/datasets/oscar), [cc100](https://huggingface.co/datasets/cc100) and a set of scraped Nepali articles on Wikipedia.
+ For training the language model, the texts in the training set are grouped into blocks of 512 tokens.
+
+ ## Tokenization
+
+ A SentencePiece model (SPM) is trained on a subset of the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset for text tokenization. The tokenizer is trained with vocab-size=24576, min-frequency=4, limit-alphabet=1000 and model-max-length=512.

## Training procedure

+ The model is trained with the same configuration as the original [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased): 512 tokens per instance, 28 instances per batch, and around 35.7K training steps.
+
### Training hyperparameters

The following hyperparameters were used for training of the final epoch: [refer to the *Training results* table below for the varying hyperparameters of every epoch]
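
The added "Intended uses & limitations" lines describe fine-tuning this backbone on Nepali downstream tasks. As a minimal sketch of that workflow (not code from this commit), the snippet below loads the checkpoint with a sequence-classification head via `transformers`; the repo id `Sakonii/distilbert-base-nepali` and `num_labels=2` are illustrative assumptions.

```python
# Hedged sketch of the fine-tuning use described in "Intended uses & limitations":
# loading the backbone with a sequence-classification head. The repo id
# "Sakonii/distilbert-base-nepali" and num_labels=2 are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "Sakonii/distilbert-base-nepali"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Respect the 512-token limit noted in the card when encoding inputs.
inputs = tokenizer(
    "नेपाल हिमालयको देश हो।",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2]); the head is untrained until fine-tuned
```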
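
The new "Training data" sentence says the texts are grouped into blocks of 512 tokens. The commit does not include the preprocessing code; one common way to do this, sketched here under that assumption, is the concatenate-and-chunk helper used in the Hugging Face language-modeling examples.

```python
# Hedged sketch of grouping tokenized texts into fixed 512-token blocks, in
# the style of the Hugging Face language-modeling examples; an assumption
# about the preprocessing, not code taken from this commit.
block_size = 512

def group_texts(examples):
    # Concatenate every tokenized sequence in the batch end to end.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the tail so the length is an exact multiple of block_size.
    total_length = (total_length // block_size) * block_size
    # Split the long stream into contiguous blocks of block_size tokens.
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Typically applied after tokenization with datasets.Dataset.map:
#   lm_dataset = tokenized_nepalitext.map(group_texts, batched=True)
# For masked-language modeling, the labels are added later by the data collator.
```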
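
The added "Tokenization" section gives the tokenizer hyperparameters (vocab-size=24576, min-frequency=4, limit-alphabet=1000, model-max-length=512) but not the training script. Below is one plausible sketch using the `tokenizers` library's SentencePiece-style BPE implementation; the BPE variant, the corpus file name, and the special tokens are assumptions, since the card only says an SPM was trained on a nepalitext subset.

```python
# Hedged sketch: training a SentencePiece-style tokenizer with the named
# hyperparameters. The BPE variant, the corpus path, and the special tokens
# are assumptions not stated in this commit.
from tokenizers import SentencePieceBPETokenizer

tokenizer = SentencePieceBPETokenizer()
tokenizer.train(
    files=["nepalitext_subset.txt"],  # assumed local dump of the nepalitext subset
    vocab_size=24576,
    min_frequency=4,
    limit_alphabet=1000,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],  # illustrative choice
)
tokenizer.save("nepali-spm-tokenizer.json")
# model-max-length=512 is enforced later, when wrapping this tokenizer in a
# transformers PreTrainedTokenizerFast with model_max_length=512.
```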
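
The new "Training procedure" line summarizes the configuration (512 tokens per instance, 28 instances per batch, roughly 35.7K steps). A hedged sketch of such an MLM run with the `transformers` Trainer follows; the tokenizer file is the one produced by the sketch above, and the learning rate, masking probability, toy dataset, and output path are placeholders rather than values from this commit.

```python
# Hedged sketch of the MLM pre-training setup summarized above: a DistilBERT
# model trained from scratch on 512-token blocks, 28 instances per batch.
# Learning rate, masking probability, the toy dataset, and paths are
# illustrative placeholders.
from datasets import Dataset
from transformers import (
    DataCollatorForLanguageModeling,
    DistilBertConfig,
    DistilBertForMaskedLM,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="nepali-spm-tokenizer.json",  # from the tokenizer sketch above
    model_max_length=512,
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
)

# Same architecture family as distilbert-base-uncased, resized to the new vocabulary.
config = DistilBertConfig(vocab_size=tokenizer.vocab_size)
model = DistilBertForMaskedLM(config)

# Stand-in for the 512-token nepalitext blocks from the grouping sketch above.
lm_dataset = Dataset.from_dict({"input_ids": [list(range(512))] * 8})

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="distilbert-nepali-mlm",  # placeholder
    per_device_train_batch_size=28,      # 28 instances per batch, as stated in the card
    max_steps=10,                        # the actual run used roughly 35.7K steps
    learning_rate=5e-5,                  # placeholder; per-epoch values sit in the results table
)

Trainer(model=model, args=args, train_dataset=lm_dataset, data_collator=collator).train()
```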