manhtt-079 committed on
Commit 2fa85e3
1 Parent(s): 3af6236

update README

Files changed (1): README.md (+6 -6)
README.md CHANGED
@@ -24,7 +24,7 @@ tags:
## Model variations

## How to use
- You can use this model directly with a pipeline for masked language modeling:
+ You can use this model directly with a pipeline for masked language modeling:<br>
**_NOTE:_** The input text should already be word-segmented; you can use [Pyvi](https://github.com/trungtv/pyvi) (Python Vietnamese Core NLP Toolkit) to segment words before passing them to the model.
```python
>>> from transformers import pipeline
@@ -75,13 +75,13 @@ model_inputs = tokenizer(text, return_tensors='tf')
outputs = model(**model_inputs)
```

- ## Training data
- The ViPubMedDeBERTa model was pretrained on [ViPubmed](https://github.com/vietai/ViPubmed), a dataset consisting of 20M Vietnamese Biomedical abstracts generated by large scale translation.
+ ## Pre-training data
+ The ViPubMedDeBERTa model was pre-trained on [ViPubmed](https://github.com/vietai/ViPubmed), a dataset consisting of 20M Vietnamese biomedical abstracts generated by large-scale translation.

## Training procedure
- ### Preprocessing
-
+ ### Data deduplication
+ Fuzzy deduplication, targeting documents with high overlap, was performed at the document level to improve data quality and reduce overfitting. Locality-Sensitive Hashing (LSH) with a threshold of 0.9 was used to remove documents whose overlap exceeded 90%, reducing the dataset size by about 3% on average.
### Pretraining
- We employ our model based on the [ViDeBERTa](https://github.com/HySonLab/ViDeBERTa) and leverage its checkpoint to continue pretraining. Our model was trained on a A100 GPU (40GB) for 220 thousand steps with `batch_size` of 24 and `gradient_accumulation_steps` is 4 (total of 96). The sequence length was limited to 512 tokens. The model peak learning rate of 1e-4.
+ We base our model on the [ViDeBERTa](https://github.com/HySonLab/ViDeBERTa) architecture and continue pre-training from its checkpoint. The model was trained on a single A100 GPU (40GB) for 220 thousand steps with a batch size of 24 and gradient accumulation steps of 4 (an effective batch size of 96). The sequence length was limited to 512 tokens, and the peak learning rate was 1e-4.

## Evaluation results
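The NOTE in the diff above says the input must be word-segmented with Pyvi before it reaches the fill-mask pipeline. A minimal sketch of that flow, assuming Pyvi is installed; the model id is a placeholder, not the actual checkpoint name of this repository:

```python
from pyvi import ViTokenizer              # Vietnamese word segmentation, as the NOTE recommends
from transformers import pipeline

# Placeholder id -- replace with the actual checkpoint name of this repository.
model_id = "manhtt-079/vipubmed-deberta-base"
fill_mask = pipeline("fill-mask", model=model_id)

# Segment the raw text first; Pyvi joins the syllables of multi-syllable words with "_".
raw_text = "Bệnh nhân được chẩn đoán mắc bệnh"
segmented = ViTokenizer.tokenize(raw_text)

# Append the model's mask token and query the pipeline.
masked_input = f"{segmented} {fill_mask.tokenizer.mask_token} ."
for prediction in fill_mask(masked_input):
    print(prediction["token_str"], round(prediction["score"], 4))
```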
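The new "Data deduplication" paragraph describes document-level fuzzy deduplication with LSH at a 0.9 threshold, but the commit does not include the deduplication code. A rough sketch of that idea, using the datasketch library as an assumed tool (not necessarily the authors' implementation):

```python
from datasketch import MinHash, MinHashLSH

def minhash_signature(doc: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from a document's 3-word shingles."""
    m = MinHash(num_perm=num_perm)
    words = doc.split()
    for shingle in zip(words, words[1:], words[2:]):
        m.update(" ".join(shingle).encode("utf-8"))
    return m

def deduplicate(docs):
    """Keep a document only if no previously kept document overlaps it above the threshold."""
    lsh = MinHashLSH(threshold=0.9, num_perm=128)   # 0.9 ~ remove >90% overlap
    kept = []
    for i, doc in enumerate(docs):
        sig = minhash_signature(doc)
        if lsh.query(sig):            # a near-duplicate was already kept
            continue
        lsh.insert(str(i), sig)
        kept.append(doc)
    return kept
```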
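The "Pretraining" paragraph reports the main hyper-parameters (220k steps, batch size 24 with gradient accumulation of 4, sequence length 512, peak learning rate 1e-4). As a sketch only, since the training script is not part of this commit, they could map onto `transformers.TrainingArguments` like this; the output directory is a placeholder:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vipubmed-deberta-pretraining",  # placeholder path
    max_steps=220_000,                          # 220 thousand steps
    per_device_train_batch_size=24,             # batch size 24 on a single A100 (40GB)
    gradient_accumulation_steps=4,              # effective batch size 24 * 4 = 96
    learning_rate=1e-4,                         # peak learning rate
)
# The 512-token limit is enforced at tokenization time, e.g.
# tokenizer(text, truncation=True, max_length=512).
```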