manhtt-079 committed on
Commit 2fa85e3
1 Parent(s): 3af6236

update README

Files changed (1): README.md (+6 -6)
README.md CHANGED
@@ -24,7 +24,7 @@ tags:
## Model variations

## How to use
- You can use this model directly with a pipeline for masked language modeling:
+ You can use this model directly with a pipeline for masked language modeling:<br>
**_NOTE:_** The input text should already be word-segmented; you can use [Pyvi](https://github.com/trungtv/pyvi) (Python Vietnamese Core NLP Toolkit) to segment words before passing them to the model.
```python
>>> from transformers import pipeline
@@ -75,13 +75,13 @@ model_inputs = tokenizer(text, return_tensors='tf')
outputs = model(**model_inputs)
```

- ## Training data
- The ViPubMedDeBERTa model was pretrained on [ViPubmed](https://github.com/vietai/ViPubmed), a dataset consisting of 20M Vietnamese Biomedical abstracts generated by large scale translation.
+ ## Pre-training data
+ The ViPubMedDeBERTa model was pre-trained on [ViPubmed](https://github.com/vietai/ViPubmed), a dataset consisting of 20M Vietnamese biomedical abstracts generated by large-scale translation.

## Training procedure
- ### Preprocessing
-
+ ### Data deduplication
+ Fuzzy deduplication, targeting documents with high overlap, was performed at the document level to improve data quality and reduce overfitting. Locality-Sensitive Hashing (LSH) with a threshold of 0.9 was used to remove documents whose overlap exceeded 90%, reducing the dataset size by about 3% on average.
### Pretraining
- We employ our model based on the [ViDeBERTa](https://github.com/HySonLab/ViDeBERTa) and leverage its checkpoint to continue pretraining. Our model was trained on a A100 GPU (40GB) for 220 thousand steps with `batch_size` of 24 and `gradient_accumulation_steps` is 4 (total of 96). The sequence length was limited to 512 tokens. The model peak learning rate of 1e-4.
+ We base our model on the [ViDeBERTa](https://github.com/HySonLab/ViDeBERTa) architecture and continue pre-training from its checkpoint. The model was trained on a single A100 GPU (40GB) for 220 thousand steps with a batch size of 24 and gradient accumulation steps of 4 (an effective batch size of 96). The sequence length was limited to 512 tokens, and the peak learning rate was 1e-4.

## Evaluation results
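The NOTE in the diff above says the input must be word-segmented with Pyvi before it reaches the fill-mask pipeline. A minimal sketch of that flow, assuming Pyvi is installed; the model id is a placeholder, not the actual checkpoint name of this repository:

```python
from pyvi import ViTokenizer              # Vietnamese word segmentation, as the NOTE recommends
from transformers import pipeline

# Placeholder id -- replace with the actual checkpoint name of this repository.
model_id = "manhtt-079/vipubmed-deberta-base"
fill_mask = pipeline("fill-mask", model=model_id)

# Segment the raw text first; Pyvi joins the syllables of multi-syllable words with "_".
raw_text = "Bệnh nhân được chẩn đoán mắc bệnh"
segmented = ViTokenizer.tokenize(raw_text)

# Append the model's mask token and query the pipeline.
masked_input = f"{segmented} {fill_mask.tokenizer.mask_token} ."
for prediction in fill_mask(masked_input):
    print(prediction["token_str"], round(prediction["score"], 4))
```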
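The new "Data deduplication" paragraph describes document-level fuzzy deduplication with LSH at a 0.9 threshold, but the commit does not include the deduplication code. A rough sketch of that idea, using the datasketch library as an assumed tool (not necessarily the authors' implementation):

```python
from datasketch import MinHash, MinHashLSH

def minhash_signature(doc: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from a document's 3-word shingles."""
    m = MinHash(num_perm=num_perm)
    words = doc.split()
    for shingle in zip(words, words[1:], words[2:]):
        m.update(" ".join(shingle).encode("utf-8"))
    return m

def deduplicate(docs):
    """Keep a document only if no previously kept document overlaps it above the threshold."""
    lsh = MinHashLSH(threshold=0.9, num_perm=128)   # 0.9 ~ remove >90% overlap
    kept = []
    for i, doc in enumerate(docs):
        sig = minhash_signature(doc)
        if lsh.query(sig):            # a near-duplicate was already kept
            continue
        lsh.insert(str(i), sig)
        kept.append(doc)
    return kept
```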
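The "Pretraining" paragraph reports the main hyper-parameters (220k steps, batch size 24 with gradient accumulation of 4, sequence length 512, peak learning rate 1e-4). As a sketch only, since the training script is not part of this commit, they could map onto `transformers.TrainingArguments` like this; the output directory is a placeholder:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vipubmed-deberta-pretraining",  # placeholder path
    max_steps=220_000,                          # 220 thousand steps
    per_device_train_batch_size=24,             # batch size 24 on a single A100 (40GB)
    gradient_accumulation_steps=4,              # effective batch size 24 * 4 = 96
    learning_rate=1e-4,                         # peak learning rate
)
# The 512-token limit is enforced at tokenization time, e.g.
# tokenizer(text, truncation=True, max_length=512).
```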