manhtt-079 committed
Commit: 2fa85e3
Parent(s): 3af6236
update README
README.md CHANGED
@@ -24,7 +24,7 @@ tags:
 ## Model variations
 
 ## How to use
-You can use this model directly with a pipeline for masked language modeling
+You can use this model directly with a pipeline for masked language modeling:<br>
 **_NOTE:_** The input text should already be word-segmented; you can use [Pyvi](https://github.com/trungtv/pyvi) (Python Vietnamese Core NLP Toolkit) to segment words before passing them to the model.
 ```python
 >>> from transformers import pipeline
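The diff context above shows only the first line of the README's usage snippet. As a rough, non-authoritative sketch of the fill-mask usage it describes, the example below combines Pyvi word segmentation with the `transformers` pipeline; the checkpoint id `manhtt-079/ViPubMedDeBERTa` and the example sentence are placeholders, not taken from the README.

```python
# Minimal sketch of the usage described above (not the README's exact snippet).
from pyvi import ViTokenizer          # Vietnamese word segmentation
from transformers import pipeline

# Placeholder model id (assumption): replace with the actual checkpoint name.
checkpoint = "manhtt-079/ViPubMedDeBERTa"

fill_mask = pipeline("fill-mask", model=checkpoint)

# Word-segment the input first (Pyvi joins multi-syllable words with "_"),
# then append the mask token where a word should be predicted.
prefix = ViTokenizer.tokenize("Bệnh nhân được chẩn đoán mắc bệnh")
masked = f"{prefix} {fill_mask.tokenizer.mask_token} ."

for prediction in fill_mask(masked):
    print(prediction["token_str"], prediction["score"])
```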
@@ -75,13 +75,13 @@ model_inputs = tokenizer(text, return_tensors='tf')
 outputs = model(**model_inputs)
 ```
 
-##
-The ViPubMedDeBERTa model was
+## Pre-training data
+The ViPubMedDeBERTa model was pre-trained on [ViPubmed](https://github.com/vietai/ViPubmed), a dataset of 20M Vietnamese biomedical abstracts generated by large-scale translation.
 
 ## Training procedure
-###
-
+### Data deduplication
+Fuzzy deduplication targeting documents with high overlap was performed at the document level to improve data quality and reduce overfitting. Locality Sensitive Hashing (LSH) with a threshold of 0.9 removed documents whose overlap exceeded 90%, shrinking the dataset by about 3% on average.
 ### Pretraining
-We employ our model based on the [ViDeBERTa](https://github.com/HySonLab/ViDeBERTa) and leverage its checkpoint to continue
+We base our model on the [ViDeBERTa](https://github.com/HySonLab/ViDeBERTa) architecture and leverage its pre-trained checkpoint to continue pre-training. The model was trained on a single A100 GPU (40 GB) for 220 thousand steps with a batch size of 24 and 4 gradient accumulation steps (an effective batch size of 96), a maximum sequence length of 512 tokens, and a peak learning rate of 1e-4.
 
 ## Evaluation results
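The deduplication step added in the hunk above is described only at a high level. The sketch below shows one way document-level fuzzy deduplication with MinHash LSH at a 0.9 threshold could look; the `datasketch` library and the word 5-gram shingling are assumptions for illustration, since the README does not say which tooling was used.

```python
# Illustrative document-level fuzzy deduplication with MinHash LSH.
# Assumption: the `datasketch` library; the README does not name the tooling.
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128
lsh = MinHashLSH(threshold=0.9, num_perm=NUM_PERM)  # drop >90% overlap

def minhash(text: str) -> MinHash:
    """Build a MinHash signature from word 5-gram shingles."""
    words = text.split()
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(len(words) - 4, 1)):
        shingle = " ".join(words[i:i + 5])
        m.update(shingle.encode("utf8"))
    return m

def deduplicate(docs: list[str]) -> list[str]:
    """Keep a document only if no previously kept document is ~90% similar."""
    kept = []
    for idx, doc in enumerate(docs):
        sig = minhash(doc)
        if lsh.query(sig):          # a near-duplicate was already kept
            continue
        lsh.insert(f"doc-{idx}", sig)
        kept.append(doc)
    return kept
```

Querying before inserting keeps the first occurrence of each near-duplicate cluster; the actual reduction achieved depends on the shingling scheme and permutation count, so the reported ~3% figure is not guaranteed by this sketch.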
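For concreteness, the continued pre-training setup described in the updated README (a single A100 40 GB, 220k steps, batch size 24 with 4 gradient-accumulation steps, 512-token sequences, peak learning rate 1e-4) could be expressed roughly as the Hugging Face `Trainer` configuration below. This is a hedged sketch, not the authors' training script; the base checkpoint id, the 15% masking ratio, and the warmup ratio are assumptions.

```python
# Rough sketch of the continued pre-training recipe described in the README.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_checkpoint = "Fsoft-AIC/videberta-base"  # assumed ViDeBERTa checkpoint id

tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(base_checkpoint)

# Assumption: standard 15% masking; the README does not state the ratio.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="vipubmed-deberta",
    max_steps=220_000,                  # 220k update steps
    per_device_train_batch_size=24,     # batch size 24 on one A100 40 GB
    gradient_accumulation_steps=4,      # effective batch size 96
    learning_rate=1e-4,                 # peak learning rate
    warmup_ratio=0.06,                  # assumption: not stated in the README
    save_steps=10_000,
)

# `tokenized_dataset` would be the deduplicated ViPubmed abstracts,
# tokenized and truncated/grouped to 512 tokens.
# trainer = Trainer(model=model, args=args, train_dataset=tokenized_dataset,
#                   data_collator=collator)
# trainer.train()
```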