Update README.md
README.md
CHANGED
@@ -1,7 +1,9 @@
|
|
1 |
-
# `wangchanberta-base-wiki-spm`
|
2 |
|
3 |
-
|
4 |
|
|
|
|
|
5 |
<br>
|
6 |
|
7 |
## Model description
|
@@ -62,6 +64,8 @@ You can use the pretrained model for masked language modeling (i.e. predicting a
|
|
62 |
|
63 |
<br>
|
64 |
|
|
|
|
|
65 |
<br>
|
66 |
|
67 |
## Training data
|
@@ -84,12 +88,22 @@ Regarding the vocabulary, we use subword tokens trained with [SentencePiece](https
|
|
84 |
|
85 |
We sample sentences contiguously so that each sequence is at most 512 tokens long. For sentences that overlap the 512-token boundary, we split the sentence with an additional token as a document separator. This is the same approach as proposed by [[Liu et al., 2019]](https://arxiv.org/abs/1907.11692) (called "FULL-SENTENCES").
|
86 |
|
87 |
-
|
88 |
|
|
|
89 |
|
90 |
**Train/Val/Test splits**
|
91 |
|
92 |
We split the data sequentially into 944,782 sentences for the training set, 24,863 sentences for the validation set and 24,862 sentences for the test set.
|
93 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
94 |
|
95 |
-
|
|
|
1 |
+
# WangchanBERTa base model: `wangchanberta-base-wiki-spm`
|
2 |
|
3 |
+
<br>
|
4 |
|
5 |
+
RoBERTa BASE model pretrained on the Thai Wikipedia corpus.
|
6 |
+
The script and documentation can be found at [this repository](https://github.com/vistec-AI/thai2transformers).
|
7 |
<br>
|
8 |
|
9 |
## Model description
|
|
|
64 |
|
65 |
<br>
|
66 |
|
67 |
+
The getting-started notebook for the WangchanBERTa model can be found at this [Colab notebook](https://colab.research.google.com/drive/15GdAgMcTxhWfDia0OB76F1l0fKZ059Ez?usp=sharing#scrollTo=eVlb6CEDmYi3).
|
68 |
+
|
69 |
<br>
|
70 |
|
71 |
## Training data
|
|
|
88 |
|
89 |
We sample sentences contiguously so that each sequence is at most 512 tokens long. For sentences that overlap the 512-token boundary, we split the sentence with an additional token as a document separator. This is the same approach as proposed by [[Liu et al., 2019]](https://arxiv.org/abs/1907.11692) (called "FULL-SENTENCES").
|
90 |
|
91 |
+
Regarding the masking procedure, for each sequence, we sample 15% of the tokens as prediction targets. Of these sampled tokens, 80% are replaced with a `<mask>` token, 10% are left unchanged and 10% are replaced with a random token.
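
The masking procedure can be sketched in plain Python as follows. This is a minimal illustration, not the training code from the repository: the function name `mask_tokens`, the `IGNORE_INDEX` label value, and the `mask_id` and `vocab_size` values in the usage example are all hypothetical. The sketch also assumes that exactly 15% of positions (rounded) are sampled per sequence and that the random replacement is drawn uniformly from the whole vocabulary.

```python
import random

# Assumed label value for positions the model is not asked to predict.
IGNORE_INDEX = -100


def mask_tokens(token_ids, mask_id, vocab_size, rng, select_prob=0.15):
    """Return (inputs, labels) after applying the 15% / 80-10-10 masking rule."""
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)

    # Sample 15% of the positions (at least one for a non-empty sequence).
    n_select = min(len(inputs), max(1, round(select_prob * len(inputs))))
    for pos in rng.sample(range(len(inputs)), n_select):
        labels[pos] = inputs[pos]  # the original token is the prediction target
        draw = rng.random()
        if draw < 0.8:
            inputs[pos] = mask_id  # 80%: replace with the mask token
        elif draw < 0.9:
            pass  # 10%: leave the token unchanged
        else:
            # 10%: replace with a random token (may coincide with the original)
            inputs[pos] = rng.randrange(vocab_size)
    return inputs, labels


# Usage with hypothetical ids: 100 tokens, so 15 positions are selected.
rng = random.Random(0)
token_ids = list(range(1000, 1100))
inputs, labels = mask_tokens(token_ids, mask_id=4, vocab_size=25000, rng=rng)
```

Positions that were not sampled keep their original token and carry the ignore label, so only the sampled 15% contribute to the masked language modeling loss.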
|
92 |
|
93 |
+
<br>
|
94 |
|
95 |
**Train/Val/Test splits**
|
96 |
|
97 |
We split the data sequentially into 944,782 sentences for the training set, 24,863 sentences for the validation set and 24,862 sentences for the test set.
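
A sequential split of this kind can be sketched as below. The helper name `sequential_split` is hypothetical, and the order of the splits (training first, then validation, then test) is an assumption; the source only gives the three sizes.

```python
def sequential_split(items, n_train, n_val, n_test):
    """Cut `items` into three consecutive, non-overlapping slices."""
    if n_train + n_val + n_test != len(items):
        raise ValueError("split sizes must add up to the number of items")
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Stand-in for the corpus: sentence indices instead of real sentences.
sentences = range(944_782 + 24_863 + 24_862)
train, val, test = sequential_split(sentences, 944_782, 24_863, 24_862)
```

Because no shuffling is involved, the result is deterministic and the three sets never share a sentence position.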
|
98 |
|
99 |
+
<br>
|
100 |
+
|
101 |
+
**Pretraining**
|
102 |
+
|
103 |
+
The model was trained on 32 V100 GPUs for 31,250 steps with a batch size of 8,192 (a per-device batch of 16 sequences with 16 gradient accumulation steps) and a sequence length of 512 tokens. The optimizer is Adam with a learning rate of $7e-4$, $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 1e-6$. The learning rate is warmed up over the first 1,250 steps and then linearly decayed to zero. The checkpoint with the minimum validation loss is selected as the best model checkpoint.
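
The batch-size arithmetic and the learning-rate schedule can be checked with a short sketch. The constant and function names are hypothetical. Two assumptions are made: the figure of 16 per device is read as a per-device batch of 16 sequences, which is the reading that reproduces the stated total of 8,192, and the warmup is taken to be linear from zero, since the source does not state its shape.

```python
NUM_GPUS = 32
PER_DEVICE_BATCH = 16      # assumed: sequences per device per forward pass
ACCUMULATION_STEPS = 16
TOTAL_STEPS = 31_250
WARMUP_STEPS = 1_250
PEAK_LR = 7e-4


def effective_batch_size():
    """Sequences consumed per optimizer step across all devices."""
    return NUM_GPUS * PER_DEVICE_BATCH * ACCUMULATION_STEPS


def learning_rate(step):
    """Linear warmup (assumed to start at zero), then linear decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


print(effective_batch_size())                # 8192
print(effective_batch_size() * TOTAL_STEPS)  # 256000000 sequences seen in total
```

Under these assumptions the learning rate peaks at step 1,250 and reaches zero at step 31,250.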
|
104 |
+
|
105 |
+
<br>
|
106 |
+
|
107 |
+
**BibTeX entry and citation info**
|
108 |
|
109 |
+
[Coming Soon]
|