Add citation for Thai dataset #5
by RaymondAISG - opened

README.md CHANGED
@@ -52,7 +52,8 @@ SEA-LION was trained on 980B tokens of the following data:
 | mC4 - Filipino | 5.3B | 0.54% |
 | mC4 - Burmese | 4.9B | 0.49% |
 | mC4 - Vietnamese | 63.4B | 6.46% |
-| mC4 - Thai | 21.6B | 2.20% |
+| mC4 - Thai | 11.6B | 1.18% |
+| WangChanBERTa - Thai | 10B | 1.02% |
 | mC4 - Lao | 1.1B | 0.12% |
 | mC4 - Khmer | 3.9B | 0.40% |
 | mC4 - Tamil | 10.2B | 1.04% |
@@ -153,3 +154,16 @@ The model has _not_ been aligned for safety.
 Developers and users should perform their own safety fine-tuning and related security measures.
 In no event shall the authors be held liable for any claim, damages, or other liability
 arising from the use of the released weights and codes.
+
+## Citations
+
+```bibtex
+@misc{lowphansirikul2021wangchanberta,
+    title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
+    author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
+    year={2021},
+    eprint={2101.09635},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```