Add citation for Thai dataset

#5
Files changed (1)
  1. README.md +15 -1
README.md CHANGED
@@ -52,7 +52,8 @@ SEA-LION was trained on 980B tokens of the following data:
 | mC4 - Filipino | 5.3B | 0.54% |
 | mC4 - Burmese | 4.9B | 0.49% |
 | mC4 - Vietnamese | 63.4B | 6.46% |
-| mC4 - Thai | 21.6B | 2.20% |
+| mC4 - Thai | 11.6B | 1.18% |
+| WangChanBERTa - Thai | 10B | 1.02% |
 | mC4 - Lao | 1.1B | 0.12% |
 | mC4 - Khmer | 3.9B | 0.40% |
 | mC4 - Tamil | 10.2B | 1.04% |
@@ -153,3 +154,16 @@ The model has _not_ been aligned for safety.
 Developers and users should perform their own safety fine-tuning and related security measures.
 In no event shall the authors be held liable for any claim, damages, or other liability
 arising from the use of the released weights and codes.
+
+## Citations
+
+```bibtex
+@misc{lowphansirikul2021wangchanberta,
+    title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
+    author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
+    year={2021},
+    eprint={2101.09635},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
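A quick sanity check on the revised numbers: the two new Thai rows split the removed row's 21.6B tokens (11.6B + 10B), and the quoted percentages are consistent with the 980B-token total named in the table header. A minimal sketch, illustrative only and not part of the diff, with row names and counts taken from the hunk above:

```python
# Verify that the new Thai rows reproduce the removed row's token count
# and that their percentages match shares of the 980B-token total.
TOTAL_TOKENS_B = 980.0  # total pretraining tokens, in billions

new_rows = {
    "mC4 - Thai": 11.6,            # billions of tokens (added row)
    "WangChanBERTa - Thai": 10.0,  # billions of tokens (added row)
}

for name, tokens_b in new_rows.items():
    share = tokens_b / TOTAL_TOKENS_B
    print(f"{name}: {tokens_b}B -> {share:.2%}")
# mC4 - Thai: 11.6B -> 1.18%
# WangChanBERTa - Thai: 10.0B -> 1.02%

# The split preserves the removed row's total: 11.6B + 10B = 21.6B (2.20%).
assert round(sum(new_rows.values()), 1) == 21.6
```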