Add citation for Thai dataset #5
by RaymondAISG - opened

README.md CHANGED
@@ -52,7 +52,8 @@ SEA-LION was trained on 980B tokens of the following data:
 | mC4 - Filipino | 5.3B | 0.54% |
 | mC4 - Burmese | 4.9B | 0.49% |
 | mC4 - Vietnamese | 63.4B | 6.46% |
-| mC4 - Thai | 21.6B | 2.20% |
+| mC4 - Thai | 11.6B | 1.18% |
+| WangChanBERTa - Thai | 10B | 1.02% |
 | mC4 - Lao | 1.1B | 0.12% |
 | mC4 - Khmer | 3.9B | 0.40% |
 | mC4 - Tamil | 10.2B | 1.04% |
@@ -153,3 +154,16 @@ The model has _not_ been aligned for safety.
 Developers and users should perform their own safety fine-tuning and related security measures.
 In no event shall the authors be held liable for any claim, damages, or other liability
 arising from the use of the released weights and codes.
+
+## Citations
+
+```bibtex
+@misc{lowphansirikul2021wangchanberta,
+    title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
+    author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
+    year={2021},
+    eprint={2101.09635},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```