Initial Commit
README.md
---
language:
- en
- id
- ta
- th
- vi
license: gemma
---
# Gemma2 9B CPT SEA-LIONv3

SEA-LION is a collection of Large Language Models (LLMs) that have been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
This is the model card for the Gemma2 9B CPT SEA-LIONv3 base model, which has undergone continued pre-training from the base [Gemma-2-9B](https://huggingface.co/google/gemma-2-9b) model.

SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.


## Model Details

### Model Description

The continued pre-training data for the Gemma2 9B CPT SEA-LIONv3 base model encompasses approximately 48B tokens.

- **Developed by:** Products Pillar, AI Singapore
- **Funded by:** Singapore NRF
- **Model type:** Decoder
- **Languages:** English, Indonesian, Thai, Vietnamese, Tamil
- **License:** [Gemma Community License](https://ai.google.dev/gemma/terms)

For tokenization, the model employs the default tokenizer used in Gemma-2-9B.
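Since the weights are released in the standard Hugging Face format, a minimal loading sketch with the `transformers` library is shown below. This is an illustration rather than an official quickstart: the repository id is an assumption (the card does not state it), and the bfloat16 setting simply mirrors the training precision listed under Configuration.

```python
# Minimal usage sketch (not from the model card). The repo id below is an
# assumption -- replace it with the actual Hugging Face repository name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/gemma2-9b-cpt-sea-lionv3-base"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)  # Gemma-2-9B tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the training precision
    device_map="auto",
)

# This is a base (non-instruct) model, so use plain text completion
# rather than a chat template.
inputs = tokenizer("Ibu kota Indonesia adalah", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```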
### Benchmark Performance

We evaluated the Gemma2 9B CPT SEA-LIONv3 base model on general language capabilities.

#### General Language Capabilities

For the evaluation of general language capabilities in SEA languages, we employed the [BHASA evaluation benchmark](https://arxiv.org/abs/2309.06085v2) across a variety of tasks.
These tasks include Question Answering (QA), Sentiment Analysis (Sentiment), Toxicity Detection (Toxicity), Translation in both directions (Eng>Lang & Lang>Eng), Abstractive Summarization (Summ), Causal Reasoning (Causal) and Natural Language Inference (NLI).

The evaluation was done **five-shot** with native prompts, using only a sample of 100-1000 instances per dataset, as per the setting described in the paper.

For more details on the benchmark performance of the Gemma2 9B CPT SEA-LIONv3 base model, please refer to the SEA HELM leaderboard: https://leaderboard.sea-lion.ai/
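As a rough illustration of what "five-shot" means in this setup (a generic sketch, not the BHASA prompt templates, which are defined by the benchmark itself), a prompt is built by prepending five labelled exemplars to each test instance:

```python
# Generic five-shot prompting sketch (NOT the BHASA templates; those are
# defined by the benchmark). Five labelled exemplars are prepended to the
# test instance, and the model completes the final answer.
def build_k_shot_prompt(exemplars, test_input, k=5):
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in exemplars[:k]]
    blocks.append(f"Input: {test_input}\nLabel:")
    return "\n\n".join(blocks)

# Hypothetical Indonesian sentiment exemplars, purely for illustration.
exemplars = [
    ("Filmnya sangat bagus!", "Positif"),
    ("Pelayanannya mengecewakan.", "Negatif"),
    ("Makanannya enak sekali.", "Positif"),
    ("Saya tidak akan kembali lagi.", "Negatif"),
    ("Pengalaman yang luar biasa.", "Positif"),
]
print(build_k_shot_prompt(exemplars, "Harganya terlalu mahal.", k=5))
```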
## Training Details

### Data

The Gemma2 9B CPT SEA-LIONv3 base model was continued pre-trained on 48B tokens of the following data:

| Data Source                | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%) |
|----------------------------|:-----------------:|:----------:|:----------------:|:--------------:|
| Dolma RefinedWeb - English | 7.650             | 1          | 7.650            | 15.90          |
| Dolma C4 - English         | 1.160             | 1          | 1.16             | 2.42           |
| Dolma Reddit - English     | 1.339             | 1          | 1.339            | 2.79           |
| Dolma Semantic Scholar     | 0.959             | 1          | 0.959            | 1.99           |
| Dolma arXiv                | 0.469             | 1          | 0.469            | 0.98           |
| Dolma StarCoder            | 4.422             | 1          | 4.422            | 9.21           |
| SEA-LION Pile - Indonesian | 3.4               | 2          | 6.8              | 14.17          |
| Wiki* - Indonesian         | 0.3               | 4          | 1.2              | 2.50           |
| SEA-LION Pile - Tamil      | 5.6               | 1          | 5.6              | 11.67          |
| Wiki* + News - Tamil       | 0.6               | 4          | 2.4              | 5.00           |
| SEA-LION Pile - Thai       | 2.28              | 1          | 2.28             | 4.75           |
| WangChanBERTa - Thai       | 5                 | 1          | 5                | 10.42          |
| Wiki* - Thai               | 0.18              | 4          | 0.72             | 1.50           |
| SEA-LION Pile - Vietnamese | 6.76              | 1          | 6.76             | 14.08          |
| Wiki* - Vietnamese         | 0.31              | 4          | 1.24             | 2.58           |

Note:
- All token counts are counted using the Gemma2 tokenizer
- Wiki* sources include Wikipedia, Wiki Books, Wiki Source and Wiki Voyage
- Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)
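The mixture arithmetic in the table can be reproduced directly from the "Unique Tokens" and "Multiplier" columns (total = unique × multiplier, percentage = total / 48B). A small sanity-check sketch, with the numbers copied from the table:

```python
# Sanity-check of the data-mixture arithmetic in the table above:
# total tokens = unique tokens x multiplier, percentage = total / ~48B.
# Values (in billions of tokens) are copied from the table.
mixture = {
    "Dolma RefinedWeb - English": (7.650, 1),
    "Dolma C4 - English": (1.160, 1),
    "Dolma Reddit - English": (1.339, 1),
    "Dolma Semantic Scholar": (0.959, 1),
    "Dolma arXiv": (0.469, 1),
    "Dolma StarCoder": (4.422, 1),
    "SEA-LION Pile - Indonesian": (3.4, 2),
    "Wiki* - Indonesian": (0.3, 4),
    "SEA-LION Pile - Tamil": (5.6, 1),
    "Wiki* + News - Tamil": (0.6, 4),
    "SEA-LION Pile - Thai": (2.28, 1),
    "WangChanBERTa - Thai": (5, 1),
    "Wiki* - Thai": (0.18, 4),
    "SEA-LION Pile - Vietnamese": (6.76, 1),
    "Wiki* - Vietnamese": (0.31, 4),
}

grand_total = sum(unique * mult for unique, mult in mixture.values())
print(f"Grand total: {grand_total:.3f}B tokens")  # ~48B

for source, (unique, mult) in mixture.items():
    total = unique * mult
    print(f"{source:<28} {total:6.3f}B  {100 * total / grand_total:5.2f}%")
```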
### Infrastructure

Gemma2 9B CPT SEA-LIONv3 was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
on the following hardware:

| Training Details     | Gemma2 9B CPT SEA-LIONv3 |
|----------------------|:------------------------:|
| AWS EC2 p5d.24xlarge | 8 instances              |
| Nvidia H100 80GB GPU | 64                       |
| Training Duration    | 2 days                   |
### Configuration

| HyperParameter    | Gemma2 9B CPT SEA-LIONv3 |
|-------------------|:------------------------:|
| Precision         | bfloat16                 |
| Optimizer         | decoupled_adamw          |
| Scheduler         | weight_stable_decay      |
| Learning Rate     | 1.0e-5                   |
| Global Batch Size | 512                      |
| Micro Batch Size  | 2                        |
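For context on how the batch-size figures fit together: assuming the usual data-parallel relationship (global batch = micro batch × gradient accumulation steps × number of GPUs; the accumulation steps are not stated in the card and are derived here), the settings above imply 4 accumulation steps per optimizer step.

```python
# Relationship between the batch-size settings above, assuming standard
# data-parallel training. Gradient accumulation is not stated in the card
# and is derived here, so treat it as an inference rather than a fact.
n_gpus = 64          # from the Infrastructure table
micro_batch = 2      # per-GPU micro batch size
global_batch = 512   # global batch size

grad_accum_steps = global_batch // (micro_batch * n_gpus)
print(grad_accum_steps)  # -> 4
```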
## The Team

Choa Esther<br>
Cheng Nicholas<br>
Huang Yuli<br>
Lau Wayne<br>
Lee Chwan Ren<br>
Leong Wai Yi<br>
Leong Wei Qi<br>
Li Yier<br>
Liu Bing Jie Darius<br>
Lovenia Holy<br>
Montalan Jann Railey<br>
Ng Boon Cheong Raymond<br>
Ngui Jian Gang<br>
Nguyen Thanh Ngan<br>
Ong Brandon<br>
Ong Tat-Wee David<br>
Ong Zhi Hao<br>
Rengarajan Hamsawardhini<br>
Siow Bryan<br>
Susanto Yosephine<br>
Tai Ngee Chia<br>
Tan Choon Meng<br>
Teo Eng Sipp Leslie<br>
Teo Wei Yi<br>
Tjhi William<br>
Teng Walter<br>
Yeo Yeow Tong<br>
Yong Xianbin<br>
## Acknowledgements

AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.


## Contact

For more info, please contact us using this [SEA-LION Inquiry Form](https://forms.gle/sLCUVb95wmGf43hi6)

[Link to SEA-LION's GitHub repository](https://github.com/aisingapore/sealion)
## Disclaimer

This is the repository for the base model.
The model has _not_ been aligned for safety.
Developers and users should perform their own safety fine-tuning and related security measures.
In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights and code.
## References

### Thai Pre-Training Data Reference

```bibtex
@misc{lowphansirikul2021wangchanberta,
      title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
      author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
      year={2021},
      eprint={2101.09635},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```