tainc committed
Commit 68a945c · verified · 1 Parent(s): cd47a06

Initial Commit

Files changed (1): README.md (+162, -3)
README.md CHANGED
(The previous version contained only the `license: gemma` front matter; the full model card below replaces it.)
---
language:
- en
- id
- ta
- th
- vi
license: gemma
---
# Gemma2 9B CPT SEA-LIONv3

SEA-LION is a collection of Large Language Models (LLMs) that have been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
This is the model card for the Gemma2 9B CPT SEA-LIONv3 base model, which has undergone continued pre-training from the base [Gemma-2-9B](https://huggingface.co/google/gemma-2-9b) model.

SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.

## Model Details

### Model Description

The continued pre-training data for the Gemma2 9B CPT SEA-LIONv3 base model encompasses approximately 48B tokens.

- **Developed by:** Products Pillar, AI Singapore
- **Funded by:** Singapore NRF
- **Model type:** Decoder
- **Languages:** English, Indonesian, Thai, Vietnamese, Tamil
- **License:** [Gemma Community License](https://ai.google.dev/gemma/terms)

For tokenization, the model employs the default tokenizer used in Gemma-2-9B.
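
The snippet below is a minimal usage sketch, not an official example from this card: it loads the base model with the Hugging Face `transformers` library, and the repository ID shown is a placeholder to be replaced with the actual model ID.

```python
# Minimal inference sketch (assumptions: `transformers` and `torch` are installed,
# and MODEL_ID points at the actual Hugging Face repository for this model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "aisingapore/gemma2-9b-cpt-sea-lionv3-base"  # placeholder repository ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # matches the bfloat16 training precision listed below
    device_map="auto",
)

# This is a base (non-instruct) model with no chat template, so prompt it with raw text.
inputs = tokenizer("Ibu kota Indonesia adalah", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

As a base model, it is best suited to completion-style prompting or as a starting point for further fine-tuning rather than direct instruction following.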

### Benchmark Performance
We evaluated the Gemma2 9B CPT SEA-LIONv3 base model on general language capabilities.

#### General Language Capabilities
For the evaluation of general language capabilities in SEA languages, we employed the [BHASA evaluation benchmark](https://arxiv.org/abs/2309.06085v2) across a variety of tasks.
These tasks include Question Answering (QA), Sentiment Analysis (Sentiment), Toxicity Detection (Toxicity), Translation in both directions (Eng>Lang & Lang>Eng), Abstractive Summarization (Summ), Causal Reasoning (Causal) and Natural Language Inference (NLI).

The evaluation was done **five-shot** with native prompts, using only a sample of 100-1000 instances per dataset, as per the setting described in the paper.
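
As a rough illustration of the five-shot setup (not BHASA's actual native-language prompt templates), a k-shot prompt can be assembled from labelled exemplars as follows:

```python
# Illustrative k-shot prompt assembly; the exemplars are made up for this sketch.
def build_few_shot_prompt(exemplars, query, k=5):
    """Concatenate k (input, label) exemplars followed by the unanswered query."""
    blocks = [f"Input: {text}\nLabel: {label}" for text, label in exemplars[:k]]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)

# Hypothetical Indonesian sentiment exemplars.
exemplars = [
    ("Saya sangat suka filem ini.", "positif"),
    ("Pelayanannya buruk sekali.", "negatif"),
    ("Makanannya enak dan murah.", "positif"),
    ("Saya kecewa dengan produk ini.", "negatif"),
    ("Pemandangannya indah sekali.", "positif"),
]
print(build_few_shot_prompt(exemplars, "Filem itu membosankan.", k=5))
```

The model's completion after the final "Label:" is then compared against the gold label.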

For more details on the benchmark performance of the Gemma2 9B CPT SEA-LIONv3 base model, please refer to the [SEA HELM leaderboard](https://leaderboard.sea-lion.ai/).

## Training Details

### Data

The Gemma2 9B CPT SEA-LIONv3 base model underwent continued pre-training on 48B tokens of the following data:

| Data Source                | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%) |
|----------------------------|:-----------------:|:----------:|:----------------:|:--------------:|
| Dolma RefinedWeb - English | 7.650             | 1          | 7.650            | 15.90          |
| Dolma C4 - English         | 1.160             | 1          | 1.160            | 2.42           |
| Dolma Reddit - English     | 1.339             | 1          | 1.339            | 2.79           |
| Dolma Semantic Scholar     | 0.959             | 1          | 0.959            | 2.00           |
| Dolma arXiv                | 0.469             | 1          | 0.469            | 0.98           |
| Dolma StarCoder            | 4.422             | 1          | 4.422            | 9.21           |
| SEA-LION Pile - Indonesian | 3.4               | 2          | 6.8              | 14.17          |
| Wiki* - Indonesian         | 0.3               | 4          | 1.2              | 2.50           |
| SEA-LION Pile - Tamil      | 5.6               | 1          | 5.6              | 11.67          |
| Wiki* + News - Tamil       | 0.6               | 4          | 2.4              | 5.00           |
| SEA-LION Pile - Thai       | 2.28              | 1          | 2.28             | 4.75           |
| WangChanBERTa - Thai       | 5                 | 1          | 5                | 10.42          |
| Wiki* - Thai               | 0.18              | 4          | 0.72             | 1.50           |
| SEA-LION Pile - Vietnamese | 6.76              | 1          | 6.76             | 14.08          |
| Wiki* - Vietnamese         | 0.31              | 4          | 1.24             | 2.58           |

Note:
- All token counts are computed using the Gemma2 tokenizer
- Wiki* sources include Wikipedia, Wiki Books, Wiki Source and Wiki Voyage
- Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)
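
As a sanity check on the mixture, each row's Total Tokens is Unique Tokens × Multiplier, and its Percentage is that total over the ~48B-token budget. The short sketch below (not part of the original data pipeline) recomputes both from the table:

```python
# Recompute Total Tokens and Percentage from the (unique_tokens_B, multiplier) pairs above.
data_mix = {
    "Dolma RefinedWeb - English": (7.650, 1),
    "Dolma C4 - English": (1.160, 1),
    "Dolma Reddit - English": (1.339, 1),
    "Dolma Semantic Scholar": (0.959, 1),
    "Dolma arXiv": (0.469, 1),
    "Dolma StarCoder": (4.422, 1),
    "SEA-LION Pile - Indonesian": (3.4, 2),
    "Wiki* - Indonesian": (0.3, 4),
    "SEA-LION Pile - Tamil": (5.6, 1),
    "Wiki* + News - Tamil": (0.6, 4),
    "SEA-LION Pile - Thai": (2.28, 1),
    "WangChanBERTa - Thai": (5.0, 1),
    "Wiki* - Thai": (0.18, 4),
    "SEA-LION Pile - Vietnamese": (6.76, 1),
    "Wiki* - Vietnamese": (0.31, 4),
}

totals = {name: unique * mult for name, (unique, mult) in data_mix.items()}
budget = sum(totals.values())  # roughly 48B tokens in total

for name, total in totals.items():
    print(f"{name:<28} {total:6.3f}B  {100 * total / budget:5.2f}%")
print(f"{'Total':<28} {budget:6.3f}B")
```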

### Infrastructure

Gemma2 9B CPT SEA-LIONv3 was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
on the following hardware:

| Training Details     | Gemma2 9B CPT SEA-LIONv3 |
|----------------------|:------------------------:|
| AWS EC2 p5d.24xlarge | 8 instances              |
| Nvidia H100 80GB GPU | 64                       |
| Training Duration    | 2 days                   |

### Configuration

| HyperParameter    | Gemma2 9B CPT SEA-LIONv3 |
|-------------------|:------------------------:|
| Precision         | bfloat16                 |
| Optimizer         | decoupled_adamw          |
| Scheduler         | weight_stable_decay      |
| Learning Rate     | 1.0e-5                   |
| Global Batch Size | 512                      |
| Micro Batch Size  | 2                        |
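
For orientation, the sketch below shows how the optimizer setting above maps onto Composer's `DecoupledAdamW`, together with the gradient-accumulation arithmetic implied by the batch sizes. It is an illustrative stand-in, not the actual training script, and the tiny linear layer is a placeholder for the 9B-parameter model.

```python
# Illustrative only: maps the configuration table onto Composer's decoupled AdamW.
import torch
from composer.optim import DecoupledAdamW

stand_in_model = torch.nn.Linear(16, 16)  # placeholder, not Gemma2 9B
optimizer = DecoupledAdamW(stand_in_model.parameters(), lr=1.0e-5)

# With a global batch size of 512, a micro batch size of 2 per GPU, and 64 GPUs,
# each GPU accumulates 512 / (64 * 2) = 4 micro batches per optimizer step.
grad_accum_steps = 512 // (64 * 2)
print(optimizer, grad_accum_steps)
```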

## The Team

Choa Esther<br>
Cheng Nicholas<br>
Huang Yuli<br>
Lau Wayne<br>
Lee Chwan Ren<br>
Leong Wai Yi<br>
Leong Wei Qi<br>
Li Yier<br>
Liu Bing Jie Darius<br>
Lovenia Holy<br>
Montalan Jann Railey<br>
Ng Boon Cheong Raymond<br>
Ngui Jian Gang<br>
Nguyen Thanh Ngan<br>
Ong Brandon<br>
Ong Tat-Wee David<br>
Ong Zhi Hao<br>
Rengarajan Hamsawardhini<br>
Siow Bryan<br>
Susanto Yosephine<br>
Tai Ngee Chia<br>
Tan Choon Meng<br>
Teo Eng Sipp Leslie<br>
Teo Wei Yi<br>
Tjhi William<br>
Teng Walter<br>
Yeo Yeow Tong<br>
Yong Xianbin<br>

## Acknowledgements

AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.

## Contact

For more information, please contact us using this [SEA-LION Inquiry Form](https://forms.gle/sLCUVb95wmGf43hi6).

[Link to SEA-LION's GitHub repository](https://github.com/aisingapore/sealion)

## Disclaimer

This is the repository for the base model.
The model has _not_ been aligned for safety.
Developers and users should perform their own safety fine-tuning and related security measures.
In no event shall the authors be held liable for any claim, damages, or other liability
arising from the use of the released weights and code.

## References
### Thai Pre-Training Data Reference

```bibtex
@misc{lowphansirikul2021wangchanberta,
      title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
      author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
      year={2021},
      eprint={2101.09635},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```