Initial Commit
README.md
---
language:
- en
- id
- ta
- th
- vi
license: gemma
---
# Gemma2 9B CPT SEA-LIONv3

SEA-LION is a collection of Large Language Models (LLMs) that have been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
This is the model card for the Gemma2 9B CPT SEA-LIONv3 base model, which has undergone continued pre-training from the base [Gemma-2-9B](https://huggingface.co/google/gemma-2-9b) model.

SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.


## Model Details

### Model Description

The continued pre-training data for the Gemma2 9B CPT SEA-LIONv3 base model encompasses approximately 48B tokens.

- **Developed by:** Products Pillar, AI Singapore
- **Funded by:** Singapore NRF
- **Model type:** Decoder
- **Languages:** English, Indonesian, Thai, Vietnamese, Tamil
- **License:** [Gemma Community License](https://ai.google.dev/gemma/terms)

For tokenization, the model employs the default tokenizer used in Gemma-2-9B.
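Since the weights are released in the standard Hugging Face format, a minimal loading sketch with the `transformers` library is shown below. This is an illustration rather than an official quickstart: the repository id is an assumption (the card does not state it), and the bfloat16 setting simply mirrors the training precision listed under Configuration.

```python
# Minimal usage sketch (not from the model card). The repo id below is an
# assumption -- replace it with the actual Hugging Face repository name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/gemma2-9b-cpt-sea-lionv3-base"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)  # Gemma-2-9B tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the training precision
    device_map="auto",
)

# This is a base (non-instruct) model, so use plain text completion
# rather than a chat template.
inputs = tokenizer("Ibu kota Indonesia adalah", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```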
### Benchmark Performance

We evaluated the Gemma2 9B CPT SEA-LIONv3 base model on general language capabilities.

#### General Language Capabilities

For the evaluation of general language capabilities in SEA languages, we employed the [BHASA evaluation benchmark](https://arxiv.org/abs/2309.06085v2) across a variety of tasks.
These tasks include Question Answering (QA), Sentiment Analysis (Sentiment), Toxicity Detection (Toxicity), Translation in both directions (Eng>Lang & Lang>Eng), Abstractive Summarization (Summ), Causal Reasoning (Causal) and Natural Language Inference (NLI).

The evaluation was done **five-shot** with native prompts, using only a sample of 100-1000 instances per dataset, as per the setting described in the paper.

For more details on the benchmark performance of the Gemma2 9B CPT SEA-LIONv3 base model, please refer to the SEA HELM leaderboard: https://leaderboard.sea-lion.ai/
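As a rough illustration of what "five-shot" means in this setup (a generic sketch, not the BHASA prompt templates, which are defined by the benchmark itself), a prompt is built by prepending five labelled exemplars to each test instance:

```python
# Generic five-shot prompting sketch (NOT the BHASA templates; those are
# defined by the benchmark). Five labelled exemplars are prepended to the
# test instance, and the model completes the final answer.
def build_k_shot_prompt(exemplars, test_input, k=5):
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in exemplars[:k]]
    blocks.append(f"Input: {test_input}\nLabel:")
    return "\n\n".join(blocks)

# Hypothetical Indonesian sentiment exemplars, purely for illustration.
exemplars = [
    ("Filmnya sangat bagus!", "Positif"),
    ("Pelayanannya mengecewakan.", "Negatif"),
    ("Makanannya enak sekali.", "Positif"),
    ("Saya tidak akan kembali lagi.", "Negatif"),
    ("Pengalaman yang luar biasa.", "Positif"),
]
print(build_k_shot_prompt(exemplars, "Harganya terlalu mahal.", k=5))
```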
## Training Details

### Data

The Gemma2 9B CPT SEA-LIONv3 base model was continued pre-trained on 48B tokens of the following data:

| Data Source                | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%) |
|----------------------------|:-----------------:|:----------:|:----------------:|:--------------:|
| Dolma RefinedWeb - English | 7.650             | 1          | 7.650            | 15.90          |
| Dolma C4 - English         | 1.160             | 1          | 1.16             | 2.42           |
| Dolma Reddit - English     | 1.339             | 1          | 1.339            | 2.79           |
| Dolma Semantic Scholar     | 0.959             | 1          | 0.959            | 1.99           |
| Dolma arXiv                | 0.469             | 1          | 0.469            | 0.98           |
| Dolma StarCoder            | 4.422             | 1          | 4.422            | 9.21           |
| SEA-LION Pile - Indonesian | 3.4               | 2          | 6.8              | 14.17          |
| Wiki* - Indonesian         | 0.3               | 4          | 1.2              | 2.50           |
| SEA-LION Pile - Tamil      | 5.6               | 1          | 5.6              | 11.67          |
| Wiki* + News - Tamil       | 0.6               | 4          | 2.4              | 5.00           |
| SEA-LION Pile - Thai       | 2.28              | 1          | 2.28             | 4.75           |
| WangChanBERTa - Thai       | 5                 | 1          | 5                | 10.42          |
| Wiki* - Thai               | 0.18              | 4          | 0.72             | 1.50           |
| SEA-LION Pile - Vietnamese | 6.76              | 1          | 6.76             | 14.08          |
| Wiki* - Vietnamese         | 0.31              | 4          | 1.24             | 2.58           |

Note:
- All token counts are counted using the Gemma2 tokenizer
- Wiki* sources include Wikipedia, Wiki Books, Wiki Source and Wiki Voyage
- Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)
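The mixture arithmetic in the table can be reproduced directly from the "Unique Tokens" and "Multiplier" columns (total = unique × multiplier, percentage = total / 48B). A small sanity-check sketch, with the numbers copied from the table:

```python
# Sanity-check of the data-mixture arithmetic in the table above:
# total tokens = unique tokens x multiplier, percentage = total / ~48B.
# Values (in billions of tokens) are copied from the table.
mixture = {
    "Dolma RefinedWeb - English": (7.650, 1),
    "Dolma C4 - English": (1.160, 1),
    "Dolma Reddit - English": (1.339, 1),
    "Dolma Semantic Scholar": (0.959, 1),
    "Dolma arXiv": (0.469, 1),
    "Dolma StarCoder": (4.422, 1),
    "SEA-LION Pile - Indonesian": (3.4, 2),
    "Wiki* - Indonesian": (0.3, 4),
    "SEA-LION Pile - Tamil": (5.6, 1),
    "Wiki* + News - Tamil": (0.6, 4),
    "SEA-LION Pile - Thai": (2.28, 1),
    "WangChanBERTa - Thai": (5, 1),
    "Wiki* - Thai": (0.18, 4),
    "SEA-LION Pile - Vietnamese": (6.76, 1),
    "Wiki* - Vietnamese": (0.31, 4),
}

grand_total = sum(unique * mult for unique, mult in mixture.values())
print(f"Grand total: {grand_total:.3f}B tokens")  # ~48B

for source, (unique, mult) in mixture.items():
    total = unique * mult
    print(f"{source:<28} {total:6.3f}B  {100 * total / grand_total:5.2f}%")
```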
### Infrastructure

Gemma2 9B CPT SEA-LIONv3 was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
on the following hardware:

| Training Details     | Gemma2 9B CPT SEA-LIONv3 |
|----------------------|:------------------------:|
| AWS EC2 p5d.24xlarge | 8 instances              |
| Nvidia H100 80GB GPU | 64                       |
| Training Duration    | 2 days                   |
### Configuration

| HyperParameter    | Gemma2 9B CPT SEA-LIONv3 |
|-------------------|:------------------------:|
| Precision         | bfloat16                 |
| Optimizer         | decoupled_adamw          |
| Scheduler         | weight_stable_decay      |
| Learning Rate     | 1.0e-5                   |
| Global Batch Size | 512                      |
| Micro Batch Size  | 2                        |
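For context on how the batch-size figures fit together: assuming the usual data-parallel relationship (global batch = micro batch × gradient accumulation steps × number of GPUs; the accumulation steps are not stated in the card and are derived here), the settings above imply 4 accumulation steps per optimizer step.

```python
# Relationship between the batch-size settings above, assuming standard
# data-parallel training. Gradient accumulation is not stated in the card
# and is derived here, so treat it as an inference rather than a fact.
n_gpus = 64          # from the Infrastructure table
micro_batch = 2      # per-GPU micro batch size
global_batch = 512   # global batch size

grad_accum_steps = global_batch // (micro_batch * n_gpus)
print(grad_accum_steps)  # -> 4
```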
## The Team

Choa Esther<br>
Cheng Nicholas<br>
Huang Yuli<br>
Lau Wayne<br>
Lee Chwan Ren<br>
Leong Wai Yi<br>
Leong Wei Qi<br>
Li Yier<br>
Liu Bing Jie Darius<br>
Lovenia Holy<br>
Montalan Jann Railey<br>
Ng Boon Cheong Raymond<br>
Ngui Jian Gang<br>
Nguyen Thanh Ngan<br>
Ong Brandon<br>
Ong Tat-Wee David<br>
Ong Zhi Hao<br>
Rengarajan Hamsawardhini<br>
Siow Bryan<br>
Susanto Yosephine<br>
Tai Ngee Chia<br>
Tan Choon Meng<br>
Teo Eng Sipp Leslie<br>
Teo Wei Yi<br>
Tjhi William<br>
Teng Walter<br>
Yeo Yeow Tong<br>
Yong Xianbin<br>
## Acknowledgements

AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.


## Contact

For more info, please contact us using this [SEA-LION Inquiry Form](https://forms.gle/sLCUVb95wmGf43hi6)

[Link to SEA-LION's GitHub repository](https://github.com/aisingapore/sealion)
## Disclaimer

This is the repository for the base model.
The model has _not_ been aligned for safety.
Developers and users should perform their own safety fine-tuning and related security measures.
In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights and code.
## References

### Thai Pre-Training Data Reference

```bibtex
@misc{lowphansirikul2021wangchanberta,
      title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
      author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
      year={2021},
      eprint={2101.09635},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```