---
license: mit
---
# SEA-LION
SEA-LION is a collection of Large Language Models (LLMs) that have been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
The models range in size from 3 billion to 7 billion parameters.
This is the card for the SEA-LION 7B model.
SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.
## Model Details
### Model Description
The SEA-LION model is trained specifically to understand the languages and regional context of Southeast Asia.
SEA-LION is built on the MPT architecture and has a vocabulary size of 256K.
For tokenization, the model employs our custom SEABPETokenizer, which is tailored for SEA languages to ensure optimal model performance.
The training data for SEA-LION encompasses 980B tokens.
- **Developed by:** Products Pillar, AI Singapore
- **Funded by:** Singapore NRF
- **Model type:** Decoder
- **Languages:** English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, Lao
- **License:** MIT License
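
### Usage

The model can be loaded with the Hugging Face `transformers` library. The snippet below is a minimal sketch: the repository id `aisingapore/sea-lion-7b` and the need for `trust_remote_code=True` (for the custom MPT implementation and SEABPETokenizer) are assumptions, not guarantees from this card.

```python
# Minimal loading sketch; repository id and trust_remote_code requirement are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/sea-lion-7b"  # assumption: the actual Hub repository id may differ

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Selamat pagi, apa khabar?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```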
## Training Details
### Data
SEA-LION was trained on 980B tokens of the following data:
| Data Source | Tokens | Percentage |
|---------------------------|-------:|:----------:|
| RefinedWeb - English | 571.3B | 58.20% |
| mC4 - Chinese | 91.2B | 9.29% |
| mC4 - Indonesian | 14.7B | 1.50% |
| mC4 - Malay | 2.9B | 0.29% |
| mC4 - Filipino | 5.3B | 0.54% |
| mC4 - Burmese | 4.9B | 0.49% |
| mC4 - Vietnamese | 63.4B | 6.46% |
| mC4 - Thai | 21.6B | 2.20% |
| mC4 - Lao | 1.1B | 0.12% |
| mC4 - Khmer | 3.9B | 0.40% |
| mC4 - Tamil | 10.2B | 1.04% |
| the Stack - Python | 41.8B | 4.26% |
| the Stack - JavaScript    | 55.6B  | 5.66%      |
| the Stack - Shell | 2.5B | 0.26% |
| the Stack - SQL | 12.8B | 1.31% |
| the Stack - Markdown | 26.6B | 2.71% |
| RedPajama - StackExchange | 21.2B | 2.16% |
| RedPajama - ArXiv | 30.6B | 3.12% |
### Infrastructure
SEA-LION was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
on the following hardware:
| Training Details | SEA-LION 7B |
|----------------------|:------------:|
| AWS EC2 p4d.24xlarge | 32 instances |
| Nvidia A100 40GB GPU | 256 |
| Training Duration | 22 days |
### Configuration
| HyperParameter | SEA-LION 7B |
|-------------------|:------------------:|
| Precision | bfloat16 |
| Optimizer | decoupled_adamw |
| Scheduler | cosine_with_warmup |
| Learning Rate | 6.0e-5 |
| Global Batch Size | 2048 |
| Micro Batch Size | 4 |
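
Assuming plain data-parallel training, the global batch size, per-device micro batch size, and GPU count from the tables above imply the following back-of-the-envelope arithmetic (a consistency check, not figures taken from the training logs):

```python
# Rough batch-size arithmetic from the tables above (assumes pure data parallelism).
global_batch_size = 2048   # sequences per optimizer step
micro_batch_size = 4       # sequences per GPU per forward/backward pass
num_gpus = 256             # 32 x p4d.24xlarge instances, 8 A100s each
seq_len = 2048             # sequence length from the architecture table

grad_accum_steps = global_batch_size // (micro_batch_size * num_gpus)
tokens_per_step = global_batch_size * seq_len

print(grad_accum_steps)  # 2 gradient-accumulation steps per optimizer update
print(tokens_per_step)   # 4,194,304 (~4.2M) tokens per optimizer step
```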
## Technical Specifications
### Model Architecture and Objective
SEA-LION is a decoder model using the MPT architecture.
| Parameter | SEA-LION 7B |
|-----------------|:-----------:|
| Layers | 32 |
| d_model | 4096 |
| head_dim | 32 |
| Vocabulary | 256000 |
| Sequence Length | 2048 |
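
As a rough consistency check (an estimate, not an official parameter count), the table values place the model in the 7B range, with the large 256K-token embedding matrix contributing a sizeable share:

```python
# Back-of-the-envelope parameter estimate from the architecture table above.
d_model = 4096
n_layers = 32
vocab_size = 256_000

embedding_params = vocab_size * d_model            # ~1.05B
per_layer_params = 12 * d_model ** 2               # ~201M per block (attention + MLP, ignoring biases/norms)
transformer_params = n_layers * per_layer_params   # ~6.44B

total = embedding_params + transformer_params
print(f"~{total / 1e9:.1f}B parameters")           # roughly 7.5B before any weight tying
```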
### Tokenizer Details
We sampled 20M lines from the training data to train the tokenizer.<br>
The framework used for training is [SentencePiece](https://github.com/google/sentencepiece).<br>
The tokenizer type is Byte-Pair Encoding (BPE).
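
A minimal sketch of how such a tokenizer could be trained with SentencePiece is shown below; the input file name, character coverage, and byte-fallback setting are illustrative assumptions, not the actual training recipe.

```python
# Illustrative SentencePiece BPE training sketch; file names and options are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="sampled_20m_lines.txt",  # hypothetical file holding the 20M sampled lines
    model_prefix="seabpe",          # hypothetical output prefix
    vocab_size=256_000,             # matches the 256K vocabulary above
    model_type="bpe",               # Byte-Pair Encoding, as stated above
    character_coverage=0.9995,      # assumed setting for multilingual text
    byte_fallback=True,             # assumed, to cover characters unseen in training
)

sp = spm.SentencePieceProcessor(model_file="seabpe.model")
print(sp.encode("Selamat pagi", out_type=str))
```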
## The Team
Lam Zhiwen Clarence<br>
Leong Wei Qi<br>
Li Yier<br>
Liu Darius<br>
Lovenia Holy<br>
Montalan Jann Railey<br>
Ng Raymond<br>
Ngui Jian Gang<br>
Nguyen Thanh Ngan<br>
Ong Tat-Wee David<br>
Rengarajan Hamsawardhini<br>
Susanto Yosephine<br>
Tai Ngee Chia<br>
Tan Choon Meng<br>
Teo Jin Howe<br>
Teo Leslie<br>
Teo Wei Yi<br>
Tjhi William<br>
Yeo Yeow Tong<br>
Yong Xianbin<br>
## Contact
For more info, please contact us at seallm@aisingapore.org