---
license: mit
---
# SEA-LION

SEA-LION is a collection of LLMs that have been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
The models range from 3 billion to 7 billion parameters.
This is the card for the SEA-LION 7B model.

SEA-LION stands for <i>Southeast Asia Languages In One Network</i>.


## Model Details

### Model Description

The SEA-LION model is a language model trained specifically to understand
the Southeast Asia (SEA) regional context.

SEA-LION is built on the MPT architecture and uses a vocabulary of 256K tokens.

The model uses our custom SEABPETokenizer for tokenization.
SEABPETokenizer is tailored specifically to SEA languages.

The training data for SEA-LION encompasses 980B tokens.

- **Developed by:** Products Pillar, AI Singapore
- **Funded by:** Singapore NRF
- **Model type:** Decoder
- **Languages:** English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, Lao
- **License:** MIT License


## Training Details

### Data

SEA-LION was trained on 980B tokens of the following data:

| Data Source               | Tokens | Percentage |
|---------------------------|-------:|:----------:|
| RefinedWeb - English      | 571.3B |     58.20% |
| mC4 - Chinese             |  91.2B |      9.29% |
| mC4 - Indonesian          |  14.7B |      1.50% |
| mC4 - Malay               |   2.9B |      0.29% |
| mC4 - Filipino            |   5.3B |      0.54% |
| mC4 - Burmese             |   4.9B |      0.49% |
| mC4 - Vietnamese          |  63.4B |      6.46% |
| mC4 - Thai                |  21.6B |      2.20% |
| mC4 - Lao                 |   1.1B |      0.12% |
| mC4 - Khmer               |   3.9B |      0.40% |
| mC4 - Tamil               |  10.2B |      1.04% |
| The Stack - Python        |  41.8B |      4.26% |
| The Stack - JavaScript    |  55.6B |      5.66% |
| The Stack - Shell         |   2.5B |      0.26% |
| The Stack - SQL           |  12.8B |      1.31% |
| The Stack - Markdown      |  26.6B |      2.71% |
| RedPajama - StackExchange |  21.2B |      2.16% |
| RedPajama - ArXiv         |  30.6B |      3.12% |

### Infrastructure

SEA-LION was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
on the following hardware:

| Training Details     | SEA-LION 7B  |
|----------------------|:------------:|
| AWS EC2 p4d.24xlarge | 32 instances |
| NVIDIA A100 40GB GPU | 256          |
| Training Duration    | 22 days      |
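
The figures in the table can be cross-checked with a little arithmetic. The sketch below assumes 8 GPUs per p4d.24xlarge instance (the standard configuration for that instance type) and treats the 22-day duration as continuous wall-clock time, which is an approximation. The function names are illustrative only.

```python
INSTANCES = 32
GPUS_PER_INSTANCE = 8   # assumption: standard p4d.24xlarge configuration
GPU_MEMORY_GB = 40
TRAINING_DAYS = 22


def total_gpus(instances: int, gpus_per_instance: int) -> int:
    """Number of GPUs across the whole cluster."""
    return instances * gpus_per_instance


def gpu_hours(n_gpus: int, days: int) -> int:
    """GPU-hours consumed, assuming uninterrupted training."""
    return n_gpus * days * 24


n_gpus = total_gpus(INSTANCES, GPUS_PER_INSTANCE)
print(f"GPUs: {n_gpus}")
print(f"Aggregate GPU memory: {n_gpus * GPU_MEMORY_GB} GB")
print(f"Approximate GPU-hours: {gpu_hours(n_gpus, TRAINING_DAYS)}")
```

The GPU count derived this way matches the 256 listed in the table.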


### Configuration

| Hyperparameter    | SEA-LION 7B        |
|-------------------|:------------------:|
| Precision         | bfloat16           |
| Optimizer         | decoupled_adamw    |
| Scheduler         | cosine_with_warmup |
| Learning Rate     | 6.0e-5             |
| Global Batch Size | 2048               |
| Micro Batch Size  | 4                  |
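
To show how the batch sizes relate to each other, here is a minimal sketch. It assumes the micro batch size is per GPU (as in Composer's per-device microbatching), that all 256 GPUs take part in data parallelism, and that every sequence is packed to the full 2048-token sequence length. None of these details are stated in this card, so treat the derived numbers as estimates.

```python
def grad_accum_steps(global_batch: int, micro_batch: int, n_devices: int) -> int:
    """Micro-batches each device processes per optimizer step."""
    per_pass = micro_batch * n_devices
    if global_batch % per_pass != 0:
        raise ValueError("global batch must be a multiple of micro_batch * n_devices")
    return global_batch // per_pass


def tokens_per_step(global_batch: int, seq_len: int) -> int:
    """Tokens consumed by one optimizer step with fully packed sequences."""
    return global_batch * seq_len


def optimizer_steps(total_tokens: int, global_batch: int, seq_len: int) -> int:
    """Optimizer steps needed to see total_tokens once (rounded up)."""
    per_step = tokens_per_step(global_batch, seq_len)
    return -(-total_tokens // per_step)  # ceiling division


accum = grad_accum_steps(global_batch=2048, micro_batch=4, n_devices=256)
per_step = tokens_per_step(global_batch=2048, seq_len=2048)
steps = optimizer_steps(980_000_000_000, global_batch=2048, seq_len=2048)
print(accum, per_step, steps)
```

Under these assumptions each GPU accumulates gradients over two micro-batches per step, and one step covers about 4.2M tokens.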


## Technical Specifications

### Model Architecture and Objective

SEA-LION is a decoder model using the MPT architecture.

| Parameter       | SEA-LION 7B |
|-----------------|:-----------:|
| Layers          | 32          |
| d_model         | 4096        |
| head_dim        | 32          |
| Vocabulary      | 256000      |
| Sequence Length | 2048        |
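
As a rough sanity check on the "7B" label, the parameter count can be estimated from the table. This sketch assumes a standard MPT-style block (attention with fused QKV and output projections, an MLP with 4x expansion, no bias terms), tied input and output embeddings, and no learned positional embeddings. It ignores the small layer-norm weights. These are assumptions about the architecture, not details confirmed by this card.

```python
def estimate_params(n_layers: int, d_model: int, vocab_size: int,
                    mlp_ratio: int = 4) -> int:
    """Approximate parameter count of a decoder-only transformer."""
    embedding = vocab_size * d_model            # shared with the output head
    attention = 4 * d_model * d_model           # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d_model * d_model     # up and down projections
    return embedding + n_layers * (attention + mlp)


total = estimate_params(n_layers=32, d_model=4096, vocab_size=256_000)
embedding_share = (256_000 * 4096) / total
print(f"~{total / 1e9:.2f}B parameters, "
      f"{embedding_share:.0%} of them in the embedding table")
```

With a 256K vocabulary, the embedding table alone accounts for roughly a billion parameters, a noticeably larger share than in models with a smaller vocabulary.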


### Tokenizer Details

We sample 20M lines from the training data to train the tokenizer.<br>
The framework for training is [SentencePiece](https://github.com/google/sentencepiece).<br>
The tokenizer type is Byte-Pair Encoding (BPE).
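
A comparable tokenizer could be trained along the following lines. This is a minimal sketch, not the team's actual training script: the corpus path and model prefix are hypothetical, and using SentencePiece's built-in `input_sentence_size` sampling is an assumption about how the 20M lines were drawn.

```python
def build_trainer_args(corpus_path: str, model_prefix: str) -> dict:
    """Keyword arguments for SentencePieceTrainer.train (illustrative values)."""
    return {
        "input": corpus_path,               # hypothetical path, one sentence per line
        "model_prefix": model_prefix,       # writes <prefix>.model and <prefix>.vocab
        "model_type": "bpe",                # Byte-Pair Encoding
        "vocab_size": 256_000,              # matches the vocabulary size above
        "input_sentence_size": 20_000_000,  # sample 20M lines from the corpus
        "shuffle_input_sentence": True,     # sample randomly rather than from the top
    }


def train_tokenizer(corpus_path: str, model_prefix: str) -> None:
    """Train a BPE tokenizer; requires the sentencepiece package."""
    import sentencepiece as spm  # imported lazily so the config can be inspected without it

    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path, model_prefix))


args = build_trainer_args("corpus.txt", "seabpe_sketch")
print(args)
```

Calling `train_tokenizer("corpus.txt", "seabpe_sketch")` would run the training, given a suitable corpus file.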



## The Team

Lam Zhiwen Clarence<br>
Leong Weiqi<br>
Li Yier<br>
Liu Darius<br>
Lovenia Holy<br>
Montalan Jann Railey<br>
Ng Raymond<br>
Ngui Jian Gang<br>
Nguyen Ngan Thanh<br>
Ong Tat-Wee David<br>
Rengarajan Hamsawardhini<br>
Susanto Yosephine<br>
Tai Ngee Chia<br>
Tan Choon Meng<br>
Teo Jin Howe<br>
Teo Leslie<br>
Teo Wei Yi<br>
Tjhi William<br>
Yeo Yeow Tong<br>
Yong Xianbin<br>

## Contact

For more info, please contact us at seallm@aisingapore.org