rizputra committed
Commit 1101cb9 · verified · 1 Parent(s): f1eb5f7

Upload folder using huggingface_hub
.ipynb_checkpoints/README-checkpoint.md ADDED
@@ -0,0 +1,175 @@
---
license: mit
---

This is a version of the [sealion7b](https://huggingface.co/aisingapore/sealion7b) model, sharded into 2 GB chunks.

Please refer to the linked repository for details on usage, implementation, etc. This model was downloaded from the original repo and is redistributed under the same license.
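As a minimal loading sketch (not part of the original card; it assumes the standard `transformers` auto classes and that the placeholder below is replaced with this repository's actual id), the sharded checkpoint loads the same way as the original model:

```python
# Minimal loading sketch. Assumptions: transformers auto classes, and
# "path/to/this-sharded-repo" replaced with this repository's actual id.
# trust_remote_code is required because the model and tokenizer ship custom
# code (MPT architecture, SEABPETokenizer), as in the original sealion7b repo.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "path/to/this-sharded-repo"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

inputs = tokenizer("Southeast Asia is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
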
# SEA-LION

SEA-LION is a collection of Large Language Models (LLMs) which have been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
The models range in size from 3 billion to 7 billion parameters.
This is the card for the SEA-LION 7B base model.

SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.


## Model Details

### Model Description

The SEA-LION model is a significant leap forward in the field of Natural Language Processing,
specifically trained to understand the SEA regional context.

SEA-LION is built on the robust MPT architecture and has a vocabulary size of 256K.

For tokenization, the model employs our custom SEABPETokenizer, which is specially tailored for SEA languages, ensuring optimal model performance.

The training data for SEA-LION encompasses 980B tokens.

- **Developed by:** Products Pillar, AI Singapore
- **Funded by:** Singapore NRF
- **Model type:** Decoder
- **Languages:** English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, Lao
- **License:** MIT License

### Performance Benchmarks

SEA-LION achieves the following average performance on general English tasks (as measured by Hugging Face's LLM Leaderboard):

| Model       |  ARC  | HellaSwag | MMLU  | TruthfulQA | Average |
|-------------|:-----:|:---------:|:-----:|:----------:|:-------:|
| SEA-LION 7B | 39.93 |   68.51   | 26.87 |   35.09    |  42.60  |

## Training Details

### Data

SEA-LION was trained on 980B tokens of the following data; the short sketch after the table shows how the Multiplier and Percentage columns are derived:

| Data Source               | Unique Tokens | Multiplier | Total Tokens | Percentage |
|---------------------------|:-------------:|:----------:|:------------:|:----------:|
| RefinedWeb - English      | 571.3B        | 1          | 571.3B       | 58.20%     |
| mC4 - Chinese             | 91.2B         | 1          | 91.2B        | 9.29%      |
| mC4 - Indonesian          | 3.68B         | 4          | 14.7B        | 1.50%      |
| mC4 - Malay               | 0.72B         | 4          | 2.9B         | 0.29%      |
| mC4 - Filipino            | 1.32B         | 4          | 5.3B         | 0.54%      |
| mC4 - Burmese             | 1.2B          | 4          | 4.9B         | 0.49%      |
| mC4 - Vietnamese          | 63.4B         | 1          | 63.4B        | 6.46%      |
| mC4 - Thai                | 5.8B          | 2          | 11.6B        | 1.18%      |
| WangChanBERTa - Thai      | 5B            | 2          | 10B          | 1.02%      |
| mC4 - Lao                 | 0.27B         | 4          | 1.1B         | 0.12%      |
| mC4 - Khmer               | 0.97B         | 4          | 3.9B         | 0.40%      |
| mC4 - Tamil               | 2.55B         | 4          | 10.2B        | 1.04%      |
| the Stack - Python        | 20.9B         | 2          | 41.8B        | 4.26%      |
| the Stack - Javascript    | 55.6B         | 1          | 55.6B        | 5.66%      |
| the Stack - Shell         | 1.25B         | 2          | 2.5B         | 0.26%      |
| the Stack - SQL           | 6.4B          | 2          | 12.8B        | 1.31%      |
| the Stack - Markdown      | 26.6B         | 1          | 26.6B        | 2.71%      |
| RedPajama - StackExchange | 21.2B         | 1          | 21.2B        | 2.16%      |
| RedPajama - ArXiv         | 30.6B         | 1          | 30.6B        | 3.12%      |
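As a quick illustration (values copied from the table; the script itself is not part of the original card), each source's Total Tokens is its Unique Tokens times the Multiplier, and Percentage is that total over the roughly 980B grand total:

```python
# Illustrative check of the mixture columns (token counts in billions).
# Total = Unique x Multiplier; Percentage = Total / grand total.
rows = {
    "RefinedWeb - English": (571.3, 1),
    "mC4 - Indonesian": (3.68, 4),
    "mC4 - Thai": (5.8, 2),
    "RedPajama - ArXiv": (30.6, 1),
}
grand_total = 981.6  # sum of the Total Tokens column, ~980B

for name, (unique, multiplier) in rows.items():
    total = unique * multiplier
    print(f"{name}: {total:.1f}B ({100 * total / grand_total:.2f}%)")
```
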
### Infrastructure

SEA-LION was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
on the following hardware:

| Training Details     | SEA-LION 7B  |
|----------------------|:------------:|
| AWS EC2 p4d.24xlarge | 32 instances |
| Nvidia A100 40GB GPU | 256          |
| Training Duration    | 22 days      |


### Configuration

| HyperParameter    | SEA-LION 7B        |
|-------------------|:------------------:|
| Precision         | bfloat16           |
| Optimizer         | decoupled_adamw    |
| Scheduler         | cosine_with_warmup |
| Learning Rate     | 6.0e-5             |
| Global Batch Size | 2048               |
| Micro Batch Size  | 4                  |


## Technical Specifications

### Model Architecture and Objective

SEA-LION is a decoder model using the MPT architecture.

| Parameter       | SEA-LION 7B |
|-----------------|:-----------:|
| Layers          | 32          |
| d_model         | 4096        |
| head_dim        | 32          |
| Vocabulary      | 256000      |
| Sequence Length | 2048        |
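A rough parameter count can be recovered from these figures. The sketch below is an estimate only, assuming standard transformer blocks with an MLP expansion ratio of 4, tied input/output embeddings, and ignoring biases and norms (none of these assumptions are stated in the card):

```python
# Back-of-the-envelope parameter estimate from the architecture table above.
# Assumptions (not from the card): MLP expansion ratio 4, tied embeddings,
# biases and layer norms ignored.
d_model, n_layers, vocab = 4096, 32, 256_000

embedding = vocab * d_model                  # ~1.05B
per_layer = 4 * d_model**2 + 8 * d_model**2  # attention + MLP ~= 12 * d_model^2
total = embedding + n_layers * per_layer

print(f"~{total / 1e9:.1f}B parameters")     # ~7.5B, consistent with "SEA-LION 7B"
```
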
### Tokenizer Details

We sample 20M lines from the training data to train the tokenizer.<br>
The framework for training is [SentencePiece](https://github.com/google/sentencepiece).<br>
The tokenizer type is Byte-Pair Encoding (BPE).
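A small inspection sketch (not part of the original card; the repo id is a placeholder) showing how the custom tokenizer declared in `tokenizer_config.json` would be loaded:

```python
# Sketch of loading the custom SEABPETokenizer via the auto_map entry in
# tokenizer_config.json; trust_remote_code pulls in tokenization_SEA_BPE.py.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/this-repo",  # placeholder repo id
                                    trust_remote_code=True)
print(tok.vocab_size)                           # expected 256000, per the card
print(tok.tokenize("Selamat pagi, Singapura"))  # example SEA-language input
```
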
## The Team

Lam Wen Zhi Clarence<br>
Leong Wei Qi<br>
Li Yier<br>
Liu Bing Jie Darius<br>
Lovenia Holy<br>
Montalan Jann Railey<br>
Ng Boon Cheong Raymond<br>
Ngui Jian Gang<br>
Nguyen Thanh Ngan<br>
Ong Tat-Wee David<br>
Rengarajan Hamsawardhini<br>
Susanto Yosephine<br>
Tai Ngee Chia<br>
Tan Choon Meng<br>
Teo Jin Howe<br>
Teo Eng Sipp Leslie<br>
Teo Wei Yi<br>
Tjhi William<br>
Yeo Yeow Tong<br>
Yong Xianbin<br>

## Acknowledgements

AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.

## Contact

For more information, please contact us using this [SEA-LION Inquiry Form](https://forms.gle/sLCUVb95wmGf43hi6).

[Link to SEA-LION's GitHub repository](https://github.com/aisingapore/sealion)


## Disclaimer

This is the repository for the base model.
The model has _not_ been aligned for safety.
Developers and users should perform their own safety fine-tuning and related security measures.
In no event shall the authors be held liable for any claim, damages, or other liability
arising from the use of the released weights and code.


## References

```bibtex
@misc{lowphansirikul2021wangchanberta,
    title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
    author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
    year={2021},
    eprint={2101.09635},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
.ipynb_checkpoints/adapt_tokenizer-checkpoint.py ADDED
@@ -0,0 +1,40 @@
from typing import Any
from transformers import AutoTokenizer, PreTrainedTokenizerBase
NUM_SENTINEL_TOKENS: int = 100

def adapt_tokenizer_for_denoising(tokenizer: PreTrainedTokenizerBase) -> None:
    """Adds sentinel tokens and padding token (if missing).

    Expands the tokenizer vocabulary to include sentinel tokens
    used in mixture-of-denoiser tasks as well as a padding token.

    All added tokens are added as special tokens. No tokens are
    added if sentinel tokens and padding token already exist.
    """
    sentinels_to_add = [f'<extra_id_{i}>' for i in range(NUM_SENTINEL_TOKENS)]
    tokenizer.add_tokens(sentinels_to_add, special_tokens=True)
    if tokenizer.pad_token is None:
        tokenizer.add_tokens('<pad>', special_tokens=True)
        tokenizer.pad_token = '<pad>'
        assert tokenizer.pad_token_id is not None
    sentinels = ''.join([f'<extra_id_{i}>' for i in range(NUM_SENTINEL_TOKENS)])
    _sentinel_token_ids = tokenizer(sentinels, add_special_tokens=False).input_ids
    tokenizer.sentinel_token_ids = _sentinel_token_ids

class AutoTokenizerForMOD(AutoTokenizer):
    """AutoTokenizer + Adaptation for MOD.

    A simple wrapper around AutoTokenizer to make instantiating
    an MOD-adapted tokenizer a bit easier.

    MOD-adapted tokenizers have sentinel tokens (e.g., <extra_id_0>),
    a padding token, and a property to get the token ids of the
    sentinel tokens.
    """

    @classmethod
    def from_pretrained(cls, *args: Any, **kwargs: Any) -> PreTrainedTokenizerBase:
        """See `AutoTokenizer.from_pretrained` docstring."""
        tokenizer = super().from_pretrained(*args, **kwargs)
        adapt_tokenizer_for_denoising(tokenizer)
        return tokenizer
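
A brief usage sketch of the wrapper above (an illustration, not part of the commit; the module name and tokenizer id are placeholders):

```python
# Hypothetical usage of AutoTokenizerForMOD, assuming the file above is saved
# locally as adapt_tokenizer.py and a valid tokenizer id replaces the placeholder.
from adapt_tokenizer import AutoTokenizerForMOD

tokenizer = AutoTokenizerForMOD.from_pretrained("path/to/a-tokenizer")
print(len(tokenizer.sentinel_token_ids))  # 100 sentinel ids cached by the adapter
```
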
.ipynb_checkpoints/special_tokens_map-checkpoint.json ADDED
@@ -0,0 +1,4 @@
{
  "eos_token": "<|endoftext|>",
  "unk_token": "<unk>"
}
.ipynb_checkpoints/tokenizer-checkpoint.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d3243fc67ced759a4adcca01c0356f5b722057158e99d3cb9502c2572dbda0cf
size 132
.ipynb_checkpoints/tokenizer_config-checkpoint.json ADDED
@@ -0,0 +1,34 @@
{
  "add_bos_token": false,
  "add_eos_token": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "auto_map": {
    "AutoTokenizer": ["tokenization_SEA_BPE.SEABPETokenizer", null]
  },
  "bos_token": null,
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "legacy": true,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": null,
  "sp_model_kwargs": {},
  "tokenizer_class": "SEABPETokenizer",
  "unk_token": "<unk>"
}
tokenizer.model CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d3243fc67ced759a4adcca01c0356f5b722057158e99d3cb9502c2572dbda0cf
-size 132
+oid sha256:c0c576972c98fa150efff77f61a30b46afbc1247ff4697f39e51e90d0a8b2190
+size 4569957
tokenizer_config.json CHANGED
@@ -20,7 +20,10 @@
     }
   },
   "auto_map": {
-    "AutoTokenizer": ["tokenization_SEA_BPE.SEABPETokenizer", null]
+    "AutoTokenizer": [
+      "aisingapore/sealion7b--tokenization_SEA_BPE.SEABPETokenizer",
+      null
+    ]
   },
   "bos_token": null,
   "clean_up_tokenization_spaces": false,