LidongBing committed
Commit 2230da3 · 1 Parent(s): cd0e8a4

Update README.md
Files changed (1)
  1. README.md +19 -18
README.md CHANGED
@@ -17,30 +17,30 @@ inference: false
17
  <a href="https://github.com/SeaLLMs/SeaLLMs" target="_blank" rel="noopener">Github</a>
18
  </p>
19
 
20
- We introduce SeaLLM - a family of language models optimized for Southeast Asian (SEA) languages. The SeaLLM-base models (to be released) were pre-trained from [Llama-2](https://huggingface.co/meta-llama/Llama-2-13b-hf), on a tailored publicly-available dataset, which comprises mainly Vietnamese 🇻🇳, Indonesian 🇮🇩 and Thai 🇹🇭 texts, along with those in English 🇬🇧 and Chinese 🇨🇳. The pre-training stage involves multiple stages with dynamic data control to preserve the original knowledge base of Llama-2 while gaining new abilities in SEA languages.
21
 
22
- The [SeaLLM-chat](https://huggingface.co/spaces/SeaLLMs/SeaLLM-Chat-13b) model underwent supervised finetuning (SFT) on a mix of public instruction data (e.g. [OpenORCA](https://huggingface.co/datasets/Open-Orca/OpenOrca)) and a small amount of queries used by SEA language native speakers in natural settings, which **adapt to the local cultural norms, customs, styles and laws in these areas**, as well as other SFT enhancement techniques (to be revealed later).
23
 
24
- Our customized SFT process helps enhance our models' ability to understand, respond and serve communities whose languages are often neglected by previous [English-dominant LLMs](https://arxiv.org/abs/2307.09288), while outperforming existing polyglot LLMs, like [BLOOM](https://arxiv.org/abs/2211.05100) or [PolyLM](https://arxiv.org/pdf/2307.06018.pdf).
25
 
26
- Our [first released SeaLLM](https://huggingface.co/spaces/SeaLLMs/SeaLLM-Chat-13b) supports Vietnamese 🇻🇳, Indonesian 🇮🇩 and Thai 🇹🇭. Future versions endeavor to cover all languages spoken in Southeast Asia.
27
 
28
  - DEMO: [SeaLLMs/SeaLLM-Chat-13b](https://huggingface.co/spaces/SeaLLMs/SeaLLM-Chat-13b)
29
  - Model weights: To be released.
30
  - Technical report: To be released.
31
 
32
  <blockquote style="color:red">
33
- <p><strong style="color: red">Terms of Use</strong>: By using our released weights, codes and demos, you agree to and comply with the following terms and conditions:</p>
34
  <ul>
35
  <li>Follow Llama-2 <a rel="noopener nofollow" href="https://ai.meta.com/llama/license/">License</a> and <a rel="noopener nofollow" href="https://ai.meta.com/llama/use-policy/">Terms of Use</a>.</li>
36
- <li>Strictly comply with the local regulations from where you operate, and not attempt to generate or elicit content that are locally or internationally illegal and inappropriate from our models.</li>
37
  </ul>
38
  </blockquote>
39
 
40
  > **Disclaimer**:
41
- > We must note that even though the weights, codes and demos are released in an open manner, similar to other pre-trained language models, and despite our best effort in red teaming and safety finetuning and enforcement, our models come with potential risks. These risks are influenced by various complex factors, including but not limited to over-diversified, inaccurate, misleading or potentially harmful generation.
42
  > Developers and stakeholders should perform their own red teaming and provide related security measures before deployment, and they must abide by and comply with local governance and regulations.
43
- > In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights, codes or demos.
44
 
45
  > The logo was generated by DALL-E 3.
46
 
@@ -49,11 +49,11 @@ The following sections summarize the [Pre-training](#pre-training), [Supervised-
49
  ## Pre-training
50
 
51
  ### Vocabulary Expansion
52
- Like many English/Latin-dominant LLMs, Llama-2's BPE tokenizer breaks non-European and non-latin linguistic texts into unsustainably long byte-level sequences that cover much shorter semantic meanings, leading to [degraded performance](https://arxiv.org/abs/2306.11372). For instance, it takes 4.3x more tokens to encode the same sentence in Thai compared to that in English. This leads to the models failing to perform summarization and comprehension tasks without exceeding the context length.
53
 
54
- Our goal for vocabulary expansion is threefold: (1) the number of newly-added tokens must be minimal and only cover the new languages, (2) the tokens should bring the compression ratios of new languages close to that of English, and (3) minimize the disruption of existing European tokens to preserve Llama-2 knowledge. In the end, we obtain **~11K** new tokens for Vi, Id, Th and Zh to augment the original 32000-token vocabulary. Details of our expansion technique will be revealed in our upcoming technical report.
55
 
56
- As seen in the table below, our new vocabulary reduce the compression ratio from 4.29 to 1.57 for Thai - meaning it can now encode 2.7x longer Thai text given the same context length. Meanwhile, English is only compressed by 0.3%, thus preserving its integrity.
57
 
58
  |Language | Llama's ratio | Our ratio | # New tokens
59
  | --- | --- | --- | --- |
@@ -70,9 +70,9 @@ As seen in the table below, our new vocabulary reduce the compression ratio from
70
 
71
  ### Pre-training Strategies
72
 
73
- We conduct pre-training in 4 different stages. Each stage serves a different specific objective and involves dynamic control of (unsupervised and supervised) data mixture, as well as data specification and categorization. We also employ a novel sequence construction and masking techniques during these stages. More details are to be provided in the technical report.
74
 
75
- As our goal is for Llama-2 to learn new languages with the least number tokens and computing resources, we control an appropriate data mix of new (Vi, Id & Th) and old (En, Zh) languages so that the new vocabulary and knowledge are trained quickly, while relatively maintaining the performance of the original Llama-2 model and establishing a knowledge bridge between new and existing languages.
76
 
77
  We pre-train our SeaLLM-base in ~4 weeks on 32 GPUs, clocking ~150B tokens.
78
 
@@ -82,7 +82,7 @@ We pre-train our SeaLLM-base in ~4 weeks on 32gpus, clocking ~150B tokens.
82
 
83
  Our supervised finetuning (SFT) data consists of many categories. The largest of them are public and open-source, such as [OpenORCA](https://huggingface.co/datasets/Open-Orca/OpenOrca) and [Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus). As the aforementioned are monolingual, we employ several established or novel automatic techniques to gather more instruction data for SEA languages.
84
 
85
- Even more noteworthy is that we engaged native speakers to collect a small amount of queries used by SEA-languages native speakers in natural settings, which helps in adaptation to the local cultural customs, norms and laws. We also collect country-relevant safety data that cover many culturally and legally sensitive topics in each of these SEA countries - such data tend to be ignored, or may even appear in conflict with western safety data. Therefore, we believe that our models are more local-friendly and abide by local rules to a higher degree.
86
 
87
  ### SFT Strategies
88
 
@@ -109,7 +109,7 @@ We use GPT-4 as an evaluator to rate the comparison between our models versus Ch
109
  ### M3Exam - World Knowledge in Regional Languages
110
 
111
 
112
- [M3Exam](https://arxiv.org/pdf/2306.05179.pdf) is a collection of real-life and native official human exam questions benchmarks. This benchmark cover questions from multiple countries in the SEA region, which require strong multilingual proficiency and cultural knowledge across various critical educational periods, from primary- to high-school levels of difficulty.
113
 
114
  As shown in the table, our SeaLLM model outperforms most 13B baselines and reaches closer to ChatGPT's performance. Notably, for Thai - a seemingly low-resource language, our model is just 1% behind ChatGPT despite the large size difference.
115
 
@@ -162,7 +162,7 @@ As shown in the table below, the 1-shot reading comprehension performance is sig
162
 
163
  For translation tasks, we evaluate our models with the [FloRes-200](https://github.com/facebookresearch/flores/blob/main/flores200/README.md) using [chrF++](https://aclanthology.org/W15-3049/) scores in 4-shot settings.
164
 
165
- Similarly observed, our SeaLLM models outperforms Llama-2 significantly in the new languages.
166
 
167
  | FloRes-200 (chrF++) | En-Zh | En-Vi | En-Id | En-Th | En->X | Zh-En | Vi-En | Id-En | Th-En | X->En
168
  |-------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
@@ -171,7 +171,7 @@ Similarly observed, our SeaLLM models outperforms Llama-2 significantly in the n
171
  | SeaLLM-13b-chat-v1 | 22.77 | 58.96 | 64.78 | 42.38 | 55.37 | 53.20 | 60.29 | 65.03 | 57.24 | 60.85
172
  | SeaLLM-13b-chat-v2 | 22.75 | 58.78 | 65.90 | 42.60 | 55.76 | 53.34 | 60.80 | 65.44 | 57.05 | 61.10
173
 
174
- Our models are also performing competitively with ChatGPT for translation between SEA languages without English-pivoting.
175
 
176
  | FloRes-200 (chrF++) | Vi-Id | Id-Vi | Vi-Th | Th-Vi | Id-Th | Th-Id
177
  |-------- | ---- | ---- | ---- | ---- | ---- | ---- |
@@ -181,7 +181,7 @@ Our models are also performing competitively with ChatGPT for translation betwee
181
 
182
  #### Summarization
183
 
184
- Lastly, in 2-shot [XL-sum summarization tasks](https://aclanthology.org/2021.findings-acl.413/), our models also achieve better performance, with substantial gains in Thai.
185
 
186
  | XL-Sum (rouge-L) | En | Zh | Vi | Id | Th
187
  |-------- | ---- | ---- | ---- | ---- | ---- |
@@ -189,6 +189,7 @@ Lastly, in 2-shot [XL-sum summarization tasks](https://aclanthology.org/2021.fin
189
  | Llama-2-13b-chat | 25.11 | 31.13 | 18.29 | 22.45 | 17.51
190
  | SeaLLM-13b-chat-v2 | 27.00 | 33.31 | 20.31 | 25.69 | 21.97
191
 
 
192
 
193
  ## Citation
194
 
 
17
  <a href="https://github.com/SeaLLMs/SeaLLMs" target="_blank" rel="noopener">Github</a>
18
  </p>
19
 
20
+ We introduce SeaLLMs - a family of language models optimized for Southeast Asian (SEA) languages. The SeaLLM-base models (to be released) were pre-trained from [Llama-2](https://huggingface.co/meta-llama/Llama-2-13b-hf), on a tailored publicly-available dataset, which comprises mainly Vietnamese 🇻🇳, Indonesian 🇮🇩 and Thai 🇹🇭 texts, along with those in English 🇬🇧 and Chinese 🇨🇳. Pre-training involves multiple stages with dynamic data control to preserve the original knowledge base of Llama-2 while gaining new abilities in SEA languages.
21
 
22
+ The [SeaLLM-chat](https://huggingface.co/spaces/SeaLLMs/SeaLLM-Chat-13b) model underwent supervised finetuning (SFT) on a mix of public instruction data (e.g. [OpenORCA](https://huggingface.co/datasets/Open-Orca/OpenOrca)) and a small number of queries used by SEA-language native speakers in natural settings, which **adapts the models to the local cultural norms, customs, styles, and laws in these areas**, as well as other SFT enhancement techniques (to be revealed later).
23
 
24
+ Our customized SFT process helps enhance our models' ability to understand, respond, and serve communities whose languages are often neglected by previous [English-dominant LLMs](https://arxiv.org/abs/2307.09288), while outperforming existing polyglot LLMs, like [BLOOM](https://arxiv.org/abs/2211.05100) or [PolyLM](https://arxiv.org/pdf/2307.06018.pdf).
25
 
26
+ Our [first released SeaLLM](https://huggingface.co/spaces/SeaLLMs/SeaLLM-Chat-13b) supports Vietnamese 🇻🇳, Indonesian 🇮🇩, and Thai 🇹🇭. Future versions endeavor to cover all languages spoken in Southeast Asia.
27
 
28
  - DEMO: [SeaLLMs/SeaLLM-Chat-13b](https://huggingface.co/spaces/SeaLLMs/SeaLLM-Chat-13b)
29
  - Model weights: To be released.
30
  - Technical report: To be released.
31
 
32
  <blockquote style="color:red">
33
+ <p><strong style="color: red">Terms of Use</strong>: By using our released weights, codes, and demos, you agree to and comply with the following terms and conditions:</p>
34
  <ul>
35
  <li>Follow Llama-2 <a rel="noopener nofollow" href="https://ai.meta.com/llama/license/">License</a> and <a rel="noopener nofollow" href="https://ai.meta.com/llama/use-policy/">Terms of Use</a>.</li>
36
+ <li>Strictly comply with the local regulations from where you operate, and do not attempt to generate or elicit content that is locally or internationally illegal and inappropriate from our models.</li>
37
  </ul>
38
  </blockquote>
39
 
40
  > **Disclaimer**:
41
+ > We must note that even though the weights, codes, and demos are released in an open manner, similar to other pre-trained language models, and despite our best efforts in red teaming and safety finetuning and enforcement, our models come with potential risks. These risks are influenced by various complex factors, including but not limited to over-diversified, inaccurate, misleading or potentially harmful generation.
42
  > Developers and stakeholders should perform their own red teaming and provide related security measures before deployment, and they must abide by and comply with local governance and regulations.
43
+ > In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights, codes, or demos.
44
 
45
  > The logo was generated by DALL-E 3.
46
 
 
49
  ## Pre-training
50
 
51
  ### Vocabulary Expansion
52
+ Like many English/Latin-dominant LLMs, Llama-2's BPE tokenizer breaks non-European and non-Latin linguistic texts into unsustainably long byte-level sequences that cover much shorter semantic meanings, leading to [degraded performance](https://arxiv.org/abs/2306.11372). For instance, it takes 4.3x more tokens to encode the same sentence in Thai compared to that in English. This leads to the models failing to perform summarization and comprehension tasks without exceeding the context length.
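For illustration, a minimal sketch of how such token inflation can be observed with the released Llama-2 tokenizer; the corpus and exact definition behind the reported 4.3x figure are not published, so the sentence pair and the resulting number below are only indicative.

```python
# Rough illustration of the token-inflation problem: count Llama-2 tokens for roughly
# equivalent English and Thai sentences. The sentence pair is a placeholder, and the
# meta-llama repository is gated (an accepted licence is required to download it).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

en = "Please summarize the following article in three sentences."
th = "โปรดสรุปบทความต่อไปนี้ให้เหลือสามประโยค"  # rough Thai equivalent of the sentence above

n_en = len(tok.encode(en, add_special_tokens=False))
n_th = len(tok.encode(th, add_special_tokens=False))
print(f"English: {n_en} tokens | Thai: {n_th} tokens | ratio ≈ {n_th / n_en:.1f}x")
```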
53
 
54
+ Our goal for vocabulary expansion is threefold: (1) the number of newly-added tokens must be minimal and only cover the new languages, (2) the tokens should bring the compression ratios of new languages close to that of English, and (3) minimize the disruption of existing European tokens to preserve Llama-2 knowledge. In the end, we obtain **~11K** new tokens for Vi, Id, Th, and Zh to augment the original 32000-token vocabulary. Details of our expansion technique will be revealed in our upcoming technical report.
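The expansion technique itself is deferred to the technical report. Purely as an illustration of the mechanics (not the authors' method), grafting a list of new tokens onto Llama-2 with `transformers` could look like the sketch below, where `sea_tokens.txt` is a hypothetical file holding learned Vi/Id/Th/Zh pieces.

```python
# Illustrative only: the actual expansion technique will appear in the technical report.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-13b-hf"  # gated repo; requires an accepted licence
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# `sea_tokens.txt` is a hypothetical file with one new token per line.
with open("sea_tokens.txt", encoding="utf-8") as f:
    new_tokens = [line.strip() for line in f if line.strip()]

n_added = tok.add_tokens(new_tokens)        # extends the original 32,000-token vocabulary
model.resize_token_embeddings(len(tok))     # new embedding rows are freshly initialised
print(f"Added {n_added} tokens; vocabulary size is now {len(tok)}")
```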
55
 
56
+ As seen in the table below, our new vocabulary reduces the compression ratio from 4.29 to 1.57 for Thai - meaning it can now encode 2.7x longer Thai text given the same context length. Meanwhile, English is only compressed by 0.3%, thus preserving its integrity.
57
 
58
  |Language | Llama's ratio | Our ratio | # New tokens
59
  | --- | --- | --- | --- |
 
70
 
71
  ### Pre-training Strategies
72
 
73
+ We conduct pre-training in 4 different stages. Each stage serves a different specific objective and involves dynamic control of (unsupervised and supervised) data mixture, as well as data specification and categorization. We also employ novel sequence construction and masking techniques during these stages. More details are to be provided in the technical report.
74
 
75
+ As our goal is for Llama-2 to learn new languages with the least number of tokens and computing resources, we control an appropriate data mix of new (Vi, Id & Th) and old (En, Zh) languages so that the new vocabulary and knowledge are trained quickly, while relatively maintaining the performance of the original Llama-2 model and establishing a knowledge bridge between new and existing languages.
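The exact stage schedule and sampling weights are not public. As a rough sketch of the idea only, probability-controlled mixing of language streams can be expressed with the `datasets` library; the corpus files and probabilities below are placeholders.

```python
# Sketch of probability-controlled language mixing for one pre-training stage.
from datasets import interleave_datasets, load_dataset

langs = ["en", "zh", "vi", "id", "th"]
corpora = {
    lang: load_dataset("text", data_files=f"{lang}_corpus.txt", split="train", streaming=True)
    for lang in langs
}

# Example stage: up-weight the new languages while keeping some En/Zh to limit forgetting.
probs = {"en": 0.15, "zh": 0.10, "vi": 0.30, "id": 0.25, "th": 0.20}
mixed = interleave_datasets(
    [corpora[lang] for lang in langs],
    probabilities=[probs[lang] for lang in langs],
    seed=42,
)

# Peek at the first few mixed examples.
for i, example in enumerate(mixed):
    if i >= 3:
        break
    print(example["text"][:80])
```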
76
 
77
  We pre-train our SeaLLM-base in ~4 weeks on 32 GPUs, clocking ~150B tokens.
78
 
 
82
 
83
  Our supervised finetuning (SFT) data consists of many categories. The largest of them are public and open-source, such as [OpenORCA](https://huggingface.co/datasets/Open-Orca/OpenOrca) and [Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus). As the aforementioned are monolingual, we employ several established or novel automatic techniques to gather more instruction data for SEA languages.
84
 
85
+ Even more noteworthy is that we engaged native speakers to collect a small number of queries used by SEA-language native speakers in natural settings, which helps in adaptation to the local cultural customs, norms, and laws. We also collect country-relevant safety data that cover many culturally and legally sensitive topics in each of these SEA countries - such data tend to be ignored, or may even appear in conflict with Western safety data. Therefore, we believe that our models are more local-friendly and abide by local rules to a higher degree.
86
 
87
  ### SFT Strategies
88
 
 
109
  ### M3Exam - World Knowledge in Regional Languages
110
 
111
 
112
+ [M3Exam](https://arxiv.org/pdf/2306.05179.pdf) is a collection of real-life and native official human exam question benchmarks. This benchmark covers questions from multiple countries in the SEA region, which require strong multilingual proficiency and cultural knowledge across various critical educational periods, from primary- to high-school levels of difficulty.
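A hypothetical sketch of multiple-choice scoring on an M3Exam-style question is shown below; the authors' prompt format, few-shot setup, and answer-extraction rule are not specified here, and both the model name and `question.json` are placeholders (SeaLLM weights are not yet released).

```python
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-13b-hf"  # stand-in model; SeaLLM weights are not yet released
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

# question.json (placeholder): {"question": "...", "options": ["A. ...", "B. ...", "C. ...", "D. ..."]}
q = json.load(open("question.json", encoding="utf-8"))
prompt = q["question"] + "\n" + "\n".join(q["options"]) + "\nAnswer:"

enc = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    next_token_logits = model(**enc).logits[0, -1]  # distribution after "Answer:"
logprobs = torch.log_softmax(next_token_logits, dim=-1)

# Score each option letter by its next-token log-probability and pick the best one.
scores = {}
for letter in ["A", "B", "C", "D"]:
    token_id = tok.encode(" " + letter, add_special_tokens=False)[-1]
    scores[letter] = logprobs[token_id].item()
print("Predicted answer:", max(scores, key=scores.get))
```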
113
 
114
  As shown in the table, our SeaLLM model outperforms most 13B baselines and reaches closer to ChatGPT's performance. Notably, for Thai - a seemingly low-resource language, our model is just 1% behind ChatGPT despite the large size difference.
115
 
 
162
 
163
  For translation tasks, we evaluate our models with the [FloRes-200](https://github.com/facebookresearch/flores/blob/main/flores200/README.md) using [chrF++](https://aclanthology.org/W15-3049/) scores in 4-shot settings.
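For reference, chrF++ (character n-grams plus word uni- and bi-grams) can be computed with the `sacrebleu` package; the hypothesis and reference strings below are placeholders, not FloRes-200 data.

```python
# Minimal chrF++ example with sacrebleu.
from sacrebleu.metrics import CHRF

chrf_pp = CHRF(word_order=2)  # word_order=2 turns chrF into chrF++

hypotheses = ["Hôm nay trời đẹp."]          # system translations (placeholders)
references = [["Hôm nay thời tiết đẹp."]]   # one reference stream, parallel to hypotheses

print(chrf_pp.corpus_score(hypotheses, references))
```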
164
 
165
+ Similarly observed, our SeaLLM models outperform Llama-2 significantly in the new languages.
166
 
167
  | FloRes-200 (chrF++) | En-Zh | En-Vi | En-Id | En-Th | En->X | Zh-En | Vi-En | Id-En | Th-En | X->En
168
  |-------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
 
171
  | SeaLLM-13b-chat-v1 | 22.77 | 58.96 | 64.78 | 42.38 | 55.37 | 53.20 | 60.29 | 65.03 | 57.24 | 60.85
172
  | SeaLLM-13b-chat-v2 | 22.75 | 58.78 | 65.90 | 42.60 | 55.76 | 53.34 | 60.80 | 65.44 | 57.05 | 61.10
173
 
174
+ Our models are also performing competitively with ChatGPT for translation between SEA languages without English pivoting.
175
 
176
  | FloRes-200 (chrF++) | Vi-Id | Id-Vi | Vi-Th | Th-Vi | Id-Th | Th-Id
177
  |-------- | ---- | ---- | ---- | ---- | ---- | ---- |
 
181
 
182
  #### Summarization
183
 
184
+ Lastly, in 2-shot [XL-sum summarization tasks](https://aclanthology.org/2021.findings-acl.413/), our models also achieve a better performance, with substantial gains in Thai.
185
 
186
  | XL-Sum (rouge-L) | En | Zh | Vi | Id | Th
187
  |-------- | ---- | ---- | ---- | ---- | ---- |
 
189
  | Llama-2-13b-chat | 25.11 | 31.13 | 18.29 | 22.45 | 17.51
190
  | SeaLLM-13b-chat-v2 | 27.00 | 33.31 | 20.31 | 25.69 | 21.97
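For reference, the ROUGE-L metric used in the XL-Sum table above can be computed with the `rouge_score` package. Its default tokenization is geared towards Latin-script text, and the exact preprocessing used for Thai or Vietnamese is not specified, so treat this as an indicative sketch with placeholder strings.

```python
# Minimal ROUGE-L example with the rouge_score package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

reference = "Placeholder reference summary from XL-Sum."
prediction = "Placeholder model-generated summary."

score = scorer.score(reference, prediction)["rougeL"]
print(f"ROUGE-L F1: {score.fmeasure:.4f}")
```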
191
 
192
+ ## Acknowledgement to our linguists
193
 
194
  ## Citation
195