Update README.md
README.md CHANGED
@@ -37,15 +37,12 @@ Our [first released SeaLLM](https://huggingface.co/spaces/SeaLLMs/SeaLLM-Chat-13
 - Technical report: To be released.
 
 <blockquote style="color:red">
-<p><strong style="color: red">Terms of Use</strong>:
-<ul>
-<li>Follow LLama-2 <a rel="noopener nofollow" href="https://ai.meta.com/llama/license/">License</a> and <a rel="noopener nofollow" href="https://ai.meta.com/llama/use-policy/">Terms of Use</a>.</li>
-<li>Strictly comply with the local regulations from where you operate, and do not attempt to generate or elicit content that is locally or internationally illegal and inappropriate from our models.</li>
-</ul>
+<p><strong style="color: red">Terms of Use and License</strong>:
+By using our released weights, codes, and demos, you agree to and comply with the terms and conditions specified in our [SeaLLMs Terms Of Use](https://huggingface.co/SeaLLMs/SeaLLM-Chat-13b/edit/main/LICENSE).
 </blockquote>
 
 > **Disclaimer**:
-> We must note that even though the weights, codes, and demos are released in an open manner, similar to other pre-trained language models, and despite our best efforts in red teaming and safety finetuning and enforcement, our models come with potential risks. These risks are influenced by various complex factors, including but not limited to
+> We must note that even though the weights, codes, and demos are released in an open manner, similar to other pre-trained language models, and despite our best efforts in red teaming and safety finetuning and enforcement, our models come with potential risks. These risks are influenced by various complex factors, including but not limited to inaccurate, misleading or potentially harmful generation.
 > Developers and stakeholders should perform their own red teaming and provide related security measures before deployment, and they must abide by and comply with local governance and regulations.
 > In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights, codes, or demos.
 
@@ -108,24 +105,32 @@ One of the most reliable ways to compare chatbot models is peer comparison.
 With the help of native speakers, we built an instruction test set that focuses on various aspects expected in a user-facing chatbot, namely:
 (1) NLP tasks (e.g. translation & comprehension), (2) Reasoning, (3) Instruction-following and
 (4) Natural and Informal questions. The test set also covers all languages that we are concerned with.
-
 We use GPT-4 as an evaluator to rate the comparison between our models versus ChatGPT-3.5 and other baselines.
 
+Compared with Llama-2-13b-chat, our SeaLLM-13b performs significantly better across all SEA languages,
+even though Llama-2 was already trained on a decent amount of Vi, Id, and Th data.
+In English, our model is 46% as good as Llama-2-13b-chat, even though it did not undergo complex, labor-intensive RLHF.
+
+<img src="seallm_vs_llama2_by_lang.png" width="800" />
+
+<img src="seallm_vs_llama2_by_cat_sea.png" width="800" />
+
+Compared with ChatGPT-3.5, our SeaLLM-13b model is 45% as good as ChatGPT for Thai.
+For important aspects such as Safety and Task-Solving, our model is nearly on par with ChatGPT across these languages.
+
 <img src="seallm_vs_chatgpt_by_lang.png" width="800" />
 
 <img src="seallm_vs_chatgpt_by_cat_sea.png" width="800" />
 
 
-<img src="seallm_vs_llama2_by_lang.png" width="800" />
-
-<img src="seallm_vs_llama2_by_cat_sea.png" width="800" />
 
 ### M3Exam - World Knowledge in Regional Languages
 
 
 [M3Exam](https://arxiv.org/pdf/2306.05179.pdf) is a collection of real-life and native official human exam question benchmarks. This benchmark covers questions from multiple countries in the SEA region, which require strong multilingual proficiency and cultural knowledge across various critical educational periods, from primary- to high-school levels of difficulty.
 
-As shown in the table, our SeaLLM model outperforms most 13B baselines and reaches closer to ChatGPT's performance.
+As shown in the table, our SeaLLM model outperforms most 13B baselines and comes closer to ChatGPT's performance.
+Notably, for Thai, a seemingly low-resource language, our model is just 1% behind ChatGPT despite the large size difference.
 
 
 | M3Exam / 3-shot (Acc) | En | Zh | Vi | Id | Th
@@ -135,8 +140,7 @@ As shown in the table, our SeaLLM model outperforms most 13B baselines and reach
 | Llama-2-13b | 59.88 | 43.40 | 41.70 | 34.80 | 23.18
 | Llama-2-13b-chat | 61.17 | 43.29 | 39.97 | 35.50 | 23.74
 | Polylm-13b-chat | 32.23 | 29.26 | 29.01 | 25.36 | 18.08
-
-| SeaLLM-13b-chat | 63.53 | 46.31 | 49.25 | 40.61 | 36.30
+| SeaLLM-13b-chat | **63.53** | **46.31** | **49.25** | **40.61** | **36.30**
 
 
 ### MMLU - Preserving English-based knowledge
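
To make the GPT-4-as-evaluator setup above concrete, here is a minimal sketch of one pairwise judging call. Only the idea of GPT-4 rating our model against a baseline comes from the README; the prompt wording, the 1-10 scale, and the `judge_pair` helper are illustrative assumptions.

```python
# Illustrative GPT-4-as-judge sketch. The prompt and 1-10 scale are
# assumptions; only "GPT-4 rates our model vs a baseline" is from the README.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """Rate the two answers to the question below on a 1-10 scale
for helpfulness, correctness, and whether the answer stays in the question's
language. Reply with exactly two comma-separated integers: scoreA,scoreB

Question: {q}
Answer A: {a}
Answer B: {b}"""

def judge_pair(q: str, a: str, b: str) -> tuple[int, int]:
    """Return (score_a, score_b) for two candidate answers, as judged by GPT-4."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # make judging as deterministic as possible
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(q=q, a=a, b=b)}],
    )
    s_a, s_b = resp.choices[0].message.content.strip().split(",")
    return int(s_a), int(s_b)

# In practice each pair is typically judged twice with A/B swapped,
# to cancel out position bias in the judge.
```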
@@ -147,7 +151,7 @@ On the 5-shot [MMLU](https://arxiv.org/abs/2009.03300), our SeaLLM models not on
 |-----------| ------- | ------- | ------- | ------- | ------- |
 | Llama-2-13b | 44.1 | 52.8 | 62.6 | 61.1 | 54.8
 | Llama-2-13b-chat | 43.7 | 49.3 | 62.6 | 60.1 | 53.5
-| SeaLLM-13b-chat | 43.4 | 53.0 | 63.3 | 61.4 | 55.1
+| SeaLLM-13b-chat | 43.4 | **53.0** | **63.3** | **61.4** | **55.1**
 
 
 ### NLP tasks
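
M3Exam (3-shot) and MMLU (5-shot) above are both k-shot multiple-choice benchmarks scored by accuracy. Below is a minimal sketch of how such scoring is commonly run with `transformers`; the prompt layout, the A-D option letters, and the single-token assumption are ours, and the checkpoint id mirrors the Hugging Face repo linked above.

```python
# Sketch of k-shot multiple-choice scoring (M3Exam uses k=3, MMLU k=5).
# Prompt layout is an assumption, not the authors' published harness.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "SeaLLMs/SeaLLM-Chat-13b"  # repo name taken from the license link above
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto")

def kshot_choice(exemplars: list[tuple[str, str]], question: str) -> str:
    """Prepend k solved exemplars, then pick the option letter with the
    highest next-token logit for the held-out question."""
    prompt = "".join(f"Question: {q}\nAnswer: {a}\n\n" for q, a in exemplars)
    prompt += f"Question: {question}\nAnswer:"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_logits = model(**inputs).logits[0, -1]  # next-token distribution
    letters = ["A", "B", "C", "D"]
    # assumes each " X" option letter encodes to a single token
    ids = [tok.encode(f" {c}", add_special_tokens=False)[0] for c in letters]
    return letters[int(next_logits[ids].argmax())]
```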
@@ -162,9 +166,9 @@ As shown in the table below, the 1-shot reading comprehension performance is sig
 
 | XQUAD/IndoQA (F1) | En | Zh | Vi | Id | Th | ALL | SEA-lang
 |-----------| ------- | ------- | ------- | ------- | ------- | ------- | ------- |
-| Llama-2-13b | 83.22 | 78.02 | 71.03 | 59.31 | 30.73 | 64.46 | 59.77
+| Llama-2-13b | **83.22** | **78.02** | 71.03 | 59.31 | 30.73 | 64.46 | 59.77
 | Llama-2-13b-chat | 80.46 | 70.54 | 62.87 | 63.05 | 25.73 | 60.93 | 51.21
-| SeaLLM-13b-chat | 75.23 | 75.65 | 72.86 | 64.37 | 61.37 | 69.90 | 66.20
+| SeaLLM-13b-chat | 75.23 | 75.65 | **72.86** | **64.37** | **61.37** | **69.90** | **66.20**
 
 
 #### Translation
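
The F1 above is the standard SQuAD-style token-overlap score between predicted and reference answer spans. A self-contained sketch follows; whitespace tokenization is a simplification (Thai normally needs a language-specific tokenizer, and the official script also strips punctuation and articles):

```python
# Minimal SQuAD-style token-overlap F1, the metric behind the XQUAD/IndoQA
# numbers above. Whitespace tokenization is a simplification.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_toks = prediction.lower().split()
    ref_toks = reference.lower().split()
    common = Counter(pred_toks) & Counter(ref_toks)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(ref_toks)
    return 2 * precision * recall / (precision + recall)

print(token_f1("ở Hà Nội", "tại Hà Nội"))  # ~0.67: 2 of 3 tokens overlap
```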
@@ -176,9 +180,9 @@ Similarly observed, our SeaLLM models outperform Llama-2 significantly in the ne
 
 | FloRes-200 (chrF++) | En-Zh | En-Vi | En-Id | En-Th | En->X | Zh-En | Vi-En | Id-En | Th-En | X->En
 |-------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| Llama-2-13b | 24.36 | 53.20 | 60.41 | 22.16 | 45.26 | 53.20 | 59.10 | 63.42 | 38.48 | 53.55
+| Llama-2-13b | **24.36** | 53.20 | 60.41 | 22.16 | 45.26 | 53.20 | 59.10 | 63.42 | 38.48 | 53.55
 | Llama-2-13b-chat | 19.58 | 51.70 | 57.14 | 21.18 | 37.40 | 52.27 | 54.32 | 60.55 | 30.18 | 49.33
-| SeaLLM-13b-chat | 23.12 | 59.00 | 66.16 | 43.33 | 47.91 | 53.67 | 60.93 | 65.66 | 57.39 | 59.42
+| SeaLLM-13b-chat | 23.12 | **59.00** | **66.16** | **43.33** | **47.91** | **53.67** | **60.93** | **65.66** | **57.39** | **59.42**
 
 Our models are also performing competitively with ChatGPT for translation between SEA languages without English pivoting.
 
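
chrF++ above is the character n-gram F-score extended with word n-grams, which `sacrebleu` computes when `word_order=2`. A minimal sketch with made-up sentences rather than FloRes-200 data:

```python
# Computing chrF++ with sacrebleu. word_order=2 adds word 1- and 2-grams
# to the character n-grams, turning chrF into chrF++.
from sacrebleu.metrics import CHRF

chrfpp = CHRF(word_order=2)

# Made-up example sentences, NOT FloRes-200 data.
hypotheses = ["Xin chào thế giới .", "Hẹn gặp lại !"]
references = [["Chào thế giới .", "Hẹn gặp lại nhé !"]]  # one reference stream

print(chrfpp.corpus_score(hypotheses, references))  # e.g. "chrF2++ = ..."
```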
@@ -201,7 +205,7 @@ Lastly, in 2-shot [XL-sum summarization tasks](https://aclanthology.org/2021.fin
 
 ## Citation
 
-If you find our project useful, hope you can star our repo and cite our work as follows:
+If you find our project useful, we hope you will star our repo and cite our work as follows. Corresponding author: [l.bing@alibaba-inc.com](mailto:l.bing@alibaba-inc.com)
 
 ```
 @article{damonlpsg2023seallm,