ChloeAuYeung committed on
Commit 3f1c9ac
1 Parent(s): 6dbe66b

Update README.md

Files changed (1)
  1. README.md +26 -26
README.md CHANGED
@@ -147,19 +147,19 @@ For the Code data, the following table shows the proportion of different program
 
 To comprehensively assess the performance of the model, we conducted extensive testing across a range of standard datasets, including C-Eval, CMMLU, Gaokao-Bench, MMLU, GAOKAO-English, AGIEval, RACE-M, CommonSenseQA, PIQA, GSM8K, and HumanEval. These evaluations cover the model's capabilities in multiple domains, specifically Chinese question answering, English question answering, language understanding, commonsense question answering, logical reasoning, mathematical problem solving, and coding ability. The evaluation results are as follows:

- | Capability Dimension | Dataset | | XVERSE-65B | Llama1-65B | Llama2-70B | Falcon-180B | GPT-3.5 | GPT-4 |
- | :--------------------: | :------------------------: | :----: | :--------: | :--------: | :--------: | :---------: | :-----: | :---: |
- | Chinese QA | C-Eval | 5-shot | 68.6 | 38.8 | 49.9 | 54.2 | 54.4 | 68.7 |
- | | CMMLU | 5-shot | 72.6 | 40.6 | 53.6 | 57.2 | 53.9 | 71.0 |
- | | Gaokao-Bench<sup>1</sup> | 5-shot | 73.9 | 38.9 | 51.4 | 50.5 | - | - |
- | English QA | MMLU | 5-shot | 70.8 | 63.4 | 68.9 | 70.5 | 70.0 | 86.4 |
- | | GAOKAO-English<sup>1</sup> | 5-shot | 85.3 | 67.0 | 76.6 | 63.3 | - | - |
- | Chinese & English QA | AGIEval<sup>1</sup> | 5-shot | 61.8 | 42.4 | 51.4 | 51.3 | - | - |
- | Language Understanding | RACE-M | 0-shot | 90.6 | 67.9 | 81.5 | 87.6 | 85.6 | 93.7 |
- | Common Sense QA | CommonSenseQA | 7-shot | 79.8 | 74.0 | 78.5 | 82.4 | 80.2 | 88.3 |
- | Reasoning | PIQA | 0-shot | 80.4 | 82.8 | 82.8 | 85.3 | 81.7 | 89.2 |
- | Math | GSM8K | 4-shot | 60.3 | 50.9 | 56.8 | 62.6 | 57.1 | 92.0 |
- | Coding | HumanEval | 0-shot | 26.8 | 23.7 | 29.9 | - | 48.1 | 67.0 |
+ | Capability Dimension | Dataset | | XVERSE-65B-2 | XVERSE-65B | Llama1-65B | Llama2-70B | Falcon-180B | GPT-3.5 | GPT-4 |
+ | :--------------------: | :------------------------: | :----: | :----------: | :--------: | :--------: | :--------: | :---------: | :-----: | :---: |
+ | Chinese QA | C-Eval | 5-shot | 72.4 | 68.6 | 38.8 | 49.9 | 54.2 | 54.4 | 68.7 |
+ | | CMMLU | 5-shot | 75.1 | 72.6 | 40.6 | 53.6 | 57.2 | 53.9 | 71.0 |
+ | | Gaokao-Bench<sup>1</sup> | 5-shot | 76.9 | 73.9 | 38.9 | 51.4 | 50.5 | - | - |
+ | English QA | MMLU | 5-shot | 74.4 | 70.8 | 63.4 | 68.9 | 70.5 | 70.0 | 86.4 |
+ | | GAOKAO-English<sup>1</sup> | 5-shot | 86.6 | 85.3 | 67.0 | 76.6 | 63.3 | - | - |
+ | Chinese & English QA | AGIEval<sup>1</sup> | 5-shot | 66.2 | 61.8 | 42.4 | 51.4 | 51.3 | - | - |
+ | Language Understanding | RACE-M | 0-shot | 90.7 | 90.6 | 67.9 | 81.5 | 87.6 | 85.6 | 93.7 |
+ | Common Sense QA | CommonSenseQA | 7-shot | 81.1 | 79.8 | 74.0 | 78.5 | 82.4 | 80.2 | 88.3 |
+ | Reasoning | PIQA | 0-shot | 79.4 | 80.4 | 82.8 | 82.8 | 85.3 | 81.7 | 89.2 |
+ | Math | GSM8K | 4-shot | 72.6 | 60.3 | 50.9 | 56.8 | 62.6 | 57.1 | 92.0 |
+ | Coding | HumanEval | 0-shot | 37.8 | 26.8 | 23.7 | 29.9 | - | 48.1 | 67.0 |

 > <sup>1: Tests are conducted only on single-answer multiple-choice questions, thus excluding fill-in-the-blank questions, open-ended questions, and multiple-answer multiple-choice questions.</sup>
 
@@ -170,19 +170,19 @@ For the Code data, the following table shows the proportion of different program
 
 To comprehensively assess the performance of the model, we conducted extensive testing across a range of standard datasets, including C-Eval, CMMLU, Gaokao-Bench, MMLU, GAOKAO-English, AGIEval, RACE-M, CommonSenseQA, PIQA, GSM8K, and HumanEval. These evaluations spanned multiple capabilities of the model, specifically including Chinese question answering, English question answering, language comprehension, commonsense question answering, logical reasoning, mathematical problem-solving, and coding ability. The results of the evaluations are as follows:

- | Capability Dimension | Dataset | | XVERSE-65B | Llama1-65B | Llama2-70B | Falcon-180B | GPT-3.5 | GPT-4 |
- | :--------------------: | :------------------------: | :----: | :--------: | :--------: | :--------: | :---------: | :-----: | :---: |
- | Chinese QA | C-Eval | 5-shot | 68.6 | 38.8 | 49.9 | 54.2 | 54.4 | 68.7 |
- | | CMMLU | 5-shot | 72.6 | 40.6 | 53.6 | 57.2 | 53.9 | 71.0 |
- | | Gaokao-Bench<sup>1</sup> | 5-shot | 73.9 | 38.9 | 51.4 | 50.5 | - | - |
- | English QA | MMLU | 5-shot | 70.8 | 63.4 | 68.9 | 70.5 | 70.0 | 86.4 |
- | | GAOKAO-English<sup>1</sup> | 5-shot | 85.3 | 67.0 | 76.6 | 63.3 | - | - |
- | Chinese & English QA | AGIEval<sup>1</sup> | 5-shot | 61.8 | 42.4 | 51.4 | 51.3 | - | - |
- | Language Understanding | RACE-M | 0-shot | 90.6 | 67.9 | 81.5 | 87.6 | 85.6 | 93.7 |
- | Common Sense QA | CommonSenseQA | 7-shot | 79.8 | 74.0 | 78.5 | 82.4 | 80.2 | 88.3 |
- | Reasoning | PIQA | 0-shot | 80.4 | 82.8 | 82.8 | 85.3 | 81.7 | 89.2 |
- | Math | GSM8K | 4-shot | 60.3 | 50.9 | 56.8 | 62.6 | 57.1 | 92.0 |
- | Coding | HumanEval | 0-shot | 26.8 | 23.7 | 29.9 | - | 48.1 | 67.0 |
+ | Capability Dimension | Dataset | | XVERSE-65B-2 | XVERSE-65B | Llama1-65B | Llama2-70B | Falcon-180B | GPT-3.5 | GPT-4 |
+ | :--------------------: | :------------------------: | :----: | :----------: | :--------: | :--------: | :--------: | :---------: | :-----: | :---: |
+ | Chinese QA | C-Eval | 5-shot | 72.4 | 68.6 | 38.8 | 49.9 | 54.2 | 54.4 | 68.7 |
+ | | CMMLU | 5-shot | 75.1 | 72.6 | 40.6 | 53.6 | 57.2 | 53.9 | 71.0 |
+ | | Gaokao-Bench<sup>1</sup> | 5-shot | 76.9 | 73.9 | 38.9 | 51.4 | 50.5 | - | - |
+ | English QA | MMLU | 5-shot | 74.4 | 70.8 | 63.4 | 68.9 | 70.5 | 70.0 | 86.4 |
+ | | GAOKAO-English<sup>1</sup> | 5-shot | 86.6 | 85.3 | 67.0 | 76.6 | 63.3 | - | - |
+ | Chinese & English QA | AGIEval<sup>1</sup> | 5-shot | 66.2 | 61.8 | 42.4 | 51.4 | 51.3 | - | - |
+ | Language Understanding | RACE-M | 0-shot | 90.7 | 90.6 | 67.9 | 81.5 | 87.6 | 85.6 | 93.7 |
+ | Common Sense QA | CommonSenseQA | 7-shot | 81.1 | 79.8 | 74.0 | 78.5 | 82.4 | 80.2 | 88.3 |
+ | Reasoning | PIQA | 0-shot | 79.4 | 80.4 | 82.8 | 82.8 | 85.3 | 81.7 | 89.2 |
+ | Math | GSM8K | 4-shot | 72.6 | 60.3 | 50.9 | 56.8 | 62.6 | 57.1 | 92.0 |
+ | Coding | HumanEval | 0-shot | 37.8 | 26.8 | 23.7 | 29.9 | - | 48.1 | 67.0 |

 > <sup>1: Tests are conducted only on single-answer multiple-choice questions, thus excluding fill-in-the-blank questions, open-ended questions, and multiple-answer multiple-choice questions.</sup>
 
 
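The shot counts in the tables above (0-shot, 4-shot, 5-shot, 7-shot) refer to the number of solved exemplars prepended to each test question. As a rough illustration only (not the harness used to produce these numbers), the sketch below scores a single 5-shot, single-answer multiple-choice item by comparing the model's next-token probabilities for the option letters A, B, C, and D. The repo id `xverse/XVERSE-65B-2` and the `question`/`choices`/`answer` fields of each item are assumptions made for this example.

```python
# Illustrative sketch of a few-shot multiple-choice evaluation (assumed setup,
# not the official evaluation code): prepend k solved exemplars, then pick the
# option letter to which the model assigns the highest next-token probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "xverse/XVERSE-65B-2"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model.eval()

def format_example(example, with_answer=True):
    """Render one item as question, lettered options, and (optionally) the answer."""
    text = example["question"] + "\n"
    for letter, choice in zip("ABCD", example["choices"]):
        text += f"{letter}. {choice}\n"
    text += "Answer:" + (f" {example['answer']}\n\n" if with_answer else "")
    return text

@torch.no_grad()
def predict(few_shot_examples, test_example):
    """Return the option letter with the highest probability for the next token."""
    prompt = "".join(format_example(e) for e in few_shot_examples)   # k solved exemplars
    prompt += format_example(test_example, with_answer=False)        # unanswered item
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    next_token_logits = model(**inputs).logits[0, -1]
    # Token ids for " A", " B", " C", " D" (last sub-token as an approximation).
    option_ids = [tokenizer.encode(f" {letter}", add_special_tokens=False)[-1]
                  for letter in "ABCD"]
    return "ABCD"[int(torch.argmax(next_token_logits[option_ids]))]
```

Per-dataset accuracy would then be the fraction of test items for which `predict` returns the gold label; prompt templates, shot counts, and answer extraction differ across evaluation harnesses, so numbers reproduced this way will not exactly match the table.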