Update README.md
README.md
CHANGED
@@ -46,14 +46,14 @@ InternLM2.5 has open-sourced a 7 billion parameter base model and a chat model t
 
 We conducted a comprehensive evaluation of InternLM using the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/). The evaluation covered five dimensions of capabilities: disciplinary competence, language competence, knowledge competence, inference competence, and comprehension competence. Here are some of the evaluation results, and you can visit the [OpenCompass leaderboard](https://rank.opencompass.org.cn) for more evaluation results.
 
-|
-|
-|MMLU |
-|CMMLU
-|BBH |
-|MATH
-| GSM8K
-|GPQA | 38.4
+| Benchmark          | InternLM2.5-7B-Chat | Llama3-8B-Instruct | Gemma2-9B-IT | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen2-7B-Instruct |
+| ------------------ | ------------------- | ------------------ | ------------ | -------------- | ------------- | ----------------- |
+| MMLU (5-shot)      | **72.8**            | 68.4               | 70.9         | 71.0           | 71.4          | 70.8              |
+| CMMLU (5-shot)     | 78.0                | 53.3               | 60.3         | 74.5           | 74.5          | 80.9              |
+| BBH (3-shot CoT)   | **71.6**            | 54.4               | 68.2\*       | 69.6           | 69.6          | 65.0              |
+| MATH (0-shot CoT)  | **60.1**            | 27.9               | 46.9         | 51.1           | 51.1          | 48.6              |
+| GSM8K (0-shot CoT) | 86.0                | 72.9               | 88.9         | 80.1           | 85.3          | 82.9              |
+| GPQA (0-shot)      | **38.4**            | 26.1               | 33.8         | 37.9           | 36.9          | 38.4              |
 
 
 - The evaluation results were obtained from [OpenCompass](https://github.com/internLM/OpenCompass/) (some data marked with *, which means come from the original papers), and evaluation configuration can be found in the configuration files provided by [OpenCompass](https://github.com/internLM/OpenCompass/).
@@ -197,14 +197,14 @@ InternLM2.5, the 2.5th generation of the InternLM (书生·浦语) large model, has open-sourced, for practical sc
 
 We used the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/) to evaluate InternLM comprehensively across five capability dimensions: disciplinary competence, language competence, knowledge competence, reasoning competence, and comprehension competence. Some of the results are shown in the table below; visit the [OpenCompass leaderboard](https://rank.opencompass.org.cn) for more evaluation results.
 
-| Benchmark\Model
-|
-|MMLU |
-|CMMLU
-|BBH |
-|MATH
-| GSM8K
-|GPQA | 38.4
+| Benchmark\Model    | InternLM2.5-7B-Chat | Llama3-8B-Instruct | Gemma2-9B-IT | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen2-7B-Instruct |
+| ------------------ | ------------------- | ------------------ | ------------ | -------------- | ------------- | ----------------- |
+| MMLU (5-shot)      | **72.8**            | 68.4               | 70.9         | 71.0           | 71.4          | 70.8              |
+| CMMLU (5-shot)     | 78.0                | 53.3               | 60.3         | 74.5           | 74.5          | 80.9              |
+| BBH (3-shot CoT)   | **71.6**            | 54.4               | 68.2\*       | 69.6           | 69.6          | 65.0              |
+| MATH (0-shot CoT)  | **60.1**            | 27.9               | 46.9         | 51.1           | 51.1          | 48.6              |
+| GSM8K (0-shot CoT) | 86.0                | 72.9               | 88.9         | 80.1           | 85.3          | 82.9              |
+| GPQA (0-shot)      | **38.4**            | 26.1               | 33.8         | 37.9           | 36.9          | 38.4              |
 
 - The evaluation results above were obtained with [OpenCompass](https://github.com/internLM/OpenCompass/) (values marked with `*` are taken from the original papers); for test details, see the configuration files provided in [OpenCompass](https://github.com/internLM/OpenCompass/).
 - Evaluation figures may vary across [OpenCompass](https://github.com/internLM/OpenCompass/) versions; please treat the results from the latest version of [OpenCompass](https://github.com/internLM/OpenCompass/) as authoritative.
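As a reading aid for the benchmark table added in this change, here is a minimal Python sketch that parses the Markdown rows and lists the top-scoring model(s) for each benchmark. The helper names `parse_table` and `top_scorers` are hypothetical and belong to neither OpenCompass nor the InternLM repository; the rows are copied from the added table, with column padding removed.

```python
import re

# Rows copied from the table added in this diff. "\*" marks a value taken
# from the original paper; "**...**" marks a cell that is bold in the README.
TABLE = r"""
| Benchmark | InternLM2.5-7B-Chat | Llama3-8B-Instruct | Gemma2-9B-IT | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen2-7B-Instruct |
| --- | --- | --- | --- | --- | --- | --- |
| MMLU (5-shot) | **72.8** | 68.4 | 70.9 | 71.0 | 71.4 | 70.8 |
| CMMLU (5-shot) | 78.0 | 53.3 | 60.3 | 74.5 | 74.5 | 80.9 |
| BBH (3-shot CoT) | **71.6** | 54.4 | 68.2\* | 69.6 | 69.6 | 65.0 |
| MATH (0-shot CoT) | **60.1** | 27.9 | 46.9 | 51.1 | 51.1 | 48.6 |
| GSM8K (0-shot CoT) | 86.0 | 72.9 | 88.9 | 80.1 | 85.3 | 82.9 |
| GPQA (0-shot) | **38.4** | 26.1 | 33.8 | 37.9 | 36.9 | 38.4 |
"""


def parse_table(text):
    """Return (model_names, {benchmark: [(score, is_bold), ...]})."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    cells = [[c.strip() for c in ln.strip("|").split("|")] for ln in lines]
    models = cells[0][1:]
    rows = {}
    for row in cells[2:]:  # skip the header row and the separator row
        scores = []
        for cell in row[1:]:
            is_bold = cell.startswith("**")
            # Drop Markdown emphasis and the escaped asterisk before parsing.
            value = float(re.sub(r"[*\\]", "", cell))
            scores.append((value, is_bold))
        rows[row[0]] = scores
    return models, rows


def top_scorers(text):
    """Map each benchmark to (best_score, [models that reach it])."""
    models, rows = parse_table(text)
    result = {}
    for bench, scores in rows.items():
        best = max(value for value, _ in scores)
        leaders = [m for m, (value, _) in zip(models, scores) if value == best]
        result[bench] = (best, leaders)
    return result


if __name__ == "__main__":
    for bench, (best, leaders) in top_scorers(TABLE).items():
        print(f"{bench}: {best} ({', '.join(leaders)})")
```

Comparing the output with the bold cells is a quick way to see which rows have a single leader and which are tied.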