Update README.md
README.md CHANGED
@@ -34,6 +34,57 @@ KeyError: 'qwen2'
We do not advise you to use base language models for text generation. Instead, you can apply post-training, e.g., SFT, RLHF, continued pretraining, etc., on this model.
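
As a minimal illustration of that suggestion, assuming Hugging Face `transformers` and a user-supplied JSONL dataset with a `text` column, a supervised fine-tuning run could look like the sketch below; the file name, sequence length, and hyperparameters are placeholders, not recommendations.

```python
# Rough SFT sketch with Hugging Face transformers (illustrative only).
# "sft_data.jsonl" is a placeholder file whose "text" column holds formatted examples.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "Qwen/Qwen2-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

dataset = load_dataset("json", data_files="sft_data.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qwen2-7b-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    # Causal-LM collator: pads batches and copies input_ids into labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
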
### Performance

The evaluation of base models mainly focuses on model performance in natural language understanding, general question answering, coding, mathematics, scientific knowledge, reasoning, multilingual capability, etc.

The datasets for evaluation include the following (an example of running one of these benchmarks appears after the list):

**English Tasks**: MMLU (5-shot), MMLU-Pro (5-shot), GPQA (5-shot), Theorem QA (5-shot), BBH (3-shot), HellaSwag (10-shot), Winogrande (5-shot), TruthfulQA (0-shot), ARC-C (25-shot)

**Coding Tasks**: EvalPlus (0-shot) (HumanEval, MBPP, HumanEval+, MBPP+), MultiPL-E (0-shot) (Python, C++, Java, PHP, TypeScript, C#, Bash, JavaScript)

**Math Tasks**: GSM8K (4-shot), MATH (4-shot)

**Chinese Tasks**: C-Eval (5-shot), CMMLU (5-shot)

**Multilingual Tasks**: Multi-Exam (M3Exam 5-shot, IndoMMLU 3-shot, ruMMLU 5-shot, mMMLU 5-shot), Multi-Understanding (BELEBELE 5-shot, XCOPA 5-shot, XWinograd 5-shot, XStoryCloze 0-shot, PAWS-X 5-shot), Multi-Mathematics (MGSM 8-shot), Multi-Translation (Flores-101 5-shot)
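
One way to approximate these few-shot settings, assuming EleutherAI's lm-evaluation-harness (`lm_eval`, v0.4+) rather than any particular internal pipeline, is sketched below for 5-shot MMLU; scores obtained this way are not guaranteed to match the table that follows.

```python
# Hedged reproduction sketch using EleutherAI's lm-evaluation-harness (lm_eval >= 0.4).
# This is not necessarily the harness behind the reported numbers.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2-7B,dtype=bfloat16",
    tasks=["mmlu"],   # 5-shot MMLU, matching the English task list above
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```
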
#### Qwen2-7B performance

| Datasets | Mistral-7B | Gemma-7B | Llama-3-8B | Qwen1.5-7B | Qwen2-7B |
| :-------- | :---------: | :------------: | :------------: | :------------: | :------------: |
| # Params | 7.2B | 8.5B | 8.0B | 7.7B | 7.6B |
| # Non-emb Params | 7.0B | 7.8B | 7.0B | 6.5B | 6.5B |
| ***English*** | | | | | |
| MMLU | 64.2 | 64.6 | 66.6 | 61.0 | **70.3** |
| MMLU-Pro | 30.9 | 33.7 | 35.4 | 29.9 | **40.0** |
| GPQA | 24.7 | 25.7 | 25.8 | 26.7 | **31.8** |
| Theorem QA | 19.2 | 21.5 | 22.1 | 14.2 | **31.1** |
| BBH | 56.1 | 55.1 | 57.7 | 40.2 | **62.6** |
| HellaSwag | **83.2** | 82.2 | 82.1 | 78.5 | 80.7 |
| Winogrande | 78.4 | **79.0** | 77.4 | 71.3 | 77.0 |
| ARC-C | 60.0 | **61.1** | 59.3 | 54.2 | 60.6 |
| TruthfulQA | 42.2 | 44.8 | 44.0 | 51.1 | **54.2** |
| ***Coding*** | | | | | |
| HumanEval | 29.3 | 37.2 | 33.5 | 36.0 | **51.2** |
| MBPP | 51.1 | 50.6 | 53.9 | 51.6 | **65.9** |
| EvalPlus | 36.4 | 39.6 | 40.3 | 40.0 | **54.2** |
| MultiPL-E | 29.4 | 29.7 | 22.6 | 28.1 | **46.3** |
| ***Mathematics*** | | | | | |
| GSM8K | 52.2 | 46.4 | 56.0 | 62.5 | **79.9** |
| MATH | 13.1 | 24.3 | 20.5 | 20.3 | **44.2** |
| ***Chinese*** | | | | | |
| C-Eval | 47.4 | 43.6 | 49.5 | 74.1 | **83.2** |
| CMMLU | - | - | 50.8 | 73.1 | **83.9** |
| ***Multilingual*** | | | | | |
| Multi-Exam | 47.1 | 42.7 | 52.3 | 47.7 | **59.2** |
| Multi-Understanding | 63.3 | 58.3 | 68.6 | 67.6 | **72.0** |
| Multi-Mathematics | 26.3 | 39.1 | 36.3 | 37.3 | **57.5** |
| Multi-Translation | 23.3 | 31.2 | **31.9** | 28.4 | 31.5 |

## Citation

If you find our work helpful, feel free to cite us.