Losin94 committed
Commit c03367e
1 Parent(s): e5cebed

Update README.md

Files changed (1): README.md (+49 −0)
README.md CHANGED
@@ -36,6 +36,55 @@ KeyError: 'qwen2'
We do not advise you to use base language models for text generation. Instead, you can apply post-training, such as SFT, RLHF, or continued pretraining, to this model.

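Since the card steers users toward post-training rather than direct generation, here is a minimal, hypothetical SFT sketch using the plain Hugging Face `Trainer`. It is not from the model card: the toy dataset, hyperparameters, and output path are placeholders, and a 72B model would in practice need multi-GPU or parameter-efficient setups (e.g., LoRA) that are omitted here.

```python
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "Qwen/Qwen2-72B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    # Fall back to the EOS token for padding if no pad token is defined.
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# Placeholder instruction data; a real run would use a full SFT corpus.
raw = Dataset.from_dict({
    "prompt": ["Question: What is 2 + 2?\nAnswer:"],
    "response": [" 4"],
})

def tokenize(batch):
    # Concatenate prompt and response into one causal-LM training text.
    texts = [p + r + tokenizer.eos_token
             for p, r in zip(batch["prompt"], batch["response"])]
    return tokenizer(texts, truncation=True, max_length=1024)

train_dataset = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qwen2-72b-sft",  # placeholder output path
        per_device_train_batch_size=1,
        num_train_epochs=1,
    ),
    train_dataset=train_dataset,
    # mlm=False makes the collator copy input_ids into labels,
    # so the loss is ordinary next-token prediction.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The `DataCollatorForLanguageModeling(mlm=False)` collator simply pads each batch and copies `input_ids` into `labels`, so the objective stays plain next-token prediction over the concatenated prompt and response.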
+ ## Performance
+
+ The evaluation of base models focuses mainly on performance in natural language understanding, general question answering, coding, mathematics, scientific knowledge, reasoning, multilingual capability, etc.
+
+ The datasets for evaluation include the following (a sketch of the few-shot scoring procedure appears after the list):
+
+ **English Tasks**: MMLU (5-shot), MMLU-Pro (5-shot), GPQA (5-shot), Theorem QA (5-shot), BBH (3-shot), HellaSwag (10-shot), Winogrande (5-shot), TruthfulQA (0-shot), ARC-C (25-shot)
+
+ **Coding Tasks**: EvalPlus (0-shot) (HumanEval, MBPP, HumanEval+, MBPP+), MultiPL-E (0-shot) (Python, C++, Java, PHP, TypeScript, C#, Bash, JavaScript)
+
+ **Math Tasks**: GSM8K (4-shot), MATH (4-shot)
+
+ **Chinese Tasks**: C-Eval (5-shot), CMMLU (5-shot)
+
+ **Multilingual Tasks**: Multi-Exam (M3Exam 5-shot, IndoMMLU 3-shot, ruMMLU 5-shot, mMMLU 5-shot), Multi-Understanding (BELEBELE 5-shot, XCOPA 5-shot, XWinograd 5-shot, XStoryCloze 0-shot, PAWS-X 5-shot), Multi-Mathematics (MGSM 8-shot), Multi-Translation (Flores-101 5-shot)
+
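To make the n-shot settings above concrete, below is a rough sketch of how likelihood-based multiple-choice evaluations such as MMLU are commonly scored: k solved exemplars are prepended to the question, and the option whose answer letter receives the highest log-likelihood is taken as the prediction. This is an illustration under common-practice assumptions, not the harness or prompts actually used for the numbers below; all prompt strings are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-72B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
model.eval()

# Placeholder prompt: a real 5-shot eval would prepend five solved exemplars
# drawn from the benchmark's dev split, in the benchmark's own format.
few_shot_prefix = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer: B\n\n" * 5
question = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"

def option_logprob(option: str) -> float:
    """Log-likelihood of `option` (e.g. " A") continuing the few-shot prompt."""
    prompt_ids = tokenizer(
        few_shot_prefix + question, return_tensors="pt"
    ).input_ids.to(model.device)
    option_ids = tokenizer(
        option, add_special_tokens=False, return_tensors="pt"
    ).input_ids.to(model.device)
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logprobs = model(input_ids).logits.log_softmax(dim=-1)
    # Logits at position t predict token t + 1, so the option tokens are
    # scored by the positions immediately before each of them.
    start = prompt_ids.shape[1]
    scores = logprobs[0, start - 1 : input_ids.shape[1] - 1]
    return scores.gather(1, option_ids[0].unsqueeze(1)).sum().item()

# The predicted answer is the option with the highest total log-likelihood.
prediction = max([" A", " B", " C", " D"], key=option_logprob).strip()
```

Generative benchmarks such as GSM8K are scored differently (the model writes out a solution and the final answer is extracted), so this sketch applies only to the multiple-choice tasks.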
+ #### Qwen2-72B performance
+ | Datasets | DeepSeek-V2 | Mixtral-8x22B | Llama-3-70B | Qwen1.5-72B | Qwen1.5-110B | **Qwen2-72B** |
+ | :-------- | :---------: | :------------: | :------------: | :------------: | :------------: | :------------: |
+ | Architecture | MoE | MoE | Dense | Dense | Dense | Dense |
+ | #Activated Params | 21B | 39B | 70B | 72B | 110B | 72B |
+ | #Params | 236B | 140B | 70B | 72B | 110B | 72B |
+ | ***English*** | | | | | | |
+ | MMLU | 78.5 | 77.8 | 79.5 | 77.5 | 80.4 | **84.2** |
+ | MMLU-Pro | - | 49.5 | 52.8 | 45.8 | 49.4 | **55.6** |
+ | GPQA | - | 34.3 | 36.3 | 36.3 | 35.9 | **37.9** |
+ | Theorem QA | - | 35.9 | 32.3 | 29.3 | 34.9 | **43.1** |
+ | BBH | 78.9 | 78.9 | 81.0 | 65.5 | 74.8 | **82.4** |
+ | HellaSwag | 87.8 | **88.7** | 88.0 | 86.0 | 87.5 | 87.6 |
+ | Winogrande | 84.8 | 85.0 | **85.3** | 83.0 | 83.5 | 85.1 |
+ | ARC-C | 70.0 | **70.7** | 68.8 | 65.9 | 69.6 | 68.9 |
+ | TruthfulQA | 42.2 | 51.0 | 45.6 | **59.6** | 49.6 | 54.8 |
+ | ***Coding*** | | | | | | |
+ | HumanEval | 45.7 | 46.3 | 48.2 | 46.3 | 54.3 | **64.6** |
+ | MBPP | 73.9 | 71.7 | 70.4 | 66.9 | 70.9 | **76.9** |
+ | EvalPlus | 55.0 | 54.1 | 54.8 | 52.9 | 57.7 | **65.4** |
+ | MultiPL-E | 44.4 | 46.7 | 46.3 | 41.8 | 52.7 | **59.6** |
+ | ***Mathematics*** | | | | | | |
+ | GSM8K | 79.2 | 83.7 | 83.0 | 79.5 | 85.4 | **89.5** |
+ | MATH | 43.6 | 41.7 | 42.5 | 34.1 | 49.6 | **51.1** |
+ | ***Chinese*** | | | | | | |
+ | C-Eval | 81.7 | 54.6 | 65.2 | 84.1 | 89.1 | **91.0** |
+ | CMMLU | 84.0 | 53.4 | 67.2 | 83.5 | 88.3 | **90.1** |
+ | ***Multilingual*** | | | | | | |
+ | Multi-Exam | 67.5 | 63.5 | 70.0 | 66.4 | 75.6 | **76.6** |
+ | Multi-Understanding | 77.0 | 77.7 | 79.9 | 78.2 | 78.2 | **80.7** |
+ | Multi-Mathematics | 58.8 | 62.9 | 67.1 | 61.7 | 64.4 | **76.0** |
+ | Multi-Translation | 36.0 | 23.3 | **38.0** | 35.6 | 36.2 | 37.8 |
+
## Citation

If you find our work helpful, feel free to cite us.