| Popular aggregated benchmark | Arena Hard | 37 | 39.4 | 25.7 | 42 | 55.2 |
| | BigBench Hard CoT (0-shot) | 69 | 60.2 | 63.4 | 63.5 | 66.7 |
| | MMLU (5-shot) | 69 | 67.2 | 68.1 | 71.3 | 78.7 |
| | MMLU-Pro (0-shot, CoT) | 47.4 | 40.7 | 44 | 50.1 | 57.2 |
| Reasoning | ARC Challenge (10-shot) | 84.6 | 84.8 | 83.1 | 89.8 | 92.8 |
| | TruthfulQA (MC2) (10-shot) | 64 | 68.1 | 69.2 | 76.6 | 76.6 |
| | WinoGrande (5-shot) | 68.5 | 70.4 | 64.7 | 74 | 74.7 |
| Multilingual | Multilingual MMLU (5-shot) | 55.4 | 58.9 | 56.2 | 63.8 | 77.2 |
| Math | GSM8K (8-shot, CoT) | 86.2 | 84.2 | 82.4 | 84.9 | 82.4 |
| | MATH (0-shot, CoT) | 48.5 | 31.2 | 47.6 | 50.9 | 38 |
| Long context | Qasper | 41.9 | 30.7 | 37.2 | 13.9 | 43.5 |
| | SQuALITY | 24.3 | 25.8 | 26.2 | 0 | 23.5 |
| Code Generation | HumanEval (0-shot) | 62.8 | 63.4 | 66.5 | 61 | 74.4 |
| | MBPP (3-shot) | 69.6 | 68.1 | 69.4 | 69.3 | 77.5 |