mlabonne committed
Commit 8f4b077
1 Parent(s): 5aa70ec

Update README.md

Files changed (1)
  1. README.md +8 -4
README.md CHANGED
@@ -21,13 +21,13 @@ Marcoro14-7B-slerp is the second best-performing 7B LLM on the Open LLM Leaderbo
 
 I also evaluated it using Nous' benchmark suite and obtained the following results:
 
-| Model |agieval|gpt4all|truthfulqa|bigbench|Average|
+| Model |AGIEval|GPT4ALL|TruthfulQA|Bigbench|Average|
 |-------------------------|------:|------:|---------:|-------:|------:|
 |Marcoro14-7B-slerp | 44.66| 76.24| 64.15| 45.64| 57.67|
 |OpenHermes-2.5-Mistral-7B| 43.07| 73.12| 53.04| 40.96| 52.57|
 |Change | +1.59| +3.12| +11.11| +4.68| +5.1|
 
-### AGIEVAL
+### AGIEval
 | Task |Version| Metric |Value| |Stderr|
 |------------------------------|------:|--------|----:|---|-----:|
 |agieval_aqua_rat | 0|acc |26.38|± | 2.77|
@@ -46,6 +46,7 @@ I also evaluated it using Nous' benchmark suite and obtained the following resul
 | | |acc_norm|45.63|± | 3.48|
 |agieval_sat_math | 0|acc |33.18|± | 3.18|
 | | |acc_norm|30.45|± | 3.11|
+
 Average: 44.66%
 
 ### GPT4ALL
@@ -63,16 +64,18 @@ Average: 44.66%
 |piqa | 0|acc |82.59|± | 0.88|
 | | |acc_norm|84.39|± | 0.85|
 |winogrande | 0|acc |78.53|± | 1.15|
+
 Average: 76.24%
 
-### TRUTHFULQA
+### TruthfulQA
 | Task |Version|Metric|Value| |Stderr|
 |-------------|------:|------|----:|---|-----:|
 |truthfulqa_mc| 1|mc1 |46.88|± | 1.75|
 | | |mc2 |64.15|± | 1.52|
+
 Average: 64.15%
 
-### BIGBENCH
+### Bigbench
 | Task |Version| Metric |Value| |Stderr|
 |------------------------------------------------|------:|---------------------|----:|---|-----:|
 |bigbench_causal_judgement | 0|multiple_choice_grade|56.32|± | 3.61|
@@ -94,6 +97,7 @@ Average: 64.15%
 |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|23.44|± | 1.20|
 |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|18.51|± | 0.93|
 |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|52.33|± | 2.89|
+
 Average: 45.64%
 
 Average score: 57.67%
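
For reference, the "Average" column in the comparison table is (up to rounding) the plain mean of the four per-benchmark averages reported in the sections above, and the "Change" row is the per-benchmark difference against OpenHermes-2.5-Mistral-7B. A minimal Python sketch, using only the values already shown in the README (nothing is re-evaluated), that reproduces those figures:

```python
# Illustrative only: recompute the summary row from the per-benchmark averages above.
marcoro = {"AGIEval": 44.66, "GPT4ALL": 76.24, "TruthfulQA": 64.15, "Bigbench": 45.64}
openhermes = {"AGIEval": 43.07, "GPT4ALL": 73.12, "TruthfulQA": 53.04, "Bigbench": 40.96}

# "Average" column: plain mean of the four benchmark averages.
print(f"Average score: {sum(marcoro.values()) / len(marcoro):.2f}")  # 57.67

# "Change" row: per-benchmark difference against OpenHermes-2.5-Mistral-7B.
for task, score in marcoro.items():
    print(f"{task}: {score - openhermes[task]:+.2f}")  # +1.59, +3.12, +11.11, +4.68
```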