Update README.md
README.md

Marcoro14-7B-slerp is the second best-performing 7B LLM on the Open LLM Leaderboard.

I also evaluated it using Nous' benchmark suite and obtained the following results:

| Model                   |AGIEval|GPT4ALL|TruthfulQA|Bigbench|Average|
|-------------------------|------:|------:|---------:|-------:|------:|
|Marcoro14-7B-slerp       |  44.66|  76.24|     64.15|   45.64|  57.67|
|OpenHermes-2.5-Mistral-7B|  43.07|  73.12|     53.04|   40.96|  52.57|
|Change                   |  +1.59|  +3.12|    +11.11|   +4.68|  +5.10|
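
Nous' suite corresponds to the four task groups broken out below (AGIEval, GPT4ALL, TruthfulQA, Bigbench) and is usually run with EleutherAI's lm-evaluation-harness. Here is a minimal sketch, assuming the harness's `simple_evaluate` API; the backend name, repo id, and task list are assumptions and vary across harness versions:

```python
# Hypothetical sketch: score one task group from the Nous suite with
# EleutherAI's lm-evaluation-harness. The task names match the tables in
# this README, but backend strings and signatures differ across versions.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",  # HF Transformers backend name in older harness releases (assumption)
    model_args="pretrained=mlabonne/Marcoro14-7B-slerp",  # assumed HF repo id
    tasks=["truthfulqa_mc"],  # e.g. the TruthfulQA table below
    batch_size=8,
)
print(results["results"])  # per-task metrics, e.g. mc1/mc2 for truthfulqa_mc
```

The per-section tables below are the kind of per-task breakdown this call returns; running the full suite is the same call with the complete task list.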

### AGIEval
| Task |Version| Metric |Value| |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat | 0|acc |26.38|± | 2.77|
…
| | |acc_norm|45.63|± | 3.48|
|agieval_sat_math | 0|acc |33.18|± | 3.18|
| | |acc_norm|30.45|± | 3.11|

Average: 44.66%

### GPT4ALL
…
|piqa | 0|acc |82.59|± | 0.88|
| | |acc_norm|84.39|± | 0.85|
|winogrande | 0|acc |78.53|± | 1.15|

Average: 76.24%

### TruthfulQA
| Task |Version|Metric|Value| |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc| 1|mc1 |46.88|± | 1.75|
| | |mc2 |64.15|± | 1.52|

Average: 64.15%

### Bigbench
| Task |Version| Metric |Value| |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement | 0|multiple_choice_grade|56.32|± | 3.61|
…
|bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|23.44|± | 1.20|
|bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|18.51|± | 0.93|
|bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|52.33|± | 2.89|

Average: 45.64%

Average score: 57.67%
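
As a sanity check, the "Change" row and the overall score in the summary table follow directly from the rounded per-benchmark averages; a minimal Python sketch:

```python
# Reproduce the summary-table arithmetic from the rounded values above.
marcoro    = {"AGIEval": 44.66, "GPT4ALL": 76.24, "TruthfulQA": 64.15, "Bigbench": 45.64}
openhermes = {"AGIEval": 43.07, "GPT4ALL": 73.12, "TruthfulQA": 53.04, "Bigbench": 40.96}

# Per-benchmark deltas (the "Change" row): +1.59, +3.12, +11.11, +4.68
for name, score in marcoro.items():
    print(f"{name:10s} {score - openhermes[name]:+.2f}")

# The overall score is the plain mean of the four benchmark averages: 57.67
print(f"Average: {sum(marcoro.values()) / len(marcoro):.2f}")
```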