benchmarked on lm-evaluation-harness 0.4.1
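The scores below could in principle be reproduced with the lm-evaluation-harness CLI. A minimal invocation sketch, assuming the `hf` backend and the task names from the tables (exact task configs and batch settings used for these runs are not stated here):

```shell
# Sketch only: pin the harness version the scores were produced with,
# then evaluate the model on a couple of the reported tasks.
pip install lm-eval==0.4.1
lm_eval --model hf \
  --model_args pretrained=VAGOsolutions/SauerkrautLM-14b-MoE-LaserChat \
  --tasks winogrande,gsm8k \
  --num_fewshot 5 \
  --batch_size auto
```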
| Winogrande (5-shot) | 80.74 |
| GSM8K (5-shot) | 74.15 |

**Performance**

| Model |AGIEval|GPT4All|TruthfulQA|BigBench|Average ⬇️|
|-------|------:|------:|---------:|-------:|---------:|
|[VAGOsolutions/SauerkrautLM-14b-MoE-LaserChat](https://huggingface.co/VAGOsolutions/SauerkrautLM-14b-MoE-LaserChat) | 44.38| 74.76| 58.57| 47.98| 56.42|
|[VAGOsolutions/SauerkrautLM-Gemma-7b](https://huggingface.co/VAGOsolutions/SauerkrautLM-Gemma-7b) | 37.50| 72.46| 61.24| 45.33| 54.13|
|[zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) | 37.52| 71.77| 55.26| 39.77| 51.08|
|[zephyr-7b-gemma-v0.1](https://huggingface.co/HuggingFaceH4/zephyr-7b-gemma-v0.1)| 34.22| 66.37| 52.19| 37.10| 47.47|
|[google/gemma-7b-it](https://huggingface.co/google/gemma-7b-it) | 21.33| 40.84| 41.70| 30.25| 33.53|

<details><summary>Details of AGIEval, GPT4All, TruthfulQA, BigBench</summary>

**AGIEval**

| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|------------------------------|------:|------|------|--------|-----:|---|-----:|
|agieval_sat_math | 1|none |None |acc |0.3727|± |0.0327|
| | |none |None |acc_norm|0.3045|± |0.0311|
|agieval_sat_en_without_passage| 1|none |None |acc |0.4806|± |0.0349|
| | |none |None |acc_norm|0.4612|± |0.0348|
|agieval_sat_en | 1|none |None |acc |0.7816|± |0.0289|
| | |none |None |acc_norm|0.7621|± |0.0297|
|agieval_lsat_rc | 1|none |None |acc |0.6134|± |0.0297|
| | |none |None |acc_norm|0.6059|± |0.0298|
|agieval_lsat_lr | 1|none |None |acc |0.5431|± |0.0221|
| | |none |None |acc_norm|0.5216|± |0.0221|
|agieval_lsat_ar | 1|none |None |acc |0.2435|± |0.0284|
| | |none |None |acc_norm|0.2174|± |0.0273|
|agieval_logiqa_en | 1|none |None |acc |0.3871|± |0.0191|
| | |none |None |acc_norm|0.4101|± |0.0193|
|agieval_aqua_rat | 1|none |None |acc |0.3031|± |0.0289|
| | |none |None |acc_norm|0.2677|± |0.0278|

Average: 44.38%
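The AGIEval average appears to be the plain mean of the `acc_norm` values in the table; a quick check (the `acc_norm` convention is an assumption, not stated in the harness output):

```python
# Mean of the eight AGIEval acc_norm scores from the table above.
acc_norm = [0.3045, 0.4612, 0.7621, 0.6059, 0.5216, 0.2174, 0.4101, 0.2677]
average = 100 * sum(acc_norm) / len(acc_norm)
print(round(average, 2))  # 44.38
```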

**GPT4All**

| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|-------------|------:|------|------|--------|-----:|---|-----:|
|arc_challenge| 1|none |None |acc |0.5947|± |0.0143|
| | |none |None |acc_norm|0.6280|± |0.0141|
|arc_easy | 1|none |None |acc |0.8506|± |0.0073|
| | |none |None |acc_norm|0.8468|± |0.0074|
|boolq | 2|none |None |acc |0.8761|± |0.0058|
|hellaswag | 1|none |None |acc |0.6309|± |0.0048|
| | |none |None |acc_norm|0.8323|± |0.0037|
|openbookqa | 1|none |None |acc |0.3260|± |0.0210|
| | |none |None |acc_norm|0.4700|± |0.0223|
|piqa | 1|none |None |acc |0.8237|± |0.0089|
| | |none |None |acc_norm|0.8335|± |0.0087|
|winogrande | 1|none |None |acc |0.7466|± |0.0122|

Average: 74.76%

**TruthfulQA**

| Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
|--------------|------:|------|-----:|------|-----:|---|-----:|
|truthfulqa_mc2| 2|none | 0|acc |0.5857|± |0.0141|

Average: 58.57%
**BigBench**

| Tasks |Version| Filter |n-shot| Metric |Value | |Stderr|
|----------------------------------------------------|------:|----------------|-----:|-----------|-----:|---|-----:|
|bbh_zeroshot_tracking_shuffled_objects_three_objects| 2|flexible-extract| 0|exact_match|0.3120|± |0.0294|
|bbh_zeroshot_tracking_shuffled_objects_seven_objects| 2|flexible-extract| 0|exact_match|0.1560|± |0.0230|
|bbh_zeroshot_tracking_shuffled_objects_five_objects | 2|flexible-extract| 0|exact_match|0.1720|± |0.0239|
|bbh_zeroshot_temporal_sequences | 2|flexible-extract| 0|exact_match|0.3960|± |0.0310|
|bbh_zeroshot_sports_understanding | 2|flexible-extract| 0|exact_match|0.8120|± |0.0248|
|bbh_zeroshot_snarks | 2|flexible-extract| 0|exact_match|0.5843|± |0.0370|
|bbh_zeroshot_salient_translation_error_detection | 2|flexible-extract| 0|exact_match|0.4640|± |0.0316|
|bbh_zeroshot_ruin_names | 2|flexible-extract| 0|exact_match|0.4360|± |0.0314|
|bbh_zeroshot_reasoning_about_colored_objects | 2|flexible-extract| 0|exact_match|0.5520|± |0.0315|
|bbh_zeroshot_navigate | 2|flexible-extract| 0|exact_match|0.5800|± |0.0313|
|bbh_zeroshot_movie_recommendation | 2|flexible-extract| 0|exact_match|0.7320|± |0.0281|
|bbh_zeroshot_logical_deduction_three_objects | 2|flexible-extract| 0|exact_match|0.5680|± |0.0314|
|bbh_zeroshot_logical_deduction_seven_objects | 2|flexible-extract| 0|exact_match|0.3920|± |0.0309|
|bbh_zeroshot_logical_deduction_five_objects | 2|flexible-extract| 0|exact_match|0.3960|± |0.0310|
|bbh_zeroshot_geometric_shapes | 2|flexible-extract| 0|exact_match|0.3800|± |0.0308|
|bbh_zeroshot_disambiguation_qa | 2|flexible-extract| 0|exact_match|0.6760|± |0.0297|
|bbh_zeroshot_date_understanding | 2|flexible-extract| 0|exact_match|0.4400|± |0.0315|
|bbh_zeroshot_causal_judgement | 2|flexible-extract| 0|exact_match|0.5882|± |0.0361|

Average: 47.98%
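The Average ⬇️ column in the Performance table is the mean of the four category averages; a quick check:

```python
# Overall score = mean of the four per-category averages reported above.
categories = {"AGIEval": 44.38, "GPT4All": 74.76, "TruthfulQA": 58.57, "BigBench": 47.98}
overall = sum(categories.values()) / len(categories)
print(round(overall, 2))  # 56.42
```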

</details>

## Disclaimer