DavidGF committed on
Commit a141fc0
1 Parent(s): a7c4675

Update README.md

Files changed (1)
  1. README.md +87 -0
README.md CHANGED
@@ -86,6 +86,93 @@ benchmarked on lm-evaluation-harness 0.4.1
  | Winogrande (5-shot) | 80.74 |
  | GSM8K (5-shot) | 74.15 |

+ **Performance**
+
+ | Model |AGIEval|GPT4All|TruthfulQA|BigBench|Average ⬇️|
+ |-----------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
+ |[VAGOsolutions/SauerkrautLM-14b-MoE-LaserChat](https://huggingface.co/VAGOsolutions/SauerkrautLM-14b-MoE-LaserChat) | 44.38| 74.76| 58.57| 47.98| 56.42|
+ |[VAGOsolutions/SauerkrautLM-Gemma-7b](https://huggingface.co/VAGOsolutions/SauerkrautLM-Gemma-7b) | 37.5| 72.46| 61.24| 45.33| 54.13|
+ |[zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) | 37.52| 71.77| 55.26| 39.77| 51.08|
+ |[zephyr-7b-gemma-v0.1](https://huggingface.co/HuggingFaceH4/zephyr-7b-gemma-v0.1)| 34.22| 66.37| 52.19| 37.10| 47.47|
+ |[google/gemma-7b-it](https://huggingface.co/google/gemma-7b-it) | 21.33| 40.84| 41.70| 30.25| 33.53|
+
+
+ <details><summary>Details of AGIEval, GPT4All, TruthfulQA, BigBench</summary>
+
+ **AGIEval**
+ | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
+ |------------------------------|------:|------|------|--------|-----:|---|-----:|
+ |agieval_sat_math | 1|none |None |acc |0.3727|± |0.0327|
+ | | |none |None |acc_norm|0.3045|± |0.0311|
+ |agieval_sat_en_without_passage| 1|none |None |acc |0.4806|± |0.0349|
+ | | |none |None |acc_norm|0.4612|± |0.0348|
+ |agieval_sat_en | 1|none |None |acc |0.7816|± |0.0289|
+ | | |none |None |acc_norm|0.7621|± |0.0297|
+ |agieval_lsat_rc | 1|none |None |acc |0.6134|± |0.0297|
+ | | |none |None |acc_norm|0.6059|± |0.0298|
+ |agieval_lsat_lr | 1|none |None |acc |0.5431|± |0.0221|
+ | | |none |None |acc_norm|0.5216|± |0.0221|
+ |agieval_lsat_ar | 1|none |None |acc |0.2435|± |0.0284|
+ | | |none |None |acc_norm|0.2174|± |0.0273|
+ |agieval_logiqa_en | 1|none |None |acc |0.3871|± |0.0191|
+ | | |none |None |acc_norm|0.4101|± |0.0193|
+ |agieval_aqua_rat | 1|none |None |acc |0.3031|± |0.0289|
+ | | |none |None |acc_norm|0.2677|± |0.0278|
+
+ Average: 44.38%
+
+ **GPT4All**
+ | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
+ |---------|------:|------|------|--------|-----:|---|-----:|
+ |arc_challenge| 1|none |None |acc |0.5947|± |0.0143|
+ | | |none |None |acc_norm|0.6280|± |0.0141|
+ |arc_easy | 1|none |None |acc |0.8506|± |0.0073|
+ | | |none |None |acc_norm|0.8468|± |0.0074|
+ |boolq | 2|none |None |acc |0.8761|± |0.0058|
+ |hellaswag | 1|none |None |acc |0.6309|± |0.0048|
+ | | |none |None |acc_norm|0.8323|± |0.0037|
+ |openbookqa | 1|none |None |acc |0.326 |± |0.0210|
+ | | |none |None |acc_norm|0.470 |± |0.0223|
+ |piqa | 1|none |None |acc |0.8237|± |0.0089|
+ | | |none |None |acc_norm|0.8335|± |0.0087|
+ |winogrande | 1|none |None |acc |0.7466|± |0.0122|
+
+ Average: 74.76%
+
+ **TruthfulQA**
+ | Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
+ |--------------|------:|------|-----:|------|-----:|---|-----:|
+ |truthfulqa_mc2| 2|none | 0|acc |0.5857|± |0.0141|
+
+
+ Average: 58.57%
+
+ **BigBench**
+ | Tasks |Version| Filter |n-shot| Metric |Value | |Stderr|
+ |----------------------------------------------------|------:|----------------|-----:|-----------|-----:|---|-----:|
+ |bbh_zeroshot_tracking_shuffled_objects_three_objects| 2|flexible-extract| 0|exact_match|0.3120|± |0.0294|
+ |bbh_zeroshot_tracking_shuffled_objects_seven_objects| 2|flexible-extract| 0|exact_match|0.1560|± |0.0230|
+ |bbh_zeroshot_tracking_shuffled_objects_five_objects | 2|flexible-extract| 0|exact_match|0.1720|± |0.0239|
+ |bbh_zeroshot_temporal_sequences | 2|flexible-extract| 0|exact_match|0.3960|± |0.0310|
+ |bbh_zeroshot_sports_understanding | 2|flexible-extract| 0|exact_match|0.8120|± |0.0248|
+ |bbh_zeroshot_snarks | 2|flexible-extract| 0|exact_match|0.5843|± |0.0370|
+ |bbh_zeroshot_salient_translation_error_detection | 2|flexible-extract| 0|exact_match|0.4640|± |0.0316|
+ |bbh_zeroshot_ruin_names | 2|flexible-extract| 0|exact_match|0.4360|± |0.0314|
+ |bbh_zeroshot_reasoning_about_colored_objects | 2|flexible-extract| 0|exact_match|0.5520|± |0.0315|
+ |bbh_zeroshot_navigate | 2|flexible-extract| 0|exact_match|0.5800|± |0.0313|
+ |bbh_zeroshot_movie_recommendation | 2|flexible-extract| 0|exact_match|0.7320|± |0.0281|
+ |bbh_zeroshot_logical_deduction_three_objects | 2|flexible-extract| 0|exact_match|0.5680|± |0.0314|
+ |bbh_zeroshot_logical_deduction_seven_objects | 2|flexible-extract| 0|exact_match|0.3920|± |0.0309|
+ |bbh_zeroshot_logical_deduction_five_objects | 2|flexible-extract| 0|exact_match|0.3960|± |0.0310|
+ |bbh_zeroshot_geometric_shapes | 2|flexible-extract| 0|exact_match|0.3800|± |0.0308|
+ |bbh_zeroshot_disambiguation_qa | 2|flexible-extract| 0|exact_match|0.6760|± |0.0297|
+ |bbh_zeroshot_date_understanding | 2|flexible-extract| 0|exact_match|0.4400|± |0.0315|
+ |bbh_zeroshot_causal_judgement | 2|flexible-extract| 0|exact_match|0.5882|± |0.0361|
+
+ Average: 47.98%
+
+ </details>
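
The per-benchmark "Average" figures above appear to be plain means of the per-task scores in the detail tables (acc_norm where reported, otherwise acc or exact_match), and the summary table's "Average ⬇️" column matches the mean of the four benchmark scores. Below is a minimal sketch, not part of the commit, that reproduces the AGIEval figure under that assumption; the helper `benchmark_average` is hypothetical.

```python
# Minimal sketch: reproduce the reported AGIEval average from the detail table.
# Assumption (not stated in the card): the benchmark average is the plain mean
# of the per-task acc_norm values, reported as a percentage.

def benchmark_average(task_scores: dict[str, float]) -> float:
    """Plain mean of per-task scores, expressed as a percentage."""
    return 100 * sum(task_scores.values()) / len(task_scores)

# acc_norm values copied from the AGIEval table above
agieval_acc_norm = {
    "agieval_sat_math": 0.3045,
    "agieval_sat_en_without_passage": 0.4612,
    "agieval_sat_en": 0.7621,
    "agieval_lsat_rc": 0.6059,
    "agieval_lsat_lr": 0.5216,
    "agieval_lsat_ar": 0.2174,
    "agieval_logiqa_en": 0.4101,
    "agieval_aqua_rat": 0.2677,
}

print(f"AGIEval: {benchmark_average(agieval_acc_norm):.2f}%")   # ≈ 44.38%

# The summary table's "Average ⬇️" column matches the mean of the four
# benchmark scores for this model:
print(f"Overall: {(44.38 + 74.76 + 58.57 + 47.98) / 4:.2f}%")   # ≈ 56.42%
```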
+


  ## Disclaimer