DavidGF committed
Commit e0d79b5
1 Parent(s): d50b59f

Update README.md

Files changed (1)
  1. README.md +88 -0
README.md CHANGED
 
@@ -16,6 +16,7 @@ tags:
  ---
  **Update**
  - 01.03.2024 - Reuploaded the model in bfloat16 dtype.
+ - 02.03.2024 - **strongest Gemma finetune model so far**: added AGIEval, GPT4All and BigBench results

  ![SauerkrautLM](https://vago-solutions.de/wp-content/uploads/2024/02/sauerkrautgemma.jpeg "SauerkrautLM-Gemma-7b")
  ## VAGO solutions SauerkrautLM-Gemma-7b (alpha)
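The update above notes that the checkpoint was reuploaded in bfloat16. As a minimal loading sketch (not part of the original card; it assumes `transformers`, `torch` and `accelerate` are installed), one might keep the weights in that dtype like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: load the checkpoint in its native bfloat16 dtype.
model_id = "VAGOsolutions/SauerkrautLM-Gemma-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the 01.03.2024 bfloat16 reupload
    device_map="auto",           # needs accelerate; omit for a plain single-device load
)
```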
 
@@ -105,6 +106,93 @@ ASSISTANT:
  | Winogrande (5-shot) | 76.64 |
  | GSM8K (5-shot) | 63.68 |

+ **Performance**
+
+ | Model |AGIEval|GPT4All|TruthfulQA|BigBench|Average ⬇️|
+ |-----------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
+ |[VAGOsolutions/SauerkrautLM-Gemma-7b](https://huggingface.co/VAGOsolutions/SauerkrautLM-Gemma-7b) | 37.5| 72.46| 61.24| 45.33| 54.13|
+ |[zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) | 37.52| 71.77| 55.26| 39.77| 51.08|
+ |[zephyr-7b-gemma-v0.1](https://huggingface.co/HuggingFaceH4/zephyr-7b-gemma-v0.1)| 34.22| 66.37| 52.19| 37.10| 47.47|
+ |[google/gemma-7b-it](https://huggingface.co/google/gemma-7b-it) | 21.33| 40.84| 41.70| 30.25| 33.53|
+
+
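For reference, the Average ⬇️ column above is the plain arithmetic mean of the four suite scores; a quick sanity check for the SauerkrautLM-Gemma-7b row (illustrative only, not part of the original card):

```python
# Sanity check: the "Average" column is the mean of the four benchmark scores.
scores = {"AGIEval": 37.5, "GPT4All": 72.46, "TruthfulQA": 61.24, "BigBench": 45.33}
print(round(sum(scores.values()) / len(scores), 2))  # 54.13
```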
+ <details><summary>Details of AGIEval, GPT4All, TruthfulQA, BigBench</summary>
+
+ **AGIEval**
+ | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
+ |------------------------------|------:|------|------|--------|-----:|---|-----:|
+ |agieval_sat_math | 1|none |None |acc |0.3682|± |0.0326|
+ | | |none |None |acc_norm|0.3364|± |0.0319|
+ |agieval_sat_en_without_passage| 1|none |None |acc |0.4272|± |0.0345|
+ | | |none |None |acc_norm|0.3738|± |0.0338|
+ |agieval_sat_en | 1|none |None |acc |0.7427|± |0.0305|
+ | | |none |None |acc_norm|0.6893|± |0.0323|
+ |agieval_lsat_rc | 1|none |None |acc |0.5539|± |0.0304|
+ | | |none |None |acc_norm|0.5167|± |0.0305|
+ |agieval_lsat_lr | 1|none |None |acc |0.3431|± |0.0210|
+ | | |none |None |acc_norm|0.3471|± |0.0211|
+ |agieval_lsat_ar | 1|none |None |acc |0.1913|± |0.0260|
+ | | |none |None |acc_norm|0.1739|± |0.0250|
+ |agieval_logiqa_en | 1|none |None |acc |0.3303|± |0.0184|
+ | | |none |None |acc_norm|0.3303|± |0.0184|
+ |agieval_aqua_rat | 1|none |None |acc |0.2480|± |0.0272|
+ | | |none |None |acc_norm|0.2323|± |0.0265|
+
+ Average: 37.5%
+
+ **GPT4All**
+ | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
+ |---------|------:|------|------|--------|-----:|---|-----:|
+ |arc_challenge| 1|none |None |acc |0.5358|± |0.0146|
+ | | |none |None |acc_norm|0.5597|± |0.0145|
+ |arc_easy | 1|none |None |acc |0.8249|± |0.0078|
+ | | |none |None |acc_norm|0.7955|± |0.0083|
+ |boolq | 2|none |None |acc |0.8651|± |0.006 |
+ |hellaswag | 1|none |None |acc |0.6162|± |0.0049|
+ | | |none |None |acc_norm|0.8117|± |0.0039|
+ |openbookqa | 1|none |None |acc |0.336|± |0.0211|
+ | | |none |None |acc_norm|0.470|± |0.0223|
+ |piqa | 1|none |None |acc |0.7900|± |0.0095|
+ | | |none |None |acc_norm|0.8096|± |0.00 |
+ |winogrande | 1|none |None |acc |0.7609|± |0.012 |
+
+ Average: 72.46%
+
+ **TruthfulQA**
+ | Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
+ |--------------|------:|------|-----:|------|-----:|---|-----:|
+ |truthfulqa_mc2| 2|none | 0|acc |0.6124|± |0.0148|
+
+
+ Average: 61.24%
+
+ **BigBench**
+ | Tasks |Version| Filter |n-shot| Metric |Value | |Stderr|
+ |----------------------------------------------------|------:|----------------|-----:|-----------|-----:|---|-----:|
+ |bbh_zeroshot_tracking_shuffled_objects_three_objects| 2|flexible-extract| 0|exact_match|0.2760|± |0.0283|
+ |bbh_zeroshot_tracking_shuffled_objects_seven_objects| 2|flexible-extract| 0|exact_match|0.1280|± |0.0212|
+ |bbh_zeroshot_tracking_shuffled_objects_five_objects | 2|flexible-extract| 0|exact_match|0.1240|± |0.0209|
+ |bbh_zeroshot_temporal_sequences | 2|flexible-extract| 0|exact_match|0.4520|± |0.0315|
+ |bbh_zeroshot_sports_understanding | 2|flexible-extract| 0|exact_match|0.7120|± |0.0287|
+ |bbh_zeroshot_snarks | 2|flexible-extract| 0|exact_match|0.5056|± |0.0376|
+ |bbh_zeroshot_salient_translation_error_detection | 2|flexible-extract| 0|exact_match|0.4480|± |0.0315|
+ |bbh_zeroshot_ruin_names | 2|flexible-extract| 0|exact_match|0.4520|± |0.0315|
+ |bbh_zeroshot_reasoning_about_colored_objects | 2|flexible-extract| 0|exact_match|0.4800|± |0.0317|
+ |bbh_zeroshot_navigate | 2|flexible-extract| 0|exact_match|0.5480|± |0.0315|
+ |bbh_zeroshot_movie_recommendation | 2|flexible-extract| 0|exact_match|0.7000|± |0.0290|
+ |bbh_zeroshot_logical_deduction_three_objects | 2|flexible-extract| 0|exact_match|0.5200|± |0.0317|
+ |bbh_zeroshot_logical_deduction_seven_objects | 2|flexible-extract| 0|exact_match|0.4120|± |0.0312|
+ |bbh_zeroshot_logical_deduction_five_objects | 2|flexible-extract| 0|exact_match|0.3840|± |0.0308|
+ |bbh_zeroshot_geometric_shapes | 2|flexible-extract| 0|exact_match|0.2920|± |0.0288|
+ |bbh_zeroshot_disambiguation_qa | 2|flexible-extract| 0|exact_match|0.6480|± |0.0303|
+ |bbh_zeroshot_date_understanding | 2|flexible-extract| 0|exact_match|0.5000|± |0.0317|
+ |bbh_zeroshot_causal_judgement | 2|flexible-extract| 0|exact_match|0.5775|± |0.0362|
+
+ Average: 45.33%
+
+ </details>
+
+
  Despite the fact that we achieved great results on the Open LLM Leaderboard benchmarks, the model subjectively does not feel as smart as comparable Mistral finetunes. Most of its answers are coherent, but we observed that it sometimes answers in a really lazy or odd way.

  ## Disclaimer