benchmarked on lm-evaluation-harness 0.4.1
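The scores below could in principle be reproduced with the lm-evaluation-harness CLI. A minimal invocation sketch, assuming the `hf` backend and the task names from the tables (exact task configs and batch settings used for these runs are not stated here):

```shell
# Sketch only: pin the harness version the scores were produced with,
# then evaluate the model on a couple of the reported tasks.
pip install lm-eval==0.4.1
lm_eval --model hf \
  --model_args pretrained=VAGOsolutions/SauerkrautLM-14b-MoE-LaserChat \
  --tasks winogrande,gsm8k \
  --num_fewshot 5 \
  --batch_size auto
```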
| Winogrande (5-shot) | 80.74 |
| GSM8K (5-shot) | 74.15 |

**Performance**

| Model |AGIEval|GPT4All|TruthfulQA|BigBench|Average ⬇️|
|-------|------:|------:|---------:|-------:|---------:|
|[VAGOsolutions/SauerkrautLM-14b-MoE-LaserChat](https://huggingface.co/VAGOsolutions/SauerkrautLM-14b-MoE-LaserChat) | 44.38| 74.76| 58.57| 47.98| 56.42|
|[VAGOsolutions/SauerkrautLM-Gemma-7b](https://huggingface.co/VAGOsolutions/SauerkrautLM-Gemma-7b) | 37.50| 72.46| 61.24| 45.33| 54.13|
|[zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) | 37.52| 71.77| 55.26| 39.77| 51.08|
|[zephyr-7b-gemma-v0.1](https://huggingface.co/HuggingFaceH4/zephyr-7b-gemma-v0.1)| 34.22| 66.37| 52.19| 37.10| 47.47|
|[google/gemma-7b-it](https://huggingface.co/google/gemma-7b-it) | 21.33| 40.84| 41.70| 30.25| 33.53|

<details><summary>Details of AGIEval, GPT4All, TruthfulQA, BigBench</summary>

**AGIEval**

| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|------------------------------|------:|------|------|--------|-----:|---|-----:|
|agieval_sat_math | 1|none |None |acc |0.3727|± |0.0327|
| | |none |None |acc_norm|0.3045|± |0.0311|
|agieval_sat_en_without_passage| 1|none |None |acc |0.4806|± |0.0349|
| | |none |None |acc_norm|0.4612|± |0.0348|
|agieval_sat_en | 1|none |None |acc |0.7816|± |0.0289|
| | |none |None |acc_norm|0.7621|± |0.0297|
|agieval_lsat_rc | 1|none |None |acc |0.6134|± |0.0297|
| | |none |None |acc_norm|0.6059|± |0.0298|
|agieval_lsat_lr | 1|none |None |acc |0.5431|± |0.0221|
| | |none |None |acc_norm|0.5216|± |0.0221|
|agieval_lsat_ar | 1|none |None |acc |0.2435|± |0.0284|
| | |none |None |acc_norm|0.2174|± |0.0273|
|agieval_logiqa_en | 1|none |None |acc |0.3871|± |0.0191|
| | |none |None |acc_norm|0.4101|± |0.0193|
|agieval_aqua_rat | 1|none |None |acc |0.3031|± |0.0289|
| | |none |None |acc_norm|0.2677|± |0.0278|

Average: 44.38%
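The AGIEval average appears to be the plain mean of the `acc_norm` values in the table; a quick check (the `acc_norm` convention is an assumption, not stated in the harness output):

```python
# Mean of the eight AGIEval acc_norm scores from the table above.
acc_norm = [0.3045, 0.4612, 0.7621, 0.6059, 0.5216, 0.2174, 0.4101, 0.2677]
average = 100 * sum(acc_norm) / len(acc_norm)
print(round(average, 2))  # 44.38
```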

**GPT4All**

| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|-------------|------:|------|------|--------|-----:|---|-----:|
|arc_challenge| 1|none |None |acc |0.5947|± |0.0143|
| | |none |None |acc_norm|0.6280|± |0.0141|
|arc_easy | 1|none |None |acc |0.8506|± |0.0073|
| | |none |None |acc_norm|0.8468|± |0.0074|
|boolq | 2|none |None |acc |0.8761|± |0.0058|
|hellaswag | 1|none |None |acc |0.6309|± |0.0048|
| | |none |None |acc_norm|0.8323|± |0.0037|
|openbookqa | 1|none |None |acc |0.3260|± |0.0210|
| | |none |None |acc_norm|0.4700|± |0.0223|
|piqa | 1|none |None |acc |0.8237|± |0.0089|
| | |none |None |acc_norm|0.8335|± |0.0087|
|winogrande | 1|none |None |acc |0.7466|± |0.0122|

Average: 74.76%

**TruthfulQA**

| Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
|--------------|------:|------|-----:|------|-----:|---|-----:|
|truthfulqa_mc2| 2|none | 0|acc |0.5857|± |0.0141|

Average: 58.57%
**BigBench**

| Tasks |Version| Filter |n-shot| Metric |Value | |Stderr|
|----------------------------------------------------|------:|----------------|-----:|-----------|-----:|---|-----:|
|bbh_zeroshot_tracking_shuffled_objects_three_objects| 2|flexible-extract| 0|exact_match|0.3120|± |0.0294|
|bbh_zeroshot_tracking_shuffled_objects_seven_objects| 2|flexible-extract| 0|exact_match|0.1560|± |0.0230|
|bbh_zeroshot_tracking_shuffled_objects_five_objects | 2|flexible-extract| 0|exact_match|0.1720|± |0.0239|
|bbh_zeroshot_temporal_sequences | 2|flexible-extract| 0|exact_match|0.3960|± |0.0310|
|bbh_zeroshot_sports_understanding | 2|flexible-extract| 0|exact_match|0.8120|± |0.0248|
|bbh_zeroshot_snarks | 2|flexible-extract| 0|exact_match|0.5843|± |0.0370|
|bbh_zeroshot_salient_translation_error_detection | 2|flexible-extract| 0|exact_match|0.4640|± |0.0316|
|bbh_zeroshot_ruin_names | 2|flexible-extract| 0|exact_match|0.4360|± |0.0314|
|bbh_zeroshot_reasoning_about_colored_objects | 2|flexible-extract| 0|exact_match|0.5520|± |0.0315|
|bbh_zeroshot_navigate | 2|flexible-extract| 0|exact_match|0.5800|± |0.0313|
|bbh_zeroshot_movie_recommendation | 2|flexible-extract| 0|exact_match|0.7320|± |0.0281|
|bbh_zeroshot_logical_deduction_three_objects | 2|flexible-extract| 0|exact_match|0.5680|± |0.0314|
|bbh_zeroshot_logical_deduction_seven_objects | 2|flexible-extract| 0|exact_match|0.3920|± |0.0309|
|bbh_zeroshot_logical_deduction_five_objects | 2|flexible-extract| 0|exact_match|0.3960|± |0.0310|
|bbh_zeroshot_geometric_shapes | 2|flexible-extract| 0|exact_match|0.3800|± |0.0308|
|bbh_zeroshot_disambiguation_qa | 2|flexible-extract| 0|exact_match|0.6760|± |0.0297|
|bbh_zeroshot_date_understanding | 2|flexible-extract| 0|exact_match|0.4400|± |0.0315|
|bbh_zeroshot_causal_judgement | 2|flexible-extract| 0|exact_match|0.5882|± |0.0361|

Average: 47.98%
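The Average ⬇️ column in the Performance table is the mean of the four category averages; a quick check:

```python
# Overall score = mean of the four per-category averages reported above.
categories = {"AGIEval": 44.38, "GPT4All": 74.76, "TruthfulQA": 58.57, "BigBench": 47.98}
overall = sum(categories.values()) / len(categories)
print(round(overall, 2))  # 56.42
```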

</details>

## Disclaimer