Update README.md
README.md CHANGED
@@ -42,7 +42,83 @@ The WANDB Project is public and can be examined at this link: https://wandb.ai/t

## Benchmark Information

## Benchmark Results

GPT-4All Benchmark Set
```
| Task |Version| Metric |Value | |Stderr|
|-------------|------:|--------|-----:|---|-----:|
|arc_challenge| 0|acc |0.5009|± |0.0146|
| | |acc_norm|0.5247|± |0.0146|
|arc_easy | 0|acc |0.8127|± |0.0080|
| | |acc_norm|0.7854|± |0.0084|
|boolq | 1|acc |0.8153|± |0.0068|
|hellaswag | 0|acc |0.6126|± |0.0049|
| | |acc_norm|0.7995|± |0.0040|
|openbookqa | 0|acc |0.3660|± |0.0216|
| | |acc_norm|0.4600|± |0.0223|
|piqa | 0|acc |0.7922|± |0.0095|
| | |acc_norm|0.8112|± |0.0091|
|winogrande | 0|acc |0.7293|± |0.0125|
```

AGI-Eval
```
| Task |Version| Metric |Value | |Stderr|
|------------------------------|------:|--------|-----:|---|-----:|
|agieval_aqua_rat | 0|acc |0.2008|± |0.0252|
| | |acc_norm|0.2126|± |0.0257|
|agieval_logiqa_en | 0|acc |0.3410|± |0.0186|
| | |acc_norm|0.3564|± |0.0188|
|agieval_lsat_ar | 0|acc |0.2261|± |0.0276|
| | |acc_norm|0.2174|± |0.0273|
|agieval_lsat_lr | 0|acc |0.3725|± |0.0214|
| | |acc_norm|0.3373|± |0.0210|
|agieval_lsat_rc | 0|acc |0.4684|± |0.0305|
| | |acc_norm|0.4572|± |0.0304|
|agieval_sat_en | 0|acc |0.6553|± |0.0332|
| | |acc_norm|0.5971|± |0.0343|
|agieval_sat_en_without_passage| 0|acc |0.4515|± |0.0348|
| | |acc_norm|0.4029|± |0.0343|
|agieval_sat_math | 0|acc |0.3273|± |0.0317|
| | |acc_norm|0.2636|± |0.0298|
```

BigBench Reasoning Test
```
| Task |Version| Metric |Value | |Stderr|
|------------------------------------------------|------:|---------------------|-----:|---|-----:|
|bigbench_causal_judgement | 0|multiple_choice_grade|0.5368|± |0.0363|
|bigbench_date_understanding | 0|multiple_choice_grade|0.7127|± |0.0236|
|bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3023|± |0.0286|
|bigbench_geometric_shapes | 0|multiple_choice_grade|0.1003|± |0.0159|
| | |exact_str_match |0.0000|± |0.0000|
|bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2720|± |0.0199|
|bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.1986|± |0.0151|
|bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4500|± |0.0288|
|bigbench_movie_recommendation | 0|multiple_choice_grade|0.2880|± |0.0203|
|bigbench_navigate | 0|multiple_choice_grade|0.5000|± |0.0158|
|bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.5390|± |0.0111|
|bigbench_ruin_names | 0|multiple_choice_grade|0.3906|± |0.0231|
|bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.1844|± |0.0123|
|bigbench_snarks | 0|multiple_choice_grade|0.5249|± |0.0372|
|bigbench_sports_understanding | 0|multiple_choice_grade|0.5335|± |0.0159|
|bigbench_temporal_sequences | 0|multiple_choice_grade|0.2980|± |0.0145|
|bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2048|± |0.0114|
|bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1297|± |0.0080|
|bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4500|± |0.0288|
```
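
The three tables above are in the output format of EleutherAI's lm-evaluation-harness. Assuming that harness (or a compatible fork) produced them, a minimal sketch of re-running the GPT-4All subset with the harness's Python API might look like the following; the model id, zero-shot setting, and batch size here are assumptions, not values stated in this README.

```python
# Minimal sketch (not the exact command used for these tables): score a model on
# the GPT-4All-style task subset with EleutherAI's lm-evaluation-harness (v0.4+).
# The model id, num_fewshot, and batch_size are placeholders/assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face causal-LM backend
    model_args="pretrained=teknium/OpenHermes-13B",  # placeholder model id
    tasks=[
        "arc_challenge", "arc_easy", "boolq", "hellaswag",
        "openbookqa", "piqa", "winogrande",
    ],
    num_fewshot=0,  # assumption: zero-shot, as is typical for this suite
    batch_size=8,   # assumption
)

# Per-task metric dicts (acc, acc_norm, stderr, ...) keyed by task name.
for task, metrics in results["results"].items():
    print(task, metrics)
```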

This is a slight improvement on the GPT4All and BigBench suites, with a degradation on AGIEval, compared to the original Hermes.

Average Score Comparison between Nous-Hermes Llama-2 and OpenHermes Llama-2:
```
| Bench    | Nous-Hermes | OpenHermes | Change |
|----------|------------:|-----------:|-------:|
| GPT4All  |       70.00 |      70.36 |  +0.36 |
| BigBench |       36.57 |      36.75 |  +0.18 |
| AGI Eval |       37.20 |      35.56 |  -1.64 |
```
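
For reference, the OpenHermes column above appears to be the simple mean of the per-task scores reported earlier, taking acc_norm where it is reported and acc otherwise (multiple_choice_grade for BigBench), rescaled to percent. A minimal sketch that reproduces those three numbers; the averaging rule is inferred from the data, not stated in this README:

```python
# Minimal sketch: recompute the OpenHermes suite averages from the per-task
# scores in the tables above. Assumes each suite average is a plain mean of
# acc_norm where reported (otherwise acc), or multiple_choice_grade for BigBench.

gpt4all = {
    "arc_challenge": 0.5247,  # acc_norm
    "arc_easy": 0.7854,       # acc_norm
    "boolq": 0.8153,          # acc (no acc_norm reported)
    "hellaswag": 0.7995,      # acc_norm
    "openbookqa": 0.4600,     # acc_norm
    "piqa": 0.8112,           # acc_norm
    "winogrande": 0.7293,     # acc (no acc_norm reported)
}

agieval = {  # acc_norm for every AGI-Eval task
    "aqua_rat": 0.2126, "logiqa_en": 0.3564, "lsat_ar": 0.2174,
    "lsat_lr": 0.3373, "lsat_rc": 0.4572, "sat_en": 0.5971,
    "sat_en_without_passage": 0.4029, "sat_math": 0.2636,
}

bigbench = [  # multiple_choice_grade for the 18 BigBench tasks, in table order
    0.5368, 0.7127, 0.3023, 0.1003, 0.2720, 0.1986, 0.4500, 0.2880, 0.5000,
    0.5390, 0.3906, 0.1844, 0.5249, 0.5335, 0.2980, 0.2048, 0.1297, 0.4500,
]

def mean_pct(scores):
    """Plain mean of the scores, expressed as a percentage."""
    values = list(scores.values()) if isinstance(scores, dict) else list(scores)
    return round(100 * sum(values) / len(values), 2)

print("GPT4All :", mean_pct(gpt4all))   # 70.36
print("AGI Eval:", mean_pct(agieval))   # 35.56
print("BigBench:", mean_pct(bigbench))  # 36.75
```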
## Training procedure