teknium committed
Commit 04abfdf
1 Parent(s): e3fe64b

Update README.md

Files changed (1)
  1. README.md +77 -1
README.md CHANGED
@@ -42,7 +42,83 @@ The WANDB Project is public and can be examined at this link: https://wandb.ai/t
 
 ## Benchmark Information
 
- More information needed
+ ## Benchmark Results
+
+ GPT-4All Benchmark Set
+ ```
+ | Task |Version| Metric |Value | |Stderr|
+ |-------------|------:|--------|-----:|---|-----:|
+ |arc_challenge| 0|acc |0.5009|± |0.0146|
+ | | |acc_norm|0.5247|± |0.0146|
+ |arc_easy | 0|acc |0.8127|± |0.0080|
+ | | |acc_norm|0.7854|± |0.0084|
+ |boolq | 1|acc |0.8153|± |0.0068|
+ |hellaswag | 0|acc |0.6126|± |0.0049|
+ | | |acc_norm|0.7995|± |0.0040|
+ |openbookqa | 0|acc |0.3660|± |0.0216|
+ | | |acc_norm|0.4600|± |0.0223|
+ |piqa | 0|acc |0.7922|± |0.0095|
+ | | |acc_norm|0.8112|± |0.0091|
+ |winogrande | 0|acc |0.7293|± |0.0125|
+ ```
+ AGI-Eval
+ ```
+ | Task |Version| Metric |Value | |Stderr|
+ |------------------------------|------:|--------|-----:|---|-----:|
+ |agieval_aqua_rat | 0|acc |0.2008|± |0.0252|
+ | | |acc_norm|0.2126|± |0.0257|
+ |agieval_logiqa_en | 0|acc |0.3410|± |0.0186|
+ | | |acc_norm|0.3564|± |0.0188|
+ |agieval_lsat_ar | 0|acc |0.2261|± |0.0276|
+ | | |acc_norm|0.2174|± |0.0273|
+ |agieval_lsat_lr | 0|acc |0.3725|± |0.0214|
+ | | |acc_norm|0.3373|± |0.0210|
+ |agieval_lsat_rc | 0|acc |0.4684|± |0.0305|
+ | | |acc_norm|0.4572|± |0.0304|
+ |agieval_sat_en | 0|acc |0.6553|± |0.0332|
+ | | |acc_norm|0.5971|± |0.0343|
+ |agieval_sat_en_without_passage| 0|acc |0.4515|± |0.0348|
+ | | |acc_norm|0.4029|± |0.0343|
+ |agieval_sat_math | 0|acc |0.3273|± |0.0317|
+ | | |acc_norm|0.2636|± |0.0298|
+ ```
+ BigBench Reasoning Test
+ ```
+ | Task |Version| Metric |Value | |Stderr|
+ |------------------------------------------------|------:|---------------------|-----:|---|-----:|
+ |bigbench_causal_judgement | 0|multiple_choice_grade|0.5368|± |0.0363|
+ |bigbench_date_understanding | 0|multiple_choice_grade|0.7127|± |0.0236|
+ |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.3023|± |0.0286|
+ |bigbench_geometric_shapes | 0|multiple_choice_grade|0.1003|± |0.0159|
+ | | |exact_str_match |0.0000|± |0.0000|
+ |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.2720|± |0.0199|
+ |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.1986|± |0.0151|
+ |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.4500|± |0.0288|
+ |bigbench_movie_recommendation | 0|multiple_choice_grade|0.2880|± |0.0203|
+ |bigbench_navigate | 0|multiple_choice_grade|0.5000|± |0.0158|
+ |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.5390|± |0.0111|
+ |bigbench_ruin_names | 0|multiple_choice_grade|0.3906|± |0.0231|
+ |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.1844|± |0.0123|
+ |bigbench_snarks | 0|multiple_choice_grade|0.5249|± |0.0372|
+ |bigbench_sports_understanding | 0|multiple_choice_grade|0.5335|± |0.0159|
+ |bigbench_temporal_sequences | 0|multiple_choice_grade|0.2980|± |0.0145|
+ |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2048|± |0.0114|
+ |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1297|± |0.0080|
+ |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.4500|± |0.0288|
+ ```
+
+ This is a slight improvement on the GPT4All suite and the BigBench suite, with a degradation on AGIEval, compared to the original Hermes.
+
+ Average Score Comparison between Nous-Hermes Llama-2 and OpenHermes Llama-2:
+ ```
+ | Bench    | Nous-Hermes | OpenHermes | Change |
+ |----------|------------:|-----------:|-------:|
+ | GPT4All  |       70.00 |      70.36 |  +0.36 |
+ | BigBench |       36.57 |      36.75 |  +0.18 |
+ | AGI Eval |       37.20 |      35.56 |  -1.64 |
+ ```
 
 ## Training procedure
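
The per-task tables above are in the standard output format of EleutherAI's lm-evaluation-harness (Task / Version / Metric / Value / Stderr). A minimal sketch of a zero-shot run through the harness's Python API follows; the backend name reflects the v0.3-era API, and the model repo id is an assumption rather than something stated in this commit:

```python
# Hedged sketch: scoring the GPT4All task set with EleutherAI's
# lm-evaluation-harness (v0.3-era API). The repo id is an assumption.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                               # Hugging Face causal-LM backend
    model_args="pretrained=teknium/OpenHermes-13B",  # assumed model repo id
    tasks=["arc_challenge", "arc_easy", "boolq", "hellaswag",
           "openbookqa", "piqa", "winogrande"],      # the GPT4All task set above
    num_fewshot=0,                                   # zero-shot
)
print(results["results"])                            # per-task acc / acc_norm with stderr
```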
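
The suite averages in the comparison table can be recomputed from the per-task tables. A short sketch, assuming each average takes acc_norm where reported, plain acc otherwise, and multiple_choice_grade for every BigBench task; that selection rule reproduces the reported OpenHermes numbers exactly:

```python
# Scores transcribed from the tables above: acc_norm where reported,
# acc otherwise (boolq, winogrande), multiple_choice_grade for BigBench.
gpt4all = [0.5247, 0.7854, 0.8153, 0.7995, 0.4600, 0.8112, 0.7293]
agieval = [0.2126, 0.3564, 0.2174, 0.3373, 0.4572, 0.5971, 0.4029, 0.2636]
bigbench = [0.5368, 0.7127, 0.3023, 0.1003, 0.2720, 0.1986, 0.4500,
            0.2880, 0.5000, 0.5390, 0.3906, 0.1844, 0.5249, 0.5335,
            0.2980, 0.2048, 0.1297, 0.4500]

for name, scores in [("GPT4All", gpt4all), ("AGI Eval", agieval),
                     ("BigBench", bigbench)]:
    print(f"{name}: {100 * sum(scores) / len(scores):.2f}")
# GPT4All: 70.36, AGI Eval: 35.56, BigBench: 36.75 -- matching the table
```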