retiredcarboxyl committed
Commit 3b1f4ad
1 Parent(s): 2b4b041

added model comparisons

Files changed (1): README.md (+87 -1)
README.md CHANGED
@@ -68,4 +68,90 @@ pipeline = transformers.pipeline(

  outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
  print(outputs[0]["generated_text"])
- ```
+ ```
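+
+ The hunk above shows only the tail of the usage snippet. For context, here is a minimal, self-contained sketch of the full call; the model id, dtype, and prompt are illustrative assumptions, not values taken from this commit:
+
+ ```python
+ # Minimal sketch of the full generation call (model id, dtype, and prompt
+ # are assumptions for illustration, not part of this commit).
+ import torch
+ import transformers
+
+ pipeline = transformers.pipeline(
+     "text-generation",
+     model="retiredcarboxyl/Chat2Eco",  # hypothetical model id
+     torch_dtype=torch.bfloat16,        # assumption: reduced precision to fit one GPU
+     device_map="auto",
+ )
+
+ prompt = "Summarize the greenhouse effect in two sentences."
+ outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
+ print(outputs[0]["generated_text"])
+ ```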
+
+ ## Benchmark Results
+
+ Chat2Eco is a major improvement across the board compared to the base model, beating it on every benchmark suite reported below.
+
+ ## GPT4All:
+ ```
+ | Task |Version| Metric |Value | |Stderr|
+ |-------------|------:|--------|-----:|---|-----:|
+ |arc_challenge| 0|acc |0.5990|± |0.0143|
+ | | |acc_norm|0.6425|± |0.0140|
+ |arc_easy | 0|acc |0.8657|± |0.0070|
+ | | |acc_norm|0.8636|± |0.0070|
+ |boolq | 1|acc |0.8783|± |0.0057|
+ |hellaswag | 0|acc |0.6661|± |0.0047|
+ | | |acc_norm|0.8489|± |0.0036|
+ |openbookqa | 0|acc |0.3440|± |0.0213|
+ | | |acc_norm|0.4660|± |0.0223|
+ |piqa | 0|acc |0.8324|± |0.0087|
+ | | |acc_norm|0.8379|± |0.0086|
+ |winogrande | 0|acc |0.7616|± |0.0120|
+ ```
+ Average: 75.70
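+
+ The average above is recomputed from the table under the usual GPT4All convention, which is an assumption here: take acc_norm where it is reported and plain acc otherwise. A short sketch of that arithmetic:
+
+ ```python
+ # Sketch: recompute the GPT4All average from the table above.
+ # Assumption: acc_norm where reported, acc otherwise (boolq, winogrande).
+ scores = {
+     "arc_challenge": 0.6425,  # acc_norm
+     "arc_easy":      0.8636,  # acc_norm
+     "boolq":         0.8783,  # acc (only metric reported)
+     "hellaswag":     0.8489,  # acc_norm
+     "openbookqa":    0.4660,  # acc_norm
+     "piqa":          0.8379,  # acc_norm
+     "winogrande":    0.7616,  # acc (only metric reported)
+ }
+ print(f"Average: {100 * sum(scores.values()) / len(scores):.2f}")  # Average: 75.70
+ ```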
+
+ ## AGIEval:
+ ```
+ | Task |Version| Metric |Value | |Stderr|
+ |------------------------------|------:|--------|-----:|---|-----:|
+ |agieval_aqua_rat | 0|acc |0.2402|± |0.0269|
+ | | |acc_norm|0.2520|± |0.0273|
+ |agieval_logiqa_en | 0|acc |0.4117|± |0.0193|
+ | | |acc_norm|0.4055|± |0.0193|
+ |agieval_lsat_ar | 0|acc |0.2348|± |0.0280|
+ | | |acc_norm|0.2087|± |0.0269|
+ |agieval_lsat_lr | 0|acc |0.5549|± |0.0220|
+ | | |acc_norm|0.5294|± |0.0221|
+ |agieval_lsat_rc | 0|acc |0.6617|± |0.0289|
+ | | |acc_norm|0.6357|± |0.0294|
+ |agieval_sat_en | 0|acc |0.8010|± |0.0279|
+ | | |acc_norm|0.7913|± |0.0284|
+ |agieval_sat_en_without_passage| 0|acc |0.4806|± |0.0349|
+ | | |acc_norm|0.4612|± |0.0348|
+ |agieval_sat_math | 0|acc |0.4909|± |0.0338|
+ | | |acc_norm|0.4000|± |0.0331|
+ ```
+ Average: 46.05
+
+ ## BigBench:
+ ```
+ | Task |Version| Metric |Value | |Stderr|
+ |------------------------------------------------|------:|---------------------|-----:|---|-----:|
+ |bigbench_causal_judgement | 0|multiple_choice_grade|0.6105|± |0.0355|
+ |bigbench_date_understanding | 0|multiple_choice_grade|0.7182|± |0.0235|
+ |bigbench_disambiguation_qa | 0|multiple_choice_grade|0.5736|± |0.0308|
+ |bigbench_geometric_shapes | 0|multiple_choice_grade|0.4596|± |0.0263|
+ | | |exact_str_match |0.0000|± |0.0000|
+ |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|0.3500|± |0.0214|
+ |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|0.2500|± |0.0164|
+ |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|0.5200|± |0.0289|
+ |bigbench_movie_recommendation | 0|multiple_choice_grade|0.3540|± |0.0214|
+ |bigbench_navigate | 0|multiple_choice_grade|0.5000|± |0.0158|
+ |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|0.6900|± |0.0103|
+ |bigbench_ruin_names | 0|multiple_choice_grade|0.6317|± |0.0228|
+ |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|0.2535|± |0.0138|
+ |bigbench_snarks | 0|multiple_choice_grade|0.7293|± |0.0331|
+ |bigbench_sports_understanding | 0|multiple_choice_grade|0.6744|± |0.0149|
+ |bigbench_temporal_sequences | 0|multiple_choice_grade|0.7400|± |0.0139|
+ |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|0.2176|± |0.0117|
+ |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|0.1543|± |0.0086|
+ |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|0.5200|± |0.0289|
+ ```
+ Average: 49.70
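+
+ Tables in this format are the output of EleutherAI's lm-evaluation-harness. As a hedged sketch of how one suite can be reproduced (the model id is a hypothetical placeholder, and argument names vary across harness releases; this follows the 0.3.x-style API):
+
+ ```python
+ # Hedged sketch: reproduce the GPT4All table with lm-evaluation-harness.
+ # Assumptions: 0.3.x-style API; the model id is a placeholder.
+ from lm_eval import evaluator
+
+ results = evaluator.simple_evaluate(
+     model="hf-causal",                                 # HF causal-LM backend
+     model_args="pretrained=retiredcarboxyl/Chat2Eco",  # hypothetical model id
+     tasks=["arc_challenge", "arc_easy", "boolq", "hellaswag",
+            "openbookqa", "piqa", "winogrande"],
+     num_fewshot=0,
+     batch_size=8,
+ )
+ for task, metrics in results["results"].items():
+     print(task, metrics)  # acc / acc_norm / stderr, as tabulated above
+ ```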
+
+ # Benchmark Comparison Charts
+
+ ## GPT4All
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/HK6bSbMfxX_qzxReAcJH9.png)
+
+ ## AGI-Eval
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/bs3ZvvEACa5Gm4p1JBsZ4.png)
+
+ ## BigBench Reasoning Test
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/wcceowcVpI12UxliwkOja.png)