Text Generation
Safetensors
English
qwen2
davidhornshaw committed on
Commit f2f5d41
1 Parent(s): 4c65cb9
Files changed (1)
  1. README.md +6 -15
README.md CHANGED
@@ -98,9 +98,8 @@ Summary of results:
 
 Within standard error, there is no difference between the base and finetuned model on any general benchmark. This suggests there was no drop in performance for the chosen tasks due to finetuning.
 Benchmarks for logical and numerical reasoning are more mixed. Without standard error, the finetuned model generally outperforms the base model. However, this lies - often just about - within standard error.
-The finetuned model *does* outperform the base model even accounting for standard error with maximal conservative bias on **arithmetic_5da**.
-This is of interest, since this benchmarks a model's ability to add five digits - *the* most fundamental arithmetic operation, and in effect the most difficult of all addition benchmarks.
-Note subtraction appears generally harder for both the finetuned and base models, even as the finetuned model performs better.
+The finetuned model *does* outperform the base model even accounting for standard error with maximal conservative bias on [**arithmetic_5da**](https://arxiv.org/abs/2005.14165).
+This is of interest, since this benchmarks a model's ability to add five digits, and addition is *the* fundamental arithmetic operation. Subtraction appears generally harder for both the finetuned and base models, even as the finetuned model performs better.
 We highlight the relevant rows for five-digit addition and subtraction for easy comparison.
 
 Evaluation results:
@@ -113,13 +112,10 @@ Evaluation results:
 |----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |hellaswag | 1 | none | 0 |acc |↑ | 0.5492|± | 0.0050|
 | | | none | 0 |acc_norm|↑ | 0.7353|± | 0.0044|
-|----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |winogrande| 1 | none | 0 |acc |↑ | 0.6851|± | 0.0131|
-|----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |anli_r1 | 1 | none | 0 |acc |↑ | 0.4670|± | 0.0158|
 |anli_r2 | 1 | none | 0 |acc |↑ | 0.4440|± | 0.0157|
 |anli_r3 | 1 | none | 0 |acc |↑ | 0.4467|± | 0.0144|
-|----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |blimp | 2 | none | 0 |acc |↑ | 0.7250|± | 0.0016|
 <figcaption>Collected GENERAL benchmark results for the base model.</figcaption>
 </figure>
@@ -136,9 +132,8 @@ Evaluation results:
 |arithmetic_3ds| 1 | none | 0 |acc |↑ | 0.0055|± | 0.0017|
 |arithmetic_4da| 1 | none | 0 |acc |↑ | 0.0675|± | 0.0056|
 |arithmetic_4ds| 1 | none | 0 |acc |↑ | 0.0010|± | 0.0007|
-**|arithmetic_5da| 1 | none | 0 |acc |↑ | 0.3720 | 0.0108|**
-**|arithmetic_5ds| 1 | none | 0 |acc |↑ | 0.0260 | 0.0036|**
-|--------------|--------:|--------|-------:|--------|---|------:|---|-------:|
+|**arithmetic_5da**| 1 | none | 0 |acc |↑ | **0.3720**|± | **0.0108**|
+|**arithmetic_5ds**| 1 | none | 0 |acc |↑ | **0.0260**|± | **0.0036**|
 |asdiv | 1 | none | 0 |acc |↑ | 0.0187|± | 0.0028|
 <figcaption>Collected USECASE benchmark results for the base model.</figcaption>
 
@@ -152,13 +147,10 @@ Evaluation results:
 |----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |hellaswag | 1 | none | 0 |acc |↑ | 0.5490|± | 0.0050|
 | | | none | 0 |acc_norm|↑ | 0.7358|± | 0.0044|
-|----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |winogrande| 1 | none | 0 |acc |↑ | 0.6827|± | 0.0131|
-|----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |anli_r1 | 1 | none | 0 |acc |↑ | 0.4660|± | 0.0158|
 |anli_r2 | 1 | none | 0 |acc |↑ | 0.4380|± | 0.0157|
 |anli_r3 | 1 | none | 0 |acc |↑ | 0.4408|± | 0.0143|
-|----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |blimp | 2 | none | 0 |acc |↑ | 0.7253|± | 0.0016|
 <figcaption>Collected GENERAL benchmark results for the finetuned model.</figcaption>
 </figure>
@@ -175,9 +167,8 @@ Evaluation results:
 |arithmetic_3ds| 1 | none | 0 |acc |↑ | 0.0055|± | 0.0017|
 |arithmetic_4da| 1 | none | 0 |acc |↑ | 0.0710|± | 0.0057|
 |arithmetic_4ds| 1 | none | 0 |acc |↑ | 0.0005|± | 0.0005|
-**|arithmetic_5da| 1 | none | 0 |acc |↑ | 0.4005 | 0.0110|**
-**|arithmetic_5ds| 1 | none | 0 |acc |↑ | 0.0285 | 0.0037|**
-|--------------|--------:|--------|-------:|--------|---|------:|---|-------:|
+|**arithmetic_5da**| 1 | none | 0 |acc |↑ | **0.4005**|± | **0.0110**|
+|**arithmetic_5ds**| 1 | none | 0 |acc |↑ | **0.0285**|± | **0.0037**|
 |asdiv | 1 | none | 0 |acc |↑ | 0.0204|± | 0.0029|
 <figcaption>Collected USECASE benchmark results for the finetuned model.</figcaption>
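The "maximal conservative bias" comparison used above can be sketched as follows: a difference counts only when the finetuned model's lower error bound still exceeds the base model's upper error bound. This is a minimal illustration using the values from the highlighted rows; the helper name is ours, not part of the repository.

```python
def outperforms_conservatively(base_acc: float, base_se: float,
                               ft_acc: float, ft_se: float) -> bool:
    """True only if the finetuned lower bound (acc - se) exceeds the
    base upper bound (acc + se), i.e. under maximal conservative bias."""
    return ft_acc - ft_se > base_acc + base_se

# arithmetic_5da: 0.4005 - 0.0110 = 0.3895 > 0.3828 = 0.3720 + 0.0108
print(outperforms_conservatively(0.3720, 0.0108, 0.4005, 0.0110))  # True

# arithmetic_5ds: the gap stays within the combined standard errors
print(outperforms_conservatively(0.0260, 0.0036, 0.0285, 0.0037))  # False
```

Only five-digit addition clears this conservative bar; five-digit subtraction improves nominally but remains within standard error, matching the summary above.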