davidhornshaw committed f2f5d41 (parent: 4c65cb9): cosmetics

README.md CHANGED
@@ -98,9 +98,8 @@ Summary of results:
 
 Within standard error, there is no difference between base and finetuned model on any general benchmark. This suggests there was no drop in performance for the chosen tasks due to finetuning.
 Benchmarks for logical and numerical reasoning are more mixed. Without standard error, the finetuned model generally outperforms the base model. However, this lies - often just about - within standard error.
-The finetuned model *does* outperform the base model even accounting for standard error with maximal conservative bias on **arithmetic_5da**.
-This is of interest, since this benchmarks a model's ability to add five digits.
-Note subtraction appears generally harder for both the finetuned and base models, even as the finetuned model performs better.
+The finetuned model *does* outperform the base model even accounting for standard error with maximal conservative bias on [**arithmetic_5da**](https://arxiv.org/abs/2005.14165).
+This is of interest, since this benchmarks a model's ability to add five digits, and addition is *the* fundamental arithmetic operation. Subtraction appears generally harder for both the finetuned and base models, even as the finetuned model performs better.
 We highlight the relevant rows for five-digit addition and subtraction for easy comparison.
 
 Evaluation results:
@@ -113,13 +112,10 @@ Evaluation results:
 |----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |hellaswag | 1 | none | 0 |acc |↑ | 0.5492|± | 0.0050|
 | | | none | 0 |acc_norm|↑ | 0.7353|± | 0.0044|
-|----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |winogrande| 1 | none | 0 |acc |↑ | 0.6851|± | 0.0131|
-|----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |anli_r1 | 1 | none | 0 |acc |↑ | 0.4670|± | 0.0158|
 |anli_r2 | 1 | none | 0 |acc |↑ | 0.4440|± | 0.0157|
 |anli_r3 | 1 | none | 0 |acc |↑ | 0.4467|± | 0.0144|
-|----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |blimp | 2 | none | 0 |acc |↑ | 0.7250|± | 0.0016|
 <figcaption>Collect GENERAL benchmarks results for the base model.</figcaption>
 </figure>
@@ -136,9 +132,8 @@ Evaluation results:
 |arithmetic_3ds| 1 | none | 0 |acc |↑ | 0.0055|± | 0.0017|
 |arithmetic_4da| 1 | none | 0 |acc |↑ | 0.0675|± | 0.0056|
 |arithmetic_4ds| 1 | none | 0 |acc |↑ | 0.0010|± | 0.0007|
-**|
-**|
-|--------------|--------:|--------|-------:|--------|---|------:|---|-------:|
+|**arithmetic_5da**| 1 | none | 0 |acc |↑ | **0.3720**|± | **0.0108**|
+|**arithmetic_5ds**| 1 | none | 0 |acc |↑ | **0.0260**|± | **0.0036**|
 |asdiv | 1 | none | 0 |acc |↑ | 0.0187|± | 0.0028|
 <figcaption>Collected USECASE benchmarks results for the base model.</figcaption>
 
@@ -152,13 +147,10 @@ Evaluation results:
 |----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |hellaswag | 1 | none | 0 |acc |↑ | 0.5490|± | 0.0050|
 | | | none | 0 |acc_norm|↑ | 0.7358|± | 0.0044|
-|----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |winogrande| 1 | none | 0 |acc |↑ | 0.6827|± | 0.0131|
-|----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |anli_r1 | 1 | none | 0 |acc |↑ | 0.4660|± | 0.0158|
 |anli_r2 | 1 | none | 0 |acc |↑ | 0.4380|± | 0.0157|
 |anli_r3 | 1 | none | 0 |acc |↑ | 0.4408|± | 0.0143|
-|----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |blimp | 2 | none | 0 |acc |↑ | 0.7253|± | 0.0016|
 <figcaption>Collect GENERAL benchmarks results for the finetuned model.</figcaption>
 </figure>
@@ -175,9 +167,8 @@ Evaluation results:
 |arithmetic_3ds| 1 | none | 0 |acc |↑ | 0.0055|± | 0.0017|
 |arithmetic_4da| 1 | none | 0 |acc |↑ | 0.0710|± | 0.0057|
 |arithmetic_4ds| 1 | none | 0 |acc |↑ | 0.0005|± | 0.0005|
-**|
-**|
-|--------------|--------:|--------|-------:|--------|---|------:|---|-------:|
+|**arithmetic_5da**| 1 | none | 0 |acc |↑ | **0.4005**|± | **0.0110**|
+|**arithmetic_5ds**| 1 | none | 0 |acc |↑ | **0.0285**|± | **0.0037**|
 |asdiv | 1 | none | 0 |acc |↑ | 0.0204|± | 0.0029|
 <figcaption>Collected USECASE benchmarks results for the finetuned model.</figcaption>
 
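The "maximal conservative bias" claim for **arithmetic_5da** can be checked directly from the highlighted rows. A minimal sketch in plain Python, assuming the conservative reading is: shift the base score *up* by its full standard error and the finetuned score *down* by its full standard error, and require the finetuned model to still win (the helper name `beats_conservatively` is illustrative, not from the source):

```python
# Accuracy and standard error, taken from the USECASE tables in the diff above.
# Format: (base acc, base stderr, finetuned acc, finetuned stderr)
benchmarks = {
    "arithmetic_5da": (0.3720, 0.0108, 0.4005, 0.0110),
    "arithmetic_5ds": (0.0260, 0.0036, 0.0285, 0.0037),
}

def beats_conservatively(base_acc, base_err, ft_acc, ft_err):
    """Worst case for the finetuned model: base shifted up by one full
    standard error, finetuned shifted down by one full standard error."""
    return ft_acc - ft_err > base_acc + base_err

for name, (b, be, f, fe) in benchmarks.items():
    print(name, beats_conservatively(b, be, f, fe))
# arithmetic_5da: 0.4005 - 0.0110 = 0.3895 > 0.3720 + 0.0108 = 0.3828 -> True
# arithmetic_5ds: 0.0285 - 0.0037 = 0.0248 < 0.0260 + 0.0036 = 0.0296 -> False
```

Under this reading only five-digit addition clears the bar; five-digit subtraction improves but stays within overlapping error bars, matching the prose above.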