davidhornshaw committed · 705b6df (parent: 0b59b09) · Update README.md
We used the trl [ORPO trainer](https://huggingface.co/docs/trl/main/en/orpo_trainer).

- **Training regime:** fp16 non-mixed precision

# Evaluation

We evaluate the base and finetuned models on four general benchmarks and two usecase-specific ones, working with the EleutherAI evaluation harness (`lm-evaluation-harness`); a sketch of such a harness call follows the benchmark list below.
Our usecase is logical and numerical reasoning.

## Benchmarks used:

1. GENERAL A: Commonsense natural language reasoning.
1.1 HellaSwag
3.1 Arithmetic
3.2 ASDiv
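For orientation, here is a minimal sketch of how such a run could look with the harness's Python API; the model identifier, dtype, and exact task list are illustrative assumptions, not our precise configuration.

```python
# Minimal sketch of an evaluation run with the EleutherAI
# lm-evaluation-harness (v0.4+ Python API). The model identifier and
# task list are illustrative placeholders, not the exact setup used.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=your-org/your-finetuned-model,dtype=float16",
    tasks=["hellaswag", "arithmetic_5da", "arithmetic_5ds", "asdiv"],
    num_fewshot=0,  # matches the 0-shot column in the tables below
)

# Each task reports an accuracy and its standard error, as in the tables.
for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none"), metrics.get("acc_stderr,none"))
```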
## Summary of results:
Within standard error, there is no difference between the base and finetuned models on any general benchmark. This suggests that finetuning caused no drop in performance on the chosen tasks.
Results on the benchmarks for logical and numerical reasoning are more mixed. Ignoring standard error, the finetuned model generally outperforms the base model; however, the difference lies - often only just - within standard error.
The finetuned model *does* outperform the base model on [**arithmetic_5da**](https://arxiv.org/abs/2005.14165), even when standard error is applied with maximally conservative bias, i.e. subtracting it from the finetuned score while adding it to the base score.
This is of interest, since this benchmark tests a model's ability to add five-digit numbers, and addition is *the* fundamental arithmetic operation. Subtraction appears generally harder for both the finetuned and base models, even as the finetuned model performs better.
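To make the conservative check explicit, the following small sketch states the comparison rule; the accuracy and standard-error values in the example are hypothetical placeholders, not our reported numbers.

```python
# Sketch of the maximally conservative comparison described above:
# the finetuned model only counts as better if its worst-case score
# (acc minus stderr) still beats the base model's best-case score
# (acc plus stderr).
def conservatively_better(ft_acc, ft_se, base_acc, base_se):
    return (ft_acc - ft_se) > (base_acc + base_se)

# Hypothetical placeholder values, not the reported results:
print(conservatively_better(ft_acc=0.40, ft_se=0.02,
                            base_acc=0.30, base_se=0.02))  # True
```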
We highlight the relevant rows for five-digit addition and subtraction for easy comparison. Moreover, we give a visualisation of the performance gain *without standard error* for all usecase-specific benchmarks at the end of the model card.

## Evaluation results:
**BASE**

|asdiv | 1 | none | 0 |acc |↑ | 0.0204|± | 0.0029|
<figcaption>Collected USECASE benchmark results for the finetuned model.</figcaption>

</figure>
**VISUALISATION OF FIVE DIGIT ADVANTAGE**
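A chart of this kind could be produced along the following lines; this is a minimal sketch in which all accuracy values are hypothetical placeholders, whereas the actual figure is built from the scores in the tables above.

```python
# Sketch of a bar chart visualising the finetuned model's gain over
# the base model, without standard error, on the usecase benchmarks.
# All accuracy values below are hypothetical placeholders.
import matplotlib.pyplot as plt

tasks = ["arithmetic_5da", "arithmetic_5ds", "asdiv"]
base_acc = [0.10, 0.05, 0.01]       # placeholder values
finetuned_acc = [0.30, 0.12, 0.02]  # placeholder values
gain = [f - b for f, b in zip(finetuned_acc, base_acc)]

plt.bar(tasks, gain)
plt.ylabel("Accuracy gain (finetuned - base)")
plt.title("Five-digit arithmetic advantage (no standard error)")
plt.tight_layout()
plt.savefig("five_digit_advantage.png")
```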