Text Generation · Safetensors · English · qwen2
davidhornshaw committed on
Commit 705b6df
1 Parent(s): 0b59b09

Update README.md

Files changed (1): README.md +9 -6
README.md CHANGED
@@ -75,12 +75,12 @@ We used the trl [ORPO trainer](https://huggingface.co/docs/trl/main/en/orpo_trainer)

  - **Training regime:** fp16 non-mixed precision

- ## Evaluation
+ # Evaluation

  We evaluate the base and finetuned models on four general benchmarks and two usecase-specific ones. We work with the EleutherAI evaluation harness.
  Our usecase is logical and numerical reasoning.

- Benchmarks used:
+ ## Benchmarks used:

  1. GENERAL A: Commonsense natural language reasoning.
  1.1 HellaSwag
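For concreteness, here is a minimal sketch of how such a run can be reproduced with EleutherAI's lm-evaluation-harness, assuming its v0.4+ Python API; the model path is a placeholder, not the actual repository, and `dtype=float16` mirrors the fp16 training regime:

```python
# Hedged sketch: run the named benchmarks with lm-evaluation-harness.
# "your-org/finetuned-model" is a placeholder, not the actual repository.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/finetuned-model,dtype=float16",
    tasks=["hellaswag", "arithmetic_5da", "arithmetic_5ds", "asdiv"],
    num_fewshot=0,  # matches the 0-shot setting shown in the result tables
)

# Each task reports a metric value plus a standard error, as in the tables below.
for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none"), metrics.get("acc_stderr,none"))
```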
@@ -94,15 +94,15 @@ Benchmarks used:
  3.1 Arithmetic
  3.2 ASDiv

- Summary of results:
+ ## Summary of results:

  Within standard error, there is no difference between the base and finetuned models on any general benchmark. This suggests that finetuning caused no drop in performance on the chosen tasks.
  Benchmarks for logical and numerical reasoning are more mixed. Ignoring standard error, the finetuned model generally outperforms the base model; however, the difference lies - often only just - within standard error.
  The finetuned model *does* outperform the base model, even accounting for standard error with maximal conservative bias, on [**arithmetic_5da**](https://arxiv.org/abs/2005.14165).
  This is of interest, since this benchmark tests a model's ability to add five-digit numbers, and addition is *the* fundamental arithmetic operation. Subtraction appears generally harder for both the finetuned and base models, even as the finetuned model performs better.
- We highlight the relevant rows for five-digit addition and subtraction for easy comparison.
+ We highlight the relevant rows for five-digit addition and subtraction for easy comparison. Moreover, we give a visualisation of the performance gain *without standard error* for all usecase-specific benchmarks at the end of the model card.

- Evaluation results:
+ ## Evaluation results:

  **BASE**
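The comparison rule described above ("accounting for standard error with maximal conservative bias") can be read as shifting each score against the finetuned model by its own standard error; a minimal sketch under that reading, with placeholder numbers:

```python
# Hedged sketch of the conservative comparison: the finetuned score is
# lowered by its standard error, the base score is raised by its standard
# error, and a win is declared only if the gap survives.
def conservative_win(acc_ft: float, se_ft: float,
                     acc_base: float, se_base: float) -> bool:
    return (acc_ft - se_ft) > (acc_base + se_base)

# Placeholder values, not the reported results:
print(conservative_win(acc_ft=0.30, se_ft=0.01, acc_base=0.20, se_base=0.01))
```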
 
@@ -172,4 +172,7 @@ Evaluation results:
  |asdiv | 1 | none | 0 |acc |↑ | 0.0204|± | 0.0029|
  <figcaption>Collected USECASE benchmark results for the finetuned model.</figcaption>

- </figure>
+ </figure>
+
+ **VISUALISATION OF FIVE DIGIT ADVANTAGE**
+
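The visualisation itself is an image on the rendered model card; as an illustration only, here is a minimal matplotlib sketch of such a gain plot, where all numbers are placeholders rather than the reported results:

```python
# Hedged sketch: bar plot of per-benchmark accuracy gain
# (finetuned minus base, ignoring standard error).
import matplotlib.pyplot as plt

tasks = ["arithmetic_5da", "arithmetic_5ds", "asdiv"]
gains = [0.10, 0.05, 0.002]  # placeholders, not the reported results

plt.bar(tasks, gains)
plt.ylabel("accuracy gain (finetuned - base)")
plt.title("Usecase benchmarks: gain without standard error")
plt.tight_layout()
plt.show()
```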
 
 
 
 