davidhornshaw committed f2f5d41 (parent: 4c65cb9): cosmetics

README.md CHANGED
@@ -98,9 +98,8 @@ Summary of results:
 
 Within standard error, there is no difference between base and finetuned model on any general benchmark. This suggests there was no drop in performance for the chosen tasks due to finetuning.
 Benchmarks for logical and numerical reasoning are more mixed. Without standard error, the finetuned model generally outperforms the base model. However, this lies - often just about - within standard error.
-The finetuned model *does* outperform the base model even accounting for standard error with maximal conservative bias on **arithmetic_5da**.
-This is of interest, since this benchmarks a model's ability to add five digits.
-Note subtraction appears generally harder for both the finetuned and base models, even as the finetuned model performs better.
+The finetuned model *does* outperform the base model even accounting for standard error with maximal conservative bias on [**arithmetic_5da**](https://arxiv.org/abs/2005.14165).
+This is of interest, since this benchmarks a model's ability to add five digits, and addition is *the* fundamental arithmetic operation. Subtraction appears generally harder for both the finetuned and base models, even as the finetuned model performs better.
 We highlight the relevant rows for five-digit addition and subtraction for easy comparison.
 
 Evaluation results:
@@ -113,13 +112,10 @@ Evaluation results:
 |----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |hellaswag | 1 | none | 0 |acc |↑ | 0.5492|± | 0.0050|
 | | | none | 0 |acc_norm|↑ | 0.7353|± | 0.0044|
-|----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |winogrande| 1 | none | 0 |acc |↑ | 0.6851|± | 0.0131|
-|----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |anli_r1 | 1 | none | 0 |acc |↑ | 0.4670|± | 0.0158|
 |anli_r2 | 1 | none | 0 |acc |↑ | 0.4440|± | 0.0157|
 |anli_r3 | 1 | none | 0 |acc |↑ | 0.4467|± | 0.0144|
-|----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |blimp | 2 | none | 0 |acc |↑ | 0.7250|± | 0.0016|
 <figcaption>Collect GENERAL benchmarks results for the base model.</figcaption>
 </figure>
@@ -136,9 +132,8 @@ Evaluation results:
 |arithmetic_3ds| 1 | none | 0 |acc |↑ | 0.0055|± | 0.0017|
 |arithmetic_4da| 1 | none | 0 |acc |↑ | 0.0675|± | 0.0056|
 |arithmetic_4ds| 1 | none | 0 |acc |↑ | 0.0010|± | 0.0007|
-**|
-**|
-|--------------|--------:|--------|-------:|--------|---|------:|---|-------:|
+|**arithmetic_5da**| 1 | none | 0 |acc |↑ | **0.3720**|± | **0.0108**|
+|**arithmetic_5ds**| 1 | none | 0 |acc |↑ | **0.0260**|± | **0.0036**|
 |asdiv | 1 | none | 0 |acc |↑ | 0.0187|± | 0.0028|
 <figcaption>Collected USECASE benchmarks results for the base model.</figcaption>
 
@@ -152,13 +147,10 @@ Evaluation results:
 |----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |hellaswag | 1 | none | 0 |acc |↑ | 0.5490|± | 0.0050|
 | | | none | 0 |acc_norm|↑ | 0.7358|± | 0.0044|
-|----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |winogrande| 1 | none | 0 |acc |↑ | 0.6827|± | 0.0131|
-|----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |anli_r1 | 1 | none | 0 |acc |↑ | 0.4660|± | 0.0158|
 |anli_r2 | 1 | none | 0 |acc |↑ | 0.4380|± | 0.0157|
 |anli_r3 | 1 | none | 0 |acc |↑ | 0.4408|± | 0.0143|
-|----------|--------:|--------|-------:|--------|---|------:|---|-------:|
 |blimp | 2 | none | 0 |acc |↑ | 0.7253|± | 0.0016|
 <figcaption>Collect GENERAL benchmarks results for the finetuned model.</figcaption>
 </figure>
@@ -175,9 +167,8 @@ Evaluation results:
 |arithmetic_3ds| 1 | none | 0 |acc |↑ | 0.0055|± | 0.0017|
 |arithmetic_4da| 1 | none | 0 |acc |↑ | 0.0710|± | 0.0057|
 |arithmetic_4ds| 1 | none | 0 |acc |↑ | 0.0005|± | 0.0005|
-**|
-**|
-|--------------|--------:|--------|-------:|--------|---|------:|---|-------:|
+|**arithmetic_5da**| 1 | none | 0 |acc |↑ | **0.4005**|± | **0.0110**|
+|**arithmetic_5ds**| 1 | none | 0 |acc |↑ | **0.0285**|± | **0.0037**|
 |asdiv | 1 | none | 0 |acc |↑ | 0.0204|± | 0.0029|
 <figcaption>Collected USECASE benchmarks results for the finetuned model.</figcaption>
 
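The "maximal conservative bias" claim for **arithmetic_5da** can be checked directly from the highlighted rows. A minimal sketch in plain Python, assuming the conservative reading is: shift the base score *up* by its full standard error and the finetuned score *down* by its full standard error, and require the finetuned model to still win (the helper name `beats_conservatively` is illustrative, not from the source):

```python
# Accuracy and standard error, taken from the USECASE tables in the diff above.
# Format: (base acc, base stderr, finetuned acc, finetuned stderr)
benchmarks = {
    "arithmetic_5da": (0.3720, 0.0108, 0.4005, 0.0110),
    "arithmetic_5ds": (0.0260, 0.0036, 0.0285, 0.0037),
}

def beats_conservatively(base_acc, base_err, ft_acc, ft_err):
    """Worst case for the finetuned model: base shifted up by one full
    standard error, finetuned shifted down by one full standard error."""
    return ft_acc - ft_err > base_acc + base_err

for name, (b, be, f, fe) in benchmarks.items():
    print(name, beats_conservatively(b, be, f, fe))
# arithmetic_5da: 0.4005 - 0.0110 = 0.3895 > 0.3720 + 0.0108 = 0.3828 -> True
# arithmetic_5ds: 0.0285 - 0.0037 = 0.0248 < 0.0260 + 0.0036 = 0.0296 -> False
```

Under this reading only five-digit addition clears the bar; five-digit subtraction improves but stays within overlapping error bars, matching the prose above.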