---
datasets:
...
library_name: transformers
tags:
- supertrainer2000
- human-data
metrics:
- accuracy
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64137e2150358a805203cbac/DlTWku8gant1yx6NaxqJX.png)

The format for reddit-instruct and oasst2 was:
```
...
```

The format for TinyCoT was:
```
### User:
[insert instruction here]
...
[insert direct answer here]
```
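For concreteness, the template above can also be handled programmatically. The sketch below is illustrative only, not code from the training pipeline: the helper names are invented here, and since this card elides the template sections between the instruction and the final answer, the parser simply splits on the generic `### Header:` markers (the `Answer` header in the usage example is a hypothetical stand-in):

```python
def build_user_turn(instruction: str) -> str:
    """Render the "### User:" turn of the TinyCoT-style template shown above."""
    return f"### User:\n{instruction}\n"


def split_sections(transcript: str) -> dict:
    """Split a "### Header:"-delimited transcript into {header: body} pairs.

    Generic parser for the header style shown in the template; the exact
    headers the model emits after the user turn are elided in this card.
    """
    sections = {}
    current = None
    for line in transcript.splitlines():
        if line.startswith("### ") and line.rstrip().endswith(":"):
            current = line[4:].rstrip().rstrip(":")
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {k: "\n".join(v).strip() for k, v in sections.items()}


# Hypothetical usage: build a prompt, then parse a completed transcript.
prompt = build_user_turn("What is 2 + 2?")
parsed = split_sections(prompt + "### Answer:\n4\n")
```

The parser is deliberately header-agnostic, so it works regardless of which intermediate sections (rationale, reasoning steps, etc.) the full template contains.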
## Benchmarks

| Model | Size | Data | Method | GSM8K (5-shot) | AGIEval (English/Nous subset, acc_norm) |
|:------|:-----|:-----|:-------|:---------------|:----------------------------------------|
| [StableLM 3B Base](https://hf.co/stabilityai/stablelm-3b-4e1t) | 3B | Base | Base | 2.05% | 25.14% |
| [StableHermes 3B](https://hf.co/cxllin/StableHermes-3b) | 3B | GPT | SFT | 3.64% | 24.31% |
| [MPT 7B Instruct](https://hf.co/mosaicml/mpt-7b-instruct) | **7B** | **Human+Anthropic** | SFT | 2.05% | 24.12% |
| [OpenLLaMA 7B v2 open-instruct](https://hf.co/VMware/open-llama-7b-v2-open-instruct) | **7B** | **Human** (nearly: ecqa is an exception) | SFT | 8.64% | 23.21% |
| [StableLM Zephyr 3B](https://hf.co/stabilityai/stablelm-zephyr-3b) | 3B | GPT | DPO | **45.72%** | **33.31%** |
| **[Memphis-CoT 3B](https://hf.co/euclaise/memphis-cot-3b)** | 3B | **Human** | Self-teaching | 13.8% | *26.24%* |

Memphis outperforms human-data models more than twice its size, as well as SFT models of its own size, but doesn't quite reach the performance of the Zephyr DPO model. That said, Zephyr was trained on synthetic data, and on *much* more of it.

Notes:
- Evaluations were performed using the `agieval` branch of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (commit `0bef5c9c273b1c2f68e6018d4bb9c32b9aaff298`), using the `vllm` model backend.
- I tried to find human-data-trained StableLM models for comparison, but couldn't find any. I did find a few OpenLLaMA models, but they wouldn't load with LM Eval Harness and vllm.
- OpenLLaMA 7B v2 open-instruct is a particularly relevant comparison, as it was trained on a *very* similar dataset.
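The GSM8K column reports 5-shot exact-match accuracy on the final numeric answer. As a rough sketch of what that metric measures (this is *not* the lm-evaluation-harness extraction code, whose details differ):

```python
import re


def extract_final_number(text):
    """Pull the last number out of a model completion, or None if absent.

    Rough stand-in for GSM8K-style answer extraction; the harness's
    actual normalization and extraction logic is more involved.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None


def exact_match_accuracy(completions, gold_answers):
    """Fraction of completions whose final number matches the gold answer."""
    hits = sum(
        extract_final_number(c) == g for c, g in zip(completions, gold_answers)
    )
    return hits / len(gold_answers)
```

In the actual benchmark, gold answers follow GSM8K's `#### <number>` convention and the harness handles few-shot prompting; this sketch only conveys the scoring idea.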
## Hyperparameters

For the initial supervised finetuning step: