Commit cb86a0f (1 parent: c297e2a), committed by euclaise

Update README.md

Files changed (1):
1. README.md (+24 −1)
README.md CHANGED
@@ -7,6 +7,9 @@ datasets:
 library_name: transformers
 tags:
 - supertrainer2000
+- human-data
+metrics:
+- accuracy
 ---
 
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64137e2150358a805203cbac/DlTWku8gant1yx6NaxqJX.png)
@@ -43,7 +46,7 @@ The format for reddit-instruct and oasst2 was:
 ...
 ```
 
-The format for TinyCot was:
+The format for TinyCoT was:
 ```
 ### User:
 [insert instruction here]
@@ -53,6 +56,26 @@ The format for TinyCoT was:
 [insert direct answer here]
 ```
 
+## Benchmarks
+
+| Model | Size | Data | Method | GSM8K (5-shot) | AGIEval (English/Nous subset, acc_norm) |
+|:------|:-----|:-----|:-------|:---------------|:----------------------------------------|
+| [StableLM 3B Base](https://hf.co/stabilityai/stablelm-3b-4e1t) | 3B | Base | Base | 2.05% | 25.14% |
+| [StableHermes 3B](https://hf.co/cxllin/StableHermes-3b) | 3B | GPT | SFT | 3.64% | 24.31% |
+| [MPT 7B Instruct](https://hf.co/mosaicml/mpt-7b-instruct) | **7B** | **Human+Anthropic** | SFT | 2.05% | 24.12% |
+| [OpenLLaMA 7B v2 open-instruct](https://hf.co/VMware/open-llama-7b-v2-open-instruct) | **7B** | **Human** (nearly: ecqa is an exception) | SFT | 8.64% | 23.21% |
+| [StableLM Zephyr 3B](https://hf.co/stabilityai/stablelm-zephyr-3b) | 3B | GPT | DPO | **45.72%** | **33.31%** |
+| **[Memphis-CoT 3B](https://hf.co/euclaise/memphis-cot-3b)** | 3B | **Human** | Self-teaching | 13.8% | *26.24%* |
+
+
+Memphis outperforms human-data models more than twice its size, as well as SFT models of its own size, but doesn't quite reach the performance of the Zephyr DPO model. That said, Zephyr uses synthetic data, and *much* more of it.
+
+
+Notes:
+- Evaluations were performed using the `agieval` branch of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (commit `0bef5c9c273b1c2f68e6018d4bb9c32b9aaff298`), with the `vllm` model type.
+- I tried to find StableLM models trained on human data for comparison, but couldn't find any. I did find a few OpenLLaMA models, but they wouldn't load with LM Eval Harness and vllm.
+- OpenLLaMA 7B v2 open-instruct is a particularly relevant comparison, as it was trained on a *very* similar dataset.
+
 ## Hyperparameters
 
 For the initial supervised finetuning step:
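
For illustration, here is a minimal sketch of how a prompt in the TinyCoT-style format above might be assembled and run with `transformers`. Only the `### User:` header is visible in this diff; the rest of the template (whatever sits between the instruction and the direct answer) is not shown here, so the sketch simply lets the model continue from the user turn. The instruction text and generation settings are illustrative, not taken from the commit.

```python
# Minimal sketch: format an instruction in the "### User:" style shown above
# and generate a completion. Check the full README for the exact template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "euclaise/memphis-cot-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# trust_remote_code may be needed for the StableLM architecture on older
# transformers versions; harmless otherwise.
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", trust_remote_code=True
)

instruction = "A train travels 60 miles in 1.5 hours. What is its average speed?"
# "### User:" matches the header visible in the diff; the response header is
# elided there, so generation is continued directly from the user turn.
prompt = f"### User:\n{instruction}\n\n"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```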
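
The Notes above give the harness branch, commit hash, and the `vllm` model type, but not the exact invocation. Below is a rough sketch of how the GSM8K figure might be reproduced through the harness's Python API, assuming the `agieval` branch exposes the 0.4-style `simple_evaluate` interface; the task name and arguments are assumptions, not taken from the commit.

```python
# Rough reproduction sketch, assuming the 0.4-style lm-evaluation-harness API.
# Only the branch, commit hash, and vllm model type are stated in the Notes;
# the task names and any extra flags used for the README numbers are unknown.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",                                     # vllm backend, as stated in the Notes
    model_args="pretrained=euclaise/memphis-cot-3b",  # model under evaluation
    tasks=["gsm8k"],                                  # 5-shot GSM8K, per the table header
    num_fewshot=5,
)
print(results["results"]["gsm8k"])
```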