Update README.md
README.md CHANGED
@@ -8,29 +8,20 @@ datasets:
-**Note: A bug has been discovered in the Memphis-CoT training code, and the model is currently being re-trained. Please do not make quants or merges or anything else until I have retrained.**
-Memphis-CoT is a finetune of [StableLM 3b 4e1t](stabilityai/stablelm-3b-4e1t) on [TinyCoT](https://huggingface.co/datasets/euclaise/TinyCoT), along with [reddit-instruct](https://huggingface.co/datasets/euclaise/reddit-instruct) (subset to 5000 examples, excluding posts with brackets in the title) and a [curated](https://huggingface.co/datasets/sablo/oasst2_curated) subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2).
@@ -42,12 +33,13 @@ I finetuned the model using an iterative rationale-bootstrapping procedure inspi
-I then performed the following steps
-2. Finetune the model for 1 epoch using a ranking loss over length-normalized log-probabilities of each sequence, similar to [Preference Ranking Optimization](https://arxiv.org/abs/2306.17492), comparing the correct vs incorrect generated response.
@@ -81,7 +73,8 @@ The format for TinyCoT was:
-| [**Memphis-CoT 3B**](https://hf.co/euclaise/Memphis-CoT-3B) | 3B | **Human** | Self-teaching | **
@@ -121,7 +114,7 @@ For the rank finetuning:
-- Rank loss weight of
datasets:
- euclaise/TinyCoT
- euclaise/reddit-instruct
- sablo/oasst2_curated
- euclaise/SciCoT
metrics:
- accuracy
---

*Now with a training bug fixed!*

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64137e2150358a805203cbac/DlTWku8gant1yx6NaxqJX.png)

Memphis-CoT is a finetune of [StableLM 3b 4e1t](https://huggingface.co/stabilityai/stablelm-3b-4e1t) on [TinyCoT](https://huggingface.co/datasets/euclaise/TinyCoT) and [SciCoT](https://huggingface.co/datasets/euclaise/SciCoT), along with [reddit-instruct](https://huggingface.co/datasets/euclaise/reddit-instruct) (subset to 5000 examples, excluding posts with brackets in the title) and a [curated](https://huggingface.co/datasets/sablo/oasst2_curated) subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2).

**Memphis was trained *only* on human data! No GPT generations here.**

First, I finetuned the model on all the datasets using a [MixCE](https://arxiv.org/abs/2305.16958) loss and [NEFTune](https://arxiv.org/abs/2310.05914), for 2 epochs.
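As an illustration of the NEFTune part of that step: NEFTune simply adds scaled uniform noise to the input embeddings during finetuning (MixCE, by contrast, modifies the loss itself by mixing forward and reverse cross-entropy). The sketch below is not the actual training code, which used supertrainer2000; the `alpha` value and function names are assumptions.

```python
import torch

def neftune_noise(embed_out: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add NEFTune noise to the output of the input-embedding layer during training.

    embed_out: (batch, seq_len, hidden) embeddings; alpha: NEFTune noise scale (illustrative value).
    """
    _, seq_len, hidden = embed_out.shape
    # Uniform noise in [-1, 1], scaled by alpha / sqrt(seq_len * hidden), as in the NEFTune paper
    scale = alpha / (seq_len * hidden) ** 0.5
    return embed_out + torch.empty_like(embed_out).uniform_(-1.0, 1.0) * scale

# One way to apply it: a forward hook on the embedding layer, active only in training mode.
# handle = model.get_input_embeddings().register_forward_hook(
#     lambda mod, inp, out: neftune_noise(out) if mod.training else out
# )
```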
I then performed the following steps 3 times:
1. Generate responses for each question in TinyCoT using the current model, check each response for correctness, and create a dataset of (correct, incorrect) pairs. Extra values are discarded, such that each correct and incorrect response is unique.
2. Finetune the model for 1 epoch using a ranking loss over length-normalized log-probabilities of each sequence, similar to [Preference Ranking Optimization](https://arxiv.org/abs/2306.17492), comparing the correct vs incorrect generated response (see the sketch below). Additionally, a standard CE loss over the chosen completion was included.

This should be more efficient than either STaR or SPIN, as it uses a ranking loss rather than rejection sampling (unlike STaR), and verifies correctness instead of assuming all model responses are incorrect (unlike SPIN).
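For concreteness, a minimal sketch of the ranking step from item 2 above: each sequence is scored by its length-normalized log-probability, the verified-correct response is preferred via a pairwise ranking term (the two-candidate case of a PRO-style loss), and a standard CE term on the correct completion is added. The masking and batching details, and exactly how the 0.25 rank-loss weight from the hyperparameters below enters the sum, are assumptions rather than the real supertrainer2000 implementation.

```python
import torch
import torch.nn.functional as F

def seq_logprob(model, input_ids, completion_mask):
    """Length-normalized log-probability of the completion tokens (prompt tokens masked out)."""
    logits = model(input_ids).logits[:, :-1]              # logits predicting tokens 1..T-1
    targets = input_ids[:, 1:]
    mask = completion_mask[:, 1:].float()                  # 1 where the target token belongs to the response
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask).sum(-1) / mask.sum(-1)      # mean log-prob per response token

def rank_plus_ce_loss(model, chosen, chosen_mask, rejected, rejected_mask, rank_weight=0.25):
    s_chosen = seq_logprob(model, chosen, chosen_mask)
    s_rejected = seq_logprob(model, rejected, rejected_mask)
    rank_loss = -F.logsigmoid(s_chosen - s_rejected).mean()   # prefer the verified-correct response
    ce_loss = -s_chosen.mean()                                 # standard CE over the chosen completion
    return rank_weight * rank_loss + ce_loss
```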
To prevent excessive drift, I kept the model weights as a moving average: After each generate+train cycle, I interpolated between the previous model weights and the updated weights using spherical linear interpolation (SLERP), with an interpolation factor of 0.99.
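A sketch of that interpolation step, applied per parameter tensor over two checkpoints' state dicts. Whether the 0.99 factor weights the updated or the previous checkpoint is not spelled out above, so here `t` is simply the weight on the second argument; the fallback to linear interpolation for near-colinear tensors is also an assumption.

```python
import torch

def slerp_state_dicts(sd_a, sd_b, t=0.99):
    """Spherical linear interpolation between two checkpoints, one parameter tensor at a time.

    t is the weight on sd_b; falls back to plain lerp when the tensors are nearly colinear.
    """
    out = {}
    for name, a in sd_a.items():
        b = sd_b[name]
        va, vb = a.flatten().float(), b.flatten().float()
        cos = torch.dot(va, vb) / (va.norm() * vb.norm() + 1e-12)
        omega = torch.arccos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        sin_omega = torch.sin(omega)
        if sin_omega.abs() < 1e-6:
            mixed = (1 - t) * va + t * vb
        else:
            mixed = (torch.sin((1 - t) * omega) / sin_omega) * va + (torch.sin(t * omega) / sin_omega) * vb
        out[name] = mixed.reshape(a.shape).to(a.dtype)
    return out
```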
## Prompt formats
| [MPT 7B Instruct](https://hf.co/mosaicml/mpt-7b-instruct) | **7B** | **Human**+Anthropic | SFT | 2.05% | 24.12% | 11.01% |
| [OpenLLaMA 7B v2 open-instruct](http://hf.co/VMware/open-llama-7b-v2-open-instruct) | **7B** | **Human** (nearly: ecqa is an exception) | SFT | 8.64% | 23.21% | 29.84% |
| [StableLM Zephyr 3B](https://hf.co/stabilityai/stablelm-zephyr-3b) | 3B | GPT | DPO | possibly contaminated (45.72%) | **33.31%** | 0.91% |
| [**Memphis-CoT 3B**](https://hf.co/euclaise/Memphis-CoT-3B) | 3B | **Human** | Self-teaching | **18.8%** | *27.22%* | **36.92%** |

*5-shot, as performed automatically by LM Evaluation Harness `bbh_cot_fewshot`, even with `num_fewshot=0`

Memphis outperforms other primarily-human-data models that are over twice its size, along with SFT models of its size, and trades with the Zephyr DPO model. That said, Zephyr uses synthetic data, and *much* more of it.
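For reference, runs like the ones in the table can be reproduced with EleutherAI's LM Evaluation Harness; a sketch using its Python entry point is below (argument names assume a recent `lm-eval` release and may differ in older versions).

```python
# pip install lm-eval   (EleutherAI lm-evaluation-harness)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=euclaise/Memphis-CoT-3B,dtype=bfloat16",
    tasks=["bbh_cot_fewshot"],   # builds few-shot CoT prompts itself, per the footnote above
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```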
For the rank finetuning:

- Adalite optimizer, default hyperparameters of supertrainer2000 unless otherwise specified
- Lambda of 0.01
- LR of 5e-7
- Rank loss weight of 0.25
- Sequence length of 1024
- Cosine schedule with 10% warmup
- Frozen embeddings
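As a rough illustration of how the schedule and frozen embeddings above map onto a standard setup (the Adalite optimizer lives in supertrainer2000 and is not reproduced here; AdamW is only a stand-in, the Lambda value is not represented, and the step count is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-3b-4e1t",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # may be required depending on your transformers version
)

# Frozen embeddings
model.get_input_embeddings().weight.requires_grad_(False)

# Stand-in optimizer (the model card lists Adalite with LR 5e-7)
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(params, lr=5e-7)

# Cosine schedule with 10% warmup; total_steps is a placeholder
total_steps = 1000
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.10 * total_steps),
    num_training_steps=total_steps,
)
```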