Update README.md
README.md CHANGED
@@ -8,29 +8,20 @@ datasets:
-**Note: A bug has been discovered in the Memphis-CoT training code, and the model is currently being re-trained. Please do not make quants or merges or anything else until I have retrained.**
-Memphis-CoT is a finetune of [StableLM 3b 4e1t](stabilityai/stablelm-3b-4e1t) on [TinyCoT](https://huggingface.co/datasets/euclaise/TinyCoT), along with [reddit-instruct](https://huggingface.co/datasets/euclaise/reddit-instruct) (subset to 5000 examples, excluding posts with brackets in the title) and a [curated](https://huggingface.co/datasets/sablo/oasst2_curated) subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2).
@@ -42,12 +33,13 @@ I finetuned the model using an iterative rationale-bootstrapping procedure inspi
-I then performed the following steps
-2. Finetune the model for 1 epoch using a ranking loss over length-normalized log-probabilities of each sequence, similar to [Preference Ranking Optimization](https://arxiv.org/abs/2306.17492), comparing the correct vs incorrect generated response.
@@ -81,7 +73,8 @@ The format for TinyCoT was:
-| [**Memphis-CoT 3B**](https://hf.co/euclaise/Memphis-CoT-3B) | 3B | **Human** | Self-teaching | **
@@ -121,7 +114,7 @@ For the rank finetuning:
-- Rank loss weight of
datasets:
- euclaise/TinyCoT
- euclaise/reddit-instruct
- sablo/oasst2_curated
- euclaise/SciCoT
metrics:
- accuracy
---

*Now with a training bug fixed!*

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64137e2150358a805203cbac/DlTWku8gant1yx6NaxqJX.png)

Memphis-CoT is a finetune of [StableLM 3b 4e1t](https://huggingface.co/stabilityai/stablelm-3b-4e1t) on [TinyCoT](https://huggingface.co/datasets/euclaise/TinyCoT) and [SciCoT](https://huggingface.co/datasets/euclaise/SciCoT), along with [reddit-instruct](https://huggingface.co/datasets/euclaise/reddit-instruct) (subset to 5000 examples, excluding posts with brackets in the title) and a [curated](https://huggingface.co/datasets/sablo/oasst2_curated) subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2).

**Memphis was trained *only* on human data! No GPT generations here.**

First, I finetuned the model on all the datasets using a [MixCE](https://arxiv.org/abs/2305.16958) loss and [NEFTune](https://arxiv.org/abs/2310.05914), for 2 epochs.
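As an illustration of the NEFTune part of that step: NEFTune simply adds scaled uniform noise to the input embeddings during finetuning (MixCE, by contrast, modifies the loss itself by mixing forward and reverse cross-entropy). The sketch below is not the actual training code, which used supertrainer2000; the `alpha` value and function names are assumptions.

```python
import torch

def neftune_noise(embed_out: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add NEFTune noise to the output of the input-embedding layer during training.

    embed_out: (batch, seq_len, hidden) embeddings; alpha: NEFTune noise scale (illustrative value).
    """
    _, seq_len, hidden = embed_out.shape
    # Uniform noise in [-1, 1], scaled by alpha / sqrt(seq_len * hidden), as in the NEFTune paper
    scale = alpha / (seq_len * hidden) ** 0.5
    return embed_out + torch.empty_like(embed_out).uniform_(-1.0, 1.0) * scale

# One way to apply it: a forward hook on the embedding layer, active only in training mode.
# handle = model.get_input_embeddings().register_forward_hook(
#     lambda mod, inp, out: neftune_noise(out) if mod.training else out
# )
```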
I then performed the following steps 3 times:
1. Generate responses for each question in TinyCoT using the current model, check each response for correctness, and create a dataset of (correct, incorrect) pairs. Extra values are discarded, such that each correct and incorrect response is unique.
2. Finetune the model for 1 epoch using a ranking loss over length-normalized log-probabilities of each sequence, similar to [Preference Ranking Optimization](https://arxiv.org/abs/2306.17492), comparing the correct vs incorrect generated response (see the sketch below). Additionally, a standard CE loss over the chosen completion was included.

This should be more efficient than either STaR or SPIN, as it uses a ranking loss rather than rejection sampling (unlike STaR), and verifies correctness instead of assuming all model responses are incorrect (unlike SPIN).
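For concreteness, a minimal sketch of the ranking step from item 2 above: each sequence is scored by its length-normalized log-probability, the verified-correct response is preferred via a pairwise ranking term (the two-candidate case of a PRO-style loss), and a standard CE term on the correct completion is added. The masking and batching details, and exactly how the 0.25 rank-loss weight from the hyperparameters below enters the sum, are assumptions rather than the real supertrainer2000 implementation.

```python
import torch
import torch.nn.functional as F

def seq_logprob(model, input_ids, completion_mask):
    """Length-normalized log-probability of the completion tokens (prompt tokens masked out)."""
    logits = model(input_ids).logits[:, :-1]              # logits predicting tokens 1..T-1
    targets = input_ids[:, 1:]
    mask = completion_mask[:, 1:].float()                  # 1 where the target token belongs to the response
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask).sum(-1) / mask.sum(-1)      # mean log-prob per response token

def rank_plus_ce_loss(model, chosen, chosen_mask, rejected, rejected_mask, rank_weight=0.25):
    s_chosen = seq_logprob(model, chosen, chosen_mask)
    s_rejected = seq_logprob(model, rejected, rejected_mask)
    rank_loss = -F.logsigmoid(s_chosen - s_rejected).mean()   # prefer the verified-correct response
    ce_loss = -s_chosen.mean()                                 # standard CE over the chosen completion
    return rank_weight * rank_loss + ce_loss
```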
To prevent excessive drift, I kept the model weights as a moving average: After each generate+train cycle, I interpolated between the previous model weights and the updated weights using spherical linear interpolation (SLERP), with an interpolation factor of 0.99.
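A sketch of that interpolation step, applied per parameter tensor over two checkpoints' state dicts. Whether the 0.99 factor weights the updated or the previous checkpoint is not spelled out above, so here `t` is simply the weight on the second argument; the fallback to linear interpolation for near-colinear tensors is also an assumption.

```python
import torch

def slerp_state_dicts(sd_a, sd_b, t=0.99):
    """Spherical linear interpolation between two checkpoints, one parameter tensor at a time.

    t is the weight on sd_b; falls back to plain lerp when the tensors are nearly colinear.
    """
    out = {}
    for name, a in sd_a.items():
        b = sd_b[name]
        va, vb = a.flatten().float(), b.flatten().float()
        cos = torch.dot(va, vb) / (va.norm() * vb.norm() + 1e-12)
        omega = torch.arccos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        sin_omega = torch.sin(omega)
        if sin_omega.abs() < 1e-6:
            mixed = (1 - t) * va + t * vb
        else:
            mixed = (torch.sin((1 - t) * omega) / sin_omega) * va + (torch.sin(t * omega) / sin_omega) * vb
        out[name] = mixed.reshape(a.shape).to(a.dtype)
    return out
```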
## Prompt formats
| [MPT 7B Instruct](https://hf.co/mosaicml/mpt-7b-instruct) | **7B** | **Human**+Anthropic | SFT | 2.05% | 24.12% | 11.01% |
| [OpenLLaMA 7B v2 open-instruct](http://hf.co/VMware/open-llama-7b-v2-open-instruct) | **7B** | **Human** (nearly: ecqa is an exception) | SFT | 8.64% | 23.21% | 29.84% |
| [StableLM Zephyr 3B](https://hf.co/stabilityai/stablelm-zephyr-3b) | 3B | GPT | DPO | possibly contaminated (45.72%) | **33.31%** | 0.91% |
| [**Memphis-CoT 3B**](https://hf.co/euclaise/Memphis-CoT-3B) | 3B | **Human** | Self-teaching | **18.8%** | *27.22%* | **36.92%** |

*5-shot, as performed automatically by LM Evaluation Harness `bbh_cot_fewshot`, even with `num_fewshot=0`

Memphis outperforms other primarily-human-data models that are over twice its size, along with SFT models of its size, and trades with the Zephyr DPO model. That said, Zephyr uses synthetic data, and *much* more of it.
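For reference, runs like the ones in the table can be reproduced with EleutherAI's LM Evaluation Harness; a sketch using its Python entry point is below (argument names assume a recent `lm-eval` release and may differ in older versions).

```python
# pip install lm-eval   (EleutherAI lm-evaluation-harness)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=euclaise/Memphis-CoT-3B,dtype=bfloat16",
    tasks=["bbh_cot_fewshot"],   # builds few-shot CoT prompts itself, per the footnote above
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```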
For the rank finetuning:

- Adalite optimizer, default hyperparameters of supertrainer2000 unless otherwise specified
- Lambda of 0.01
- LR of 5e-7
- Rank loss weight of 0.25
- Sequence length of 1024
- Cosine schedule with 10% warmup
- Frozen embeddings
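As a rough illustration of how the schedule and frozen embeddings above map onto a standard setup (the Adalite optimizer lives in supertrainer2000 and is not reproduced here; AdamW is only a stand-in, the Lambda value is not represented, and the step count is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-3b-4e1t",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # may be required depending on your transformers version
)

# Frozen embeddings
model.get_input_embeddings().weight.requires_grad_(False)

# Stand-in optimizer (the model card lists Adalite with LR 5e-7)
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(params, lr=5e-7)

# Cosine schedule with 10% warmup; total_steps is a placeholder
total_steps = 1000
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.10 * total_steps),
    num_training_steps=total_steps,
)
```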