Memphis-scribe 3B alpha is a finetune of Memphis-CoT 3B (itself a finetune of StableLM 3B 4e1t) on more creative data.

It is trained further on TinyCoT, along with additional chat and creative-writing data.

Training procedure

I started from Memphis-CoT 3B, which used a novel iterative contrastive finetuning procedure to improve reasoning ability.

I finetuned it directly on this data, using a MixCE loss with a mixing ratio of 0.5.
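
This loss is not part of standard Transformers training loops, so for illustration here is a minimal PyTorch sketch of a MixCE-style objective. It assumes the common single-sample approximation in which the reverse cross-entropy term re-weights each token's negative log-likelihood by the (detached) probability the model assigns to the gold token; the function name and the detach are my own choices, and this is not the actual supertrainer2000 implementation used for training.

```python
import torch
import torch.nn.functional as F


def mixce_loss(logits, labels, mix_ratio=0.5, ignore_index=-100):
    """MixCE-style loss: mix_ratio * forward CE + (1 - mix_ratio) * approx. reverse CE.

    The reverse-CE term is approximated by weighting each gold token's
    negative log-likelihood with the model's own (detached) probability for
    that token, reinforcing tokens the model is already confident about.
    """
    # Standard causal-LM shift: positions < t predict token t
    logits = logits[:, :-1, :].contiguous()
    labels = labels[:, 1:].contiguous()

    log_probs = F.log_softmax(logits, dim=-1)
    mask = labels.ne(ignore_index)
    gold = labels.masked_fill(~mask, 0)

    # Per-token negative log-likelihood of the gold tokens
    nll = -log_probs.gather(-1, gold.unsqueeze(-1)).squeeze(-1)

    # Detached gold-token probability, used as the reverse-CE weight
    p_gold = nll.neg().exp().detach()

    # Per-token mixing of the forward and (approximate) reverse terms
    weights = mix_ratio + (1.0 - mix_ratio) * p_gold

    return (weights * nll * mask).sum() / mask.sum()
```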

Finetuning on top of finetunes this way tends to lead to catastrophic forgetting, and indeed I observed significant degradation of the resulting model on e.g. GSM8K.

A common strategy to prevent catastrophic forgetting is weight averaging. In the LM community, 'merges' also rely on weight averaging, and spherical linear interpolation (SLERP) is generally considered superior to linear averaging. Accordingly, I used SLERP to average the resulting model back with the original Memphis-CoT model.
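
In practice this kind of merge is usually done with a tool such as mergekit; the snippet below is only a minimal sketch of per-tensor SLERP between two checkpoints. The interpolation factor of 0.5 and the near-parallel fallback to linear interpolation are illustrative assumptions, not the exact merge configuration used here.

```python
import torch


def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float = 0.5) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors,
    treating each tensor as a single flat vector."""
    a, b = w_a.flatten().float(), w_b.flatten().float()

    # Angle between the two (normalised) weight vectors
    a_n = a / a.norm().clamp_min(1e-8)
    b_n = b / b.norm().clamp_min(1e-8)
    theta = torch.acos(torch.clamp(a_n.dot(b_n), -1.0, 1.0))

    if theta < 1e-4:
        # Nearly parallel vectors: fall back to linear interpolation
        merged = (1.0 - t) * a + t * b
    else:
        sin_theta = torch.sin(theta)
        merged = (torch.sin((1.0 - t) * theta) * a + torch.sin(t * theta) * b) / sin_theta

    return merged.view_as(w_a).to(w_a.dtype)


def slerp_state_dicts(sd_a, sd_b, t=0.5):
    """Merge two state dicts with matching keys/shapes, tensor by tensor."""
    return {name: slerp(sd_a[name], sd_b[name], t) for name in sd_a}
```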

This resulted in a model that learned from the new data without completely forgetting what it learned from the original Memphis-CoT training.

Benchmarks

This model performs significantly worse than Memphis-CoT on benchmarks, despite being better suited to chat and creative writing tasks. This is an expected tradeoff.

| Model | GSM8K (5-shot) | AGIEval (English/Nous subset, acc_norm) | BIG-Bench Hard (CoT, few-shot*) |
|---|---|---|---|
| StableLM 3B Base | 2.05% | 25.14% | 36.75% |
| Memphis-CoT 3B | 13.8% | 26.24% | 38.24% |
| Memphis-scribe 3B alpha | 12.28% | 23.92% | 38.1% |
*5-shot, as performed automatically by the LM Evaluation Harness `bbh_cot_fewshot` task, even with `num_fewshot=0`.
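
The numbers above should be roughly reproducible with the LM Evaluation Harness. The sketch below uses the harness's v0.4 Python API and only the GSM8K task as an example; the exact harness version, the task names for the AGIEval/BBH subsets, and the batch settings behind the table are assumptions on my part. `trust_remote_code=True` is needed because the StableLM-epoch architecture ships as custom modeling code.

```python
import lm_eval

# 5-shot GSM8K on this model (bf16). Other tasks from the table can be added
# to `tasks` once the matching task names for your harness version are known.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=euclaise/Memphis-scribe-3B-alpha,"
        "trust_remote_code=True,dtype=bfloat16"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
)

print(results["results"]["gsm8k"])
```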