Update README.md
README.md CHANGED
@@ -37,7 +37,7 @@ I directly finetuned it on these examples, using a MixCE loss with a mixing rati
 
 Finetuning on top of finetunes this way tends to lead to catastrophic forgetting - and indeed I observed significant degradation of the resultant model on e.g. GSM8K.
 
-A common strategy to prevent catastrophic forgetting is weight averaging. In the LM community, 'merges' also utilize weight averaging, and spherical linear interpolation (SLERP) is considered to be superior to linear averaging. Accordingly, I used
+A common strategy to prevent catastrophic forgetting is weight averaging. In the LM community, 'merges' also utilize weight averaging, and spherical linear interpolation (SLERP) is considered to be superior to linear averaging. Accordingly, I used SLERP to average the resultant model back with the original Memphis-CoT model.
 
 This resulted in a model that has learned from the new data, without completely forgetting what it has learned from the original Memphis-CoT training.
 
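For reference, a SLERP merge of the finetuned checkpoint back into the base model can be sketched roughly as below. This is a minimal hand-rolled illustration, not the exact merge pipeline behind this README: the model ID, the local checkpoint paths, and the interpolation ratio t=0.5 are assumptions.

```python
# Minimal sketch: SLERP weight averaging between a base model and a finetuned
# checkpoint. Model ID, paths, and t=0.5 are placeholders, not the author's
# actual configuration.
import torch
from transformers import AutoModelForCausalLM

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors, treated as flat vectors."""
    a, b = w0.flatten().float(), w1.flatten().float()
    a_n, b_n = a / (a.norm() + eps), b / (b.norm() + eps)
    omega = torch.acos(torch.clamp(torch.dot(a_n, b_n), -1.0, 1.0))
    if omega.abs() < eps:
        # Nearly parallel directions: fall back to plain linear interpolation.
        out = (1 - t) * a + t * b
    else:
        sin_omega = torch.sin(omega)
        out = (torch.sin((1 - t) * omega) / sin_omega) * a + (torch.sin(t * omega) / sin_omega) * b
    return out.reshape(w0.shape).to(w0.dtype)

# Placeholder identifiers: base Memphis-CoT model and the post-finetune checkpoint.
base = AutoModelForCausalLM.from_pretrained("euclaise/Memphis-CoT-3B")
tuned = AutoModelForCausalLM.from_pretrained("./memphis-cot-finetuned")

base_sd, tuned_sd = base.state_dict(), tuned.state_dict()
merged_sd = {
    # Interpolate floating-point weights; copy non-float buffers from the base model.
    name: slerp(w0, tuned_sd[name], t=0.5) if w0.is_floating_point() else w0
    for name, w0 in base_sd.items()
}

base.load_state_dict(merged_sd)
base.save_pretrained("./memphis-cot-merged")
```

Here t controls how far the merged weights move toward the new finetune; 0.5 is just the midpoint, and in practice the ratio (or a dedicated merge tool) would be tuned against benchmarks such as GSM8K to balance new learning against forgetting.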