euclaise committed
Commit
290b2ad
1 Parent(s): 8d52924

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -37,7 +37,7 @@ I directly finetuned it on these examples, using a MixCE loss with a mixing rati
 
 Finetuning on top of finetunes this way tends to lead to catastrophic forgetting - and indeed I observed significant degradation of the resultant model on e.g. GSM8K.
 
- A common strategy to prevent catastrophic forgetting is weight averaging. In the LM community, 'merges' also utilize weight averaging, and spherical linear interpolation (SLERP) is considered to be superior to linear averaging. Accordingly, I used spherical SLERP to average the resultant model back with the original Memphis-CoT model.
+ A common strategy to prevent catastrophic forgetting is weight averaging. In the LM community, 'merges' also utilize weight averaging, and spherical linear interpolation (SLERP) is considered to be superior to linear averaging. Accordingly, I used SLERP to average the resultant model back with the original Memphis-CoT model.
 
 This resulted in a model that has learned from the new data, without completely forgetting what it has learned from the original Memphis-CoT training.
 
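
For reference, the SLERP weight averaging described in the changed paragraph can be sketched as below. This is a minimal illustration, not the merge script behind this commit: the `slerp` and `slerp_state_dicts` helpers, the per-tensor flattening, and the 0.5 mixing factor are assumptions, and practical merges (e.g. with a tool like mergekit) handle dtypes, tied embeddings, and per-layer interpolation factors more carefully.

```python
# Minimal sketch of SLERP weight averaging between two checkpoints of the
# same architecture (hypothetical helper names; t=0.5 is an assumed default).
import torch

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors."""
    v0 = w0.flatten().float()
    v1 = w1.flatten().float()
    # Angle between the two flattened weight vectors.
    cos_omega = torch.dot(v0, v1) / (v0.norm() * v1.norm() + eps)
    omega = torch.acos(cos_omega.clamp(-1.0, 1.0))
    if omega.abs() < 1e-4:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        merged = (1.0 - t) * v0 + t * v1
    else:
        sin_omega = torch.sin(omega)
        merged = (torch.sin((1.0 - t) * omega) / sin_omega) * v0 \
               + (torch.sin(t * omega) / sin_omega) * v1
    return merged.reshape(w0.shape).to(w0.dtype)

def slerp_state_dicts(sd0: dict, sd1: dict, t: float = 0.5) -> dict:
    """Merge two state dicts parameter-by-parameter with SLERP."""
    return {name: slerp(sd0[name], sd1[name], t) for name in sd0}
```

Applied to the finetuned state dict and the original Memphis-CoT state dict, this interpolates each parameter tensor along the arc between the two weight vectors rather than along the straight line, which is the property that makes SLERP preferred over linear averaging in the merging community.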