Update README.md
README.md CHANGED
@@ -37,7 +37,7 @@ I directly finetuned it on these examples, using a MixCE loss with a mixing rati
 
 Finetuning on top of finetunes this way tends to lead to catastrophic forgetting - and indeed I observed significant degradation of the resultant model on e.g. GSM8K.
 
-A common strategy to prevent catastrophic forgetting is weight averaging. In the LM community, 'merges' also utilize weight averaging, and spherical linear interpolation (SLERP) is considered to be superior to linear averaging. Accordingly, I used
+A common strategy to prevent catastrophic forgetting is weight averaging. In the LM community, 'merges' also utilize weight averaging, and spherical linear interpolation (SLERP) is considered to be superior to linear averaging. Accordingly, I used SLERP to average the resultant model back with the original Memphis-CoT model.
 
 This resulted in a model that has learned from the new data, without completely forgetting what it has learned from the original Memphis-CoT training.
 
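For reference, a SLERP merge of the finetuned checkpoint back into the base model can be sketched roughly as below. This is a minimal hand-rolled illustration, not the exact merge pipeline behind this README: the model ID, the local checkpoint paths, and the interpolation ratio t=0.5 are assumptions.

```python
# Minimal sketch: SLERP weight averaging between a base model and a finetuned
# checkpoint. Model ID, paths, and t=0.5 are placeholders, not the author's
# actual configuration.
import torch
from transformers import AutoModelForCausalLM

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors, treated as flat vectors."""
    a, b = w0.flatten().float(), w1.flatten().float()
    a_n, b_n = a / (a.norm() + eps), b / (b.norm() + eps)
    omega = torch.acos(torch.clamp(torch.dot(a_n, b_n), -1.0, 1.0))
    if omega.abs() < eps:
        # Nearly parallel directions: fall back to plain linear interpolation.
        out = (1 - t) * a + t * b
    else:
        sin_omega = torch.sin(omega)
        out = (torch.sin((1 - t) * omega) / sin_omega) * a + (torch.sin(t * omega) / sin_omega) * b
    return out.reshape(w0.shape).to(w0.dtype)

# Placeholder identifiers: base Memphis-CoT model and the post-finetune checkpoint.
base = AutoModelForCausalLM.from_pretrained("euclaise/Memphis-CoT-3B")
tuned = AutoModelForCausalLM.from_pretrained("./memphis-cot-finetuned")

base_sd, tuned_sd = base.state_dict(), tuned.state_dict()
merged_sd = {
    # Interpolate floating-point weights; copy non-float buffers from the base model.
    name: slerp(w0, tuned_sd[name], t=0.5) if w0.is_floating_point() else w0
    for name, w0 in base_sd.items()
}

base.load_state_dict(merged_sd)
base.save_pretrained("./memphis-cot-merged")
```

Here t controls how far the merged weights move toward the new finetune; 0.5 is just the midpoint, and in practice the ratio (or a dedicated merge tool) would be tuned against benchmarks such as GSM8K to balance new learning against forgetting.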