jukofyork committed
Commit 7299175
1 Parent(s): c5369fc

Update README.md

Files changed (1):
  1. README.md +1 -1
README.md CHANGED
@@ -119,7 +119,7 @@
  - I used a much lower learning rate of `5e-6`, as the `5e-5` value used by [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3) dropped the evaluation loss *far* too quickly (likely due to adapting `down_proj` only being "almost convex").
  - I set `lora_dropout = 0.0`, as dropout doesn't really make sense with `epochs = 1`.
  - I left `weight_decay = 0.01`, but I'm not convinced it is doing anything useful, and it may even be harming the adaptation of the early `down_proj` matrices, where the gradient signal is likely to be much weaker.
- - I found via experimentation that setting `lora_rank` and `lora_alpha` to a very low value (as a form of [Spectral Regularization](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3)) can cause the training to get stuck at [saddle points](https://en.wikipedia.org/wiki/Saddle_point), particularly if using stock SGD instead of Adam.
+ - I found via experimentation that setting `lora_rank` and `lora_alpha` to a very low value (as a form of [Spectral Regularization](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3)) can cause the training to get stuck at [saddle points](https://en.wikipedia.org/wiki/Saddle_point), as confirmed in [this](https://arxiv.org/abs/2402.11867) paper, particularly if using stock SGD instead of Adam.
  - In general, I relied mainly on early stopping for regularization and deliberately set out to *undertrain* the models (we can always increase the size of the dataset later...).
 
  ## `config_creative_writer.toml`
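
Taken together, the bullets in this diff imply a configuration fragment along these lines. This is a hedged sketch only: the exact key names (e.g. `lr`, `epochs`) depend on the training framework being used, and the `lora_rank`/`lora_alpha` values shown are illustrative placeholders, not values stated anywhere in the diff:

```toml
# Hypothetical fragment of config_creative_writer.toml reflecting the choices
# discussed above; key names are assumptions, not confirmed by the diff.
lr = 5e-6             # much lower than Storywriter's 5e-5
epochs = 1
lora_dropout = 0.0    # dropout is pointless with a single epoch
weight_decay = 0.01   # left in place, but of questionable benefit
lora_rank = 64        # placeholder: avoid very low values (saddle-point risk)
lora_alpha = 64       # placeholder
```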
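
The saddle-point behaviour described in the changed line can be illustrated with a toy factorized least-squares problem (a sketch of the general phenomenon, not the cited paper's construction): when fitting a product `B @ A` to a target matrix from a LoRA-style near-zero initialization, both factor gradients vanish near the origin, so plain SGD stalls there while Adam's per-coordinate normalization escapes quickly. All names below are illustrative.

```python
import numpy as np

# Toy illustration of the saddle point at the origin of a factorized
# ("LoRA-style") parameterization: fit the low-rank product B @ A to M.
rng = np.random.default_rng(0)
d, r = 8, 2
M = rng.standard_normal((d, d))  # target update the product B @ A should match

def loss_and_grads(A, B):
    R = B @ A - M                                  # residual
    return 0.5 * np.sum(R * R), B.T @ R, R @ A.T   # loss, dL/dA, dL/dB

def train(optimizer, steps=200, lr=1e-2, init_scale=1e-8):
    A = init_scale * rng.standard_normal((r, d))
    B = np.zeros((d, r))          # LoRA-style: B starts at exactly zero
    mA = vA = mB = vB = 0.0
    for t in range(1, steps + 1):
        _, gA, gB = loss_and_grads(A, B)
        if optimizer == "sgd":
            A -= lr * gA          # gradients are ~0 near the origin, so
            B -= lr * gB          # SGD barely moves for many steps
        else:                     # hand-rolled Adam with bias correction
            b1, b2, eps = 0.9, 0.999, 1e-8
            mA = b1 * mA + (1 - b1) * gA; vA = b2 * vA + (1 - b2) * gA**2
            mB = b1 * mB + (1 - b1) * gB; vB = b2 * vB + (1 - b2) * gB**2
            A -= lr * (mA / (1 - b1**t)) / (np.sqrt(vA / (1 - b2**t)) + eps)
            B -= lr * (mB / (1 - b1**t)) / (np.sqrt(vB / (1 - b2**t)) + eps)
    return loss_and_grads(A, B)[0]

init_loss = 0.5 * np.sum(M * M)   # loss with B @ A ~= 0
sgd_loss = train("sgd")           # stays pinned near the origin saddle
adam_loss = train("adam")         # normalized steps escape the flat region
```

Under the same learning rate and step budget, SGD's loss stays close to the initial loss while Adam's drops well below it, matching the "particularly if using stock SGD instead of Adam" caveat.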