Update README.md
README.md CHANGED
@@ -26,6 +26,8 @@ The training parameters are there not to ruin it - not make it better, so you do

IMHO gradient accumulation will LOWER the quality if you can do more than a few batches. There may be a sweet spot somewhere, but IDK. Sure, batch 1 and GA 32 will be better than batch 1 and GA 1, but that's not the point; that's a bandaid (sketch below the diff).

+Edit: It could prevent overfitting, though, and hence help with generalization. It depends on the goal and how diverse the dataset is.
+
size of dataset matters when you are finetuning on a base model, but it matters less when finetuning on a well-finetuned model - in fact, sometimes less is better in that case, or you may be ruining a good previous finetuning.

alpha = 2x rank seems like something that came from the old times when people had potato VRAM at most. I really don't feel like it makes much sense - it just multiplies the LoRA weights and that's it (check the PEFT code, or the scaling sketch below the diff). Making things louder also makes the noise louder.
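
On the gradient accumulation point: the sketch below only shows mechanically what GA does in a typical PyTorch-style training loop - it changes how often the optimizer steps, so the effective batch is micro_batch_size x accum_steps. The names (`model`, `loader`, `optimizer`, `accum_steps`) are placeholders and the forward pass is assumed to return a Hugging Face-style output with a `.loss` field; treat it as a sketch, not this repo's training code.

```python
# Hypothetical gradient accumulation loop (illustration only, not this repo's code).
# Assumes an HF-style model whose forward returns an object with a `.loss` attribute.
def train_epoch(model, loader, optimizer, accum_steps=32):
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader):
        loss = model(**batch).loss
        # Divide so the accumulated gradient is an average over the
        # effective batch (micro_batch_size * accum_steps), not a sum.
        (loss / accum_steps).backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()       # one weight update per accum_steps micro-batches
            optimizer.zero_grad()
```

So batch 1 with GA 32 sees the same number of samples per weight update as batch 32, but it pays for that with 32 sequential forward/backward passes while only ever fitting one sample's activations in memory - which is the "bandaid" trade-off above.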
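
On the alpha = 2x rank point: in PEFT's LoRA code, `lora_alpha` only appears as a constant multiplier `scaling = lora_alpha / r` applied to the low-rank update, so alpha = 2x rank just means the delta is scaled by 2.0. The class below is a simplified stand-in for peft's `LoraLayer` (no dropout, no dtype handling), meant only to illustrate where alpha enters.

```python
import torch.nn as nn

# Simplified stand-in for peft's LoraLayer, showing where `alpha` is used.
class LoraLinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():    # pretrained weights stay frozen
            p.requires_grad_(False)
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # LoRA update starts at zero
        self.scaling = alpha / r            # alpha = 2x rank -> scaling = 2.0

    def forward(self, x):
        # alpha is just a constant gain on the LoRA delta, nothing more.
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
```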