FPHam committed
Commit
0d7217c
1 Parent(s): db99fb0

Update README.md

Files changed (1):
  1. README.md +2 -0
README.md CHANGED
@@ -26,6 +26,8 @@ The training parameters are there not to ruin it - not make it better, so you do
 
 IMHO gradient accumulation will LOWER the quality if you can do more than a few batches. There may be a sweet spot somewhere, but IDK. Sure, batch 1 and GA 32 will be better than batch 1 and GA 1, but that's not the point - that's a bandaid.
 
+Edit: It could prevent overfitting, though, and hence help with generalization. It depends on the goal and on how diverse the dataset is.
+
 size of dataset matters when you are finetuning on base, but matters less when finetuning on a well finetuned model - in fact, sometimes less is better in that case, or you may be ruining a good previous finetuning.
 
 alpha = 2x rank seems like something that came from the old times when people had potato VRAM at most. I really don't feel like it makes much sense - it multiplies the weights and that's it (check the PEFT code). Making things louder also makes the noise louder.
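
To make the alpha point concrete: in PEFT-style LoRA code, the adapter output is scaled by alpha / r before being added to the base layer's output, so alpha = 2x rank just means the learned delta gets multiplied by 2. Below is a minimal, self-contained sketch of that logic (plain PyTorch, my own simplified class and names, not PEFT's actual implementation):

```python
import torch
import torch.nn as nn

class LoRALinearSketch(nn.Module):
    """Toy LoRA wrapper showing only how alpha enters the math."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base                            # frozen pretrained layer
        self.base.weight.requires_grad_(False)
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)          # adapter starts as a no-op
        self.scaling = alpha / r                    # alpha = 2x rank -> scaling = 2.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # base output + low-rank delta, multiplied by alpha / r - a bigger alpha
        # turns the whole delta up, useful signal and noise alike
        return self.base(x) + self.lora_B(self.lora_A(x)) * self.scaling
```

In this simplified view, changing alpha only rescales the adapter's contribution; it doesn't change what the adapter can learn.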
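
And on the gradient accumulation point: GA just accumulates gradients from several small forward/backward passes before a single optimizer step, so batch 1 + GA 32 approximates one update from a batch of 32 at the cost of more passes. A rough sketch of the generic pattern (plain PyTorch training-loop style, not any specific trainer's code; `model`, `loader`, and `optimizer` are assumed to exist):

```python
accum_steps = 32                      # micro-batch 1 x 32 steps ~ effective batch 32

optimizer.zero_grad()
for step, batch in enumerate(loader):
    loss = model(**batch).loss        # assumes an HF-style model that returns .loss
    (loss / accum_steps).backward()   # scale so the accumulated gradient is an average
    if (step + 1) % accum_steps == 0:
        optimizer.step()              # one weight update per 32 micro-batches
        optimizer.zero_grad()
```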