Lambent committed
Commit 3ca31ed (1 parent: 562bed5)

Update README.md

Files changed (1): README.md (+2 -1)
README.md CHANGED
@@ -31,11 +31,13 @@ Method by method comparison, initial evaluation loss on Cosmopedia data:
 * Qlora fine-tuning, rank 256, scale factor 1, batch 8: 1.102
 * Galore tuning, rank 256, scale factor 1, batch 8: 1.182
 * This Model Stock merge of all 4 training methods: 1.038
+* Model Stock 3/4 Methods (all except full tuning): 1.021
 * Control (cosmo-1b): 1.003
 
 Training set validation results:
 
 * Cosmo-1b Starting Eval Loss: ~0.65
+* Model Stock 3/4 Loss: 0.451
 * Model Stock Loss: 0.40211
 * LISA Loss: 0.2534
 * GaLore Loss: 0.2426
@@ -45,7 +47,6 @@ Training set validation results:
 Overall ... not sure what to make of this, beyond that high-rank QLoRA is doing something particularly impressive while using only like 6GB of vRAM.
 The Model Stock merge between the 4 different tuning methods clearly recovered a lot of original knowledge, at the cost of something like half the adaptation to new data.
 Of course, cosmo-1b was already pretty good at predicting the new data, narrow and task-focused as it was.
-Might want to try another stock merge that is less busy trying to fix the full tuning's forgetfulness.
 
 ## Merge Details
 ### Merge Method
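As a point of reference for the QLoRA line in the comparison above, "rank 256, scale factor 1" means an adapter with lora_alpha equal to the rank. A minimal sketch of that setup with Hugging Face transformers, peft, and bitsandbytes follows, assuming (as the comparison suggests) that cosmo-1b is the base model; the target modules, dropout, and other details are illustrative assumptions, not the exact configuration used for these runs.

```python
# A minimal sketch of the QLoRA setup compared above: 4-bit base weights plus a
# rank-256 adapter with scale factor 1 (lora_alpha == r). Target modules and
# dropout are illustrative assumptions, not the exact config for these runs.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # QLoRA keeps the base model in 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/cosmo-1b",                # base model used as the control above
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=256,                                   # high-rank adapter
    lora_alpha=256,                          # alpha / r = 1, the "scale factor 1"
    lora_dropout=0.0,                        # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only the adapter weights train
```

Quantizing the frozen base to 4-bit is what keeps the memory footprint in the ~6GB range noted above, even at rank 256.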
 
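For readers unfamiliar with the merge method being compared, here is a per-tensor sketch of the Model Stock interpolation (Jang et al., 2024), assuming the merge follows the paper: the fine-tuned weights are averaged, then pulled back toward the base model by a ratio derived from how much the individual tunings agree. The function and names are hypothetical; this is not the actual merge code used for this model.

```python
# A per-tensor sketch of the Model Stock interpolation (Jang et al., 2024),
# assuming the merge follows the paper; names are hypothetical, and this is
# not the actual merge implementation used for this model.
import torch
import torch.nn.functional as F


def model_stock(base: torch.Tensor, tuned: list[torch.Tensor]) -> torch.Tensor:
    """Merge k fine-tuned tensors with the shared base tensor they started from."""
    k = len(tuned)
    # Task vectors: each tuned tensor's offset from the base, flattened.
    deltas = [(w - base).flatten() for w in tuned]
    # Average pairwise cosine similarity: how much the tunings agree.
    pairs = [
        F.cosine_similarity(deltas[i], deltas[j], dim=0)
        for i in range(k)
        for j in range(i + 1, k)
    ]
    cos = torch.stack(pairs).mean()
    # Interpolation ratio from the paper: t = k*cos / ((k - 1)*cos + 1).
    # Agreeing tunings (cos near 1) give t near 1; diverging ones pull t toward 0.
    t = k * cos / ((k - 1) * cos + 1)
    # Average the tunings, then interpolate back toward the base weights.
    return t * torch.stack(tuned).mean(dim=0) + (1 - t) * base
```

On this reading, the trade-off observed above is what the formula predicts: when the four tunings only partially agree, t lands well below 1 and the merge sits closer to cosmo-1b, recovering original knowledge while giving back part of the adaptation to the new data.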