Update README.md

README.md CHANGED

@@ -31,11 +31,13 @@ Method by method comparison, initial evaluation loss on Cosmopedia data:
* QLoRA fine-tuning, rank 256, scale factor 1, batch 8: 1.102
* GaLore tuning, rank 256, scale factor 1, batch 8: 1.182
* This Model Stock merge of all 4 training methods: 1.038
+* Model Stock 3/4 Methods (all except full tuning): 1.021
* Control (cosmo-1b): 1.003

Training set validation results:

* Cosmo-1b Starting Eval Loss: ~0.65
+* Model Stock 3/4 Loss: 0.451
* Model Stock Loss: 0.40211
* LISA Loss: 0.2534
* GaLore Loss: 0.2426
@@ -45,7 +47,6 @@ Training set validation results:
Overall... not sure what to make of this, beyond that high-rank QLoRA is doing something particularly impressive while using only about 6GB of vRAM.
The Model Stock merge between the 4 different tuning methods clearly recovered a lot of original knowledge, at the cost of something like half the adaptation to new data.
Of course, cosmo-1b was already pretty good at predicting the new data, narrow and task-focused as it was.
-Might want to try another stock merge that is less busy trying to fix the full tuning's forgetfulness.

## Merge Details
### Merge Method
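On the vRAM point: QLoRA keeps the base weights frozen in 4-bit and trains only the adapters, which is how a rank as high as 256 can still fit in a few GB. A sketch of such a setup with peft and bitsandbytes; only `r=256`, the scale factor of 1, and the cosmo-1b base come from the comparison above, the rest is assumed:

```python
# Rank-256 QLoRA sketch with a 4-bit NF4 base model.
# Target modules and other hyperparameters are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                     # the "Q" in QLoRA
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/cosmo-1b", quantization_config=bnb
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=256,           # rank 256, as in the comparison
    lora_alpha=256,  # scale = lora_alpha / r = 1, matching "scale factor 1"
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```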
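GaLore takes the opposite route: it updates the full weights but projects the gradients into a rank-256 subspace to cut optimizer memory. Recent transformers releases wire it into the Trainer; in this sketch only rank 256, scale 1, and batch 8 come from the comparison, and the `galore-torch` package is assumed to be installed:

```python
# GaLore sketch via transformers' built-in optimizer plumbing.
# Requires the galore-torch package; the module regexes and the
# projection-update gap are assumptions.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="galore-cosmo-1b",
    per_device_train_batch_size=8,         # batch 8, as above
    optim="galore_adamw",                  # GaLore-wrapped AdamW
    optim_target_modules=["attn", "mlp"],  # regex-matched layers to project
    optim_args="rank=256, scale=1.0, update_proj_gap=200",
)
```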
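As for the merge method itself: Model Stock (Jang et al., 2024) averages the fine-tuned weights and then interpolates back toward the pretrained anchor, with the interpolation ratio derived from the angle between the task vectors. That built-in pull toward cosmo-1b is exactly why the merge recovers original knowledge while giving up part of the adaptation. A toy per-tensor sketch of the idea, assuming at least two fine-tuned models; mergekit's actual implementation differs in its details:

```python
# Toy per-tensor Model Stock interpolation.
import torch

def model_stock_layer(base: torch.Tensor, tuned: list[torch.Tensor]) -> torch.Tensor:
    """Average fine-tuned tensors, then interpolate back toward the base."""
    deltas = [t - base for t in tuned]
    n = len(deltas)
    # Average pairwise cosine similarity between the task vectors.
    cos_vals = [
        torch.nn.functional.cosine_similarity(
            deltas[i].flatten(), deltas[j].flatten(), dim=0
        )
        for i in range(n) for j in range(i + 1, n)
    ]
    cos_theta = torch.stack(cos_vals).mean().clamp(min=0.0)
    # Interpolation ratio from the paper: t = N*cos / (1 + (N-1)*cos).
    t = n * cos_theta / (1 + (n - 1) * cos_theta)
    avg = torch.stack(tuned).mean(dim=0)
    return t * avg + (1 - t) * base
```

The closer the task vectors are to orthogonal (cos θ near 0), the harder the merge snaps back to the base weights, which is consistent with the merged eval losses landing between cosmo-1b and the individual tunes.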