Update README.md

README.md CHANGED

@@ -31,11 +31,13 @@ Method by method comparison, initial evaluation loss on Cosmopedia data:
* QLoRA fine-tuning, rank 256, scale factor 1, batch 8: 1.102
* GaLore tuning, rank 256, scale factor 1, batch 8: 1.182
* This Model Stock merge of all 4 training methods: 1.038
+* Model Stock 3/4 Methods (all except full tuning): 1.021
* Control (cosmo-1b): 1.003

Training set validation results:

* Cosmo-1b Starting Eval Loss: ~0.65
+* Model Stock 3/4 Loss: 0.451
* Model Stock Loss: 0.40211
* LISA Loss: 0.2534
* GaLore Loss: 0.2426
@@ -45,7 +47,6 @@ Training set validation results:
Overall... not sure what to make of this, beyond that high-rank QLoRA is doing something particularly impressive while using only about 6GB of vRAM.
The Model Stock merge between the 4 different tuning methods clearly recovered a lot of original knowledge, at the cost of something like half the adaptation to new data.
Of course, cosmo-1b was already pretty good at predicting the new data, narrow and task-focused as it was.
-Might want to try another stock merge that is less busy trying to fix the full tuning's forgetfulness.

## Merge Details
### Merge Method
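On the vRAM point: QLoRA keeps the base weights frozen in 4-bit and trains only the adapters, which is how a rank as high as 256 can still fit in a few GB. A sketch of such a setup with peft and bitsandbytes; only `r=256`, the scale factor of 1, and the cosmo-1b base come from the comparison above, the rest is assumed:

```python
# Rank-256 QLoRA sketch with a 4-bit NF4 base model.
# Target modules and other hyperparameters are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                     # the "Q" in QLoRA
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/cosmo-1b", quantization_config=bnb
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=256,           # rank 256, as in the comparison
    lora_alpha=256,  # scale = lora_alpha / r = 1, matching "scale factor 1"
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```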
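GaLore takes the opposite route: it updates the full weights but projects the gradients into a rank-256 subspace to cut optimizer memory. Recent transformers releases wire it into the Trainer; in this sketch only rank 256, scale 1, and batch 8 come from the comparison, and the `galore-torch` package is assumed to be installed:

```python
# GaLore sketch via transformers' built-in optimizer plumbing.
# Requires the galore-torch package; the module regexes and the
# projection-update gap are assumptions.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="galore-cosmo-1b",
    per_device_train_batch_size=8,         # batch 8, as above
    optim="galore_adamw",                  # GaLore-wrapped AdamW
    optim_target_modules=["attn", "mlp"],  # regex-matched layers to project
    optim_args="rank=256, scale=1.0, update_proj_gap=200",
)
```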
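As for the merge method itself: Model Stock (Jang et al., 2024) averages the fine-tuned weights and then interpolates back toward the pretrained anchor, with the interpolation ratio derived from the angle between the task vectors. That built-in pull toward cosmo-1b is exactly why the merge recovers original knowledge while giving up part of the adaptation. A toy per-tensor sketch of the idea, assuming at least two fine-tuned models; mergekit's actual implementation differs in its details:

```python
# Toy per-tensor Model Stock interpolation.
import torch

def model_stock_layer(base: torch.Tensor, tuned: list[torch.Tensor]) -> torch.Tensor:
    """Average fine-tuned tensors, then interpolate back toward the base."""
    deltas = [t - base for t in tuned]
    n = len(deltas)
    # Average pairwise cosine similarity between the task vectors.
    cos_vals = [
        torch.nn.functional.cosine_similarity(
            deltas[i].flatten(), deltas[j].flatten(), dim=0
        )
        for i in range(n) for j in range(i + 1, n)
    ]
    cos_theta = torch.stack(cos_vals).mean().clamp(min=0.0)
    # Interpolation ratio from the paper: t = N*cos / (1 + (N-1)*cos).
    t = n * cos_theta / (1 + (n - 1) * cos_theta)
    avg = torch.stack(tuned).mean(dim=0)
    return t * avg + (1 - t) * base
```

The closer the task vectors are to orthogonal (cos θ near 0), the harder the merge snaps back to the base weights, which is consistent with the merged eval losses landing between cosmo-1b and the individual tunes.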