SanjiWatsuki committed
Commit 9be1efd
1 Parent(s): ed9a94e

Update README.md

Files changed (1): README.md (+8, -0)
README.md CHANGED
---
license: other
license_name: microsoft-research-license
license_link: LICENSE
---

This model is an experiment in combining a DARE TIES merge with a task arithmetic merge, in an attempt to merge models with less loss of capability.

DARE TIES merges are [very strong at transferring strengths](https://medium.com/@minh.hoque/paper-explained-language-models-are-super-mario-2ebce6c2cf35) while modifying only a minimal portion of the model. For larger models, 90-99% of the delta parameters from an SFT model can be dropped while retaining most of its benefits, provided the surviving deltas are rescaled and consensus-merged back into the base model.
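
As a rough per-tensor illustration of the drop-and-rescale step described above, here is a minimal sketch assuming PyTorch tensors; the `dare_delta` name and the 90% default drop rate are illustrative, and the TIES-style sign-consensus step of a full DARE TIES merge is omitted:

```python
import torch

def dare_delta(sft_weight: torch.Tensor, base_weight: torch.Tensor,
               drop_rate: float = 0.9) -> torch.Tensor:
    """Sketch of DARE: randomly drop delta parameters and rescale the rest."""
    delta = sft_weight - base_weight
    # Keep each delta parameter with probability (1 - drop_rate)...
    keep_mask = torch.bernoulli(torch.full_like(delta, 1.0 - drop_rate))
    # ...and rescale the survivors so the expected delta is unchanged.
    return delta * keep_mask / (1.0 - drop_rate)
```

In a full DARE TIES merge, the rescaled deltas from the contributing SFT models would then be sign-consensus merged (the TIES step) before being added back onto the base weights.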

For 7B models, we can't drop as many of the parameters and still retain the model's strengths. In the original paper, the WizardMath model showed transferable skills when 90% of its delta parameters were dropped, but transferred more strength when only 70% were dropped. Experimentally, [even lower drop rates like 40%](https://github.com/cg123/mergekit/issues/26) appear to have performed best, even for larger 34B models. In some instances, [densities as high as 80% produce an unstable merge](https://huggingface.co/jan-hq/supermario-v1), making DARE TIES unsuitable for merging those models.

This is an experiment in using the two merge techniques together to try to transfer skills between finetuned models: if we were to DARE TIES a low-density merge onto the base Mistral model and then task-arithmetic merge those low-density delta weights onto a finetune, could we still achieve skill transfer?
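
A minimal sketch of that two-step idea, reusing the hypothetical `dare_delta` helper from the sketch above and again working per tensor; the 0.5 task-arithmetic weight is an arbitrary placeholder rather than the value actually used for this model:

```python
import torch  # continues from the dare_delta sketch above

def dare_then_task_arithmetic(sft_weight: torch.Tensor,
                              base_weight: torch.Tensor,
                              finetune_weight: torch.Tensor,
                              drop_rate: float = 0.9,
                              weight: float = 0.5) -> torch.Tensor:
    """Step 1: extract a sparse, rescaled DARE delta against the base model.
    Step 2: task-arithmetic merge that delta onto a different finetune."""
    sparse_delta = dare_delta(sft_weight, base_weight, drop_rate)
    return finetune_weight + weight * sparse_delta
```

In practice, a pipeline like this would be run with mergekit's `dare_ties` and `task_arithmetic` merge methods rather than by hand; the sketch only shows how the two pieces of arithmetic are combined.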