Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch
Abstract
In this paper, we uncover that Language Models (LMs), either encoder- or decoder-based, can obtain new capabilities by assimilating the parameters of homologous models without retraining or GPUs. Typically, new abilities of LMs can be imparted by Supervised Fine-Tuning (SFT), reflected in the disparity between fine-tuned and pre-trained parameters (i.e., delta parameters). We first observe that, by introducing a novel operation called DARE (Drop And REscale), most delta parameters can be directly set to zero without affecting the capabilities of SFT LMs, and that larger models can tolerate a higher proportion of discarded parameters. Based on this observation, we further sparsify the delta parameters of multiple SFT homologous models with DARE and subsequently merge them into a single model by parameter averaging. We conduct experiments on eight datasets from the GLUE benchmark with BERT and RoBERTa. We also merge WizardLM, WizardMath, and Code Alpaca based on Llama 2. Experimental results show that: (1) The delta parameter value ranges for SFT models are typically small, often within 0.005, and DARE can eliminate 99% of them effortlessly. However, once the models are continuously pre-trained, the value ranges can grow to around 0.03, making DARE impractical. We have also tried to remove fine-tuned parameters instead of delta parameters and find that a 10% reduction can lead to drastically decreased performance (even to 0). This highlights that SFT merely stimulates the abilities via delta parameters rather than injecting new abilities into LMs; (2) DARE can merge multiple task-specific LMs into one LM with diverse abilities. For instance, the merger of WizardLM and WizardMath improves the GSM8K zero-shot accuracy of WizardLM from 2.2 to 66.3, retaining its instruction-following ability while surpassing WizardMath's original performance of 64.2. Code is available at https://github.com/yule-BUAA/MergeLM.
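The drop-and-rescale step and the averaging merge described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation (see the MergeLM repository for that): the function names `dare` and `dare_merge` are hypothetical, plain Python lists stand in for model weight tensors, and a single drop rate is assumed to be shared by all models.

```python
import random


def dare(delta, drop_rate, rng):
    """Drop each delta parameter with probability drop_rate and
    rescale the survivors by 1 / (1 - drop_rate)."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_rate)
    return [0.0 if rng.random() < drop_rate else d * scale for d in delta]


def dare_merge(pretrained, finetuned_models, drop_rate, seed=0):
    """Sparsify each model's delta with DARE, then average the deltas
    and add them back onto the shared pre-trained weights."""
    rng = random.Random(seed)
    summed = [0.0] * len(pretrained)
    for finetuned in finetuned_models:
        # Delta parameters: fine-tuned minus pre-trained (same backbone assumed).
        delta = [f - p for f, p in zip(finetuned, pretrained)]
        sparse = dare(delta, drop_rate, rng)
        summed = [s + d for s, d in zip(summed, sparse)]
    n = len(finetuned_models)
    return [p + s / n for p, s in zip(pretrained, summed)]


if __name__ == "__main__":
    base = [0.10, -0.20, 0.30, 0.05]          # toy "pre-trained" weights
    model_a = [0.101, -0.198, 0.300, 0.052]   # toy fine-tuned model A
    model_b = [0.099, -0.200, 0.304, 0.050]   # toy fine-tuned model B
    print(dare_merge(base, [model_a, model_b], drop_rate=0.5))
```

Rescaling by 1 / (1 - drop_rate) keeps the expected value of each delta parameter unchanged after dropping. Note that every delta here is computed against the same pre-trained weights, so the sketch only applies to models fine-tuned from one backbone.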
Community
I would have really liked a citation for this
Human beings have always expressed their ambition to acquire additional abilities through various ways such as movies and games. For example, in X-Men's Apocalypse, the character can absorb the powers of other mutants to strengthen himself. Likewise, the protagonist in the Super Mario games can gain superpowers like throwing fireballs by absorbing in-game items.
This is a really cool idea. I'd be interested to see some similar work on merging LoRAs (or quantized variants), as this lowers the barrier to entry of fine-tuning by a lot.
Definitely!
@julien-c are there any efforts to integrate this amazing operation, DARE, with transformers or perhaps TRL?
Has anyone investigated why the authors chose 13B models for the merge studies? In a preliminary study, I tried DARE (p=0.2, 0.8) with average merging for WizardLM-7B and WizardMath-7B, and the result was incoherent garbage.
Hi,
I am the author of this paper.
The premise for model merging is that all the models are fine-tuned based on the same backbone. In your case, WizardLM-7B is fine-tuned from llama-7b-hf while WizardMath-7B is fine-tuned from Llama-2-7b-hf. Hence, these two models cannot be merged due to their different backbones.
I hope this answer can address your concerns.
That makes sense! Thank you @vanillaOVO