arxiv:2405.09673

LoRA Learns Less and Forgets Less

Published on May 15 · Submitted by akhaliq on May 17 · #2 Paper of the day

Abstract

Low-Rank Adaptation (LoRA) is a widely-used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low-rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning (≈100K prompt-response pairs) and continued pretraining (≈10B unstructured tokens) data regimes. Our results show that, in most settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA exhibits a desirable form of regularization: it better maintains the base model's performance on tasks outside the target domain. We show that LoRA provides stronger regularization compared to common techniques such as weight decay and dropout; it also helps maintain more diverse generations. We show that full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.
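
For readers who want to see concretely what "training only low-rank perturbations to selected weight matrices" means, here is a minimal PyTorch sketch (not the authors' code; the rank r and scaling alpha are illustrative defaults) of a LoRA update attached to a frozen linear layer:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, with A of shape (r, d_in) and B of shape (d_out, r)."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # only A and B receive gradients
        self.scaling = alpha / r
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```

Because the update B A has rank at most r, it can be merged into W after training, so the adapter adds no inference overhead.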

Community

Why doesn't the set of "MLP" target modules include gate_proj, but only up_proj and down_proj?

Paper author

Good question. We mention this in footnote 1 on page 3. We excluded it for historical reasons and to stay comparable with other 7B transformer architectures that do not have a gate_proj. Given the results, I anticipate that including it would increase target-domain performance but also increase forgetting.
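
For anyone who wants to try that variant, with the Hugging Face PEFT library (a generic sketch, not the paper's training code; the base model and hyperparameters are illustrative) including or excluding gate_proj is just a matter of the target_modules list:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative base model

config = LoraConfig(
    r=16,                     # illustrative rank/alpha, not necessarily the paper's settings
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "down_proj", "gate_proj"],  # drop "gate_proj" to match the paper's setup
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # reports how few parameters are actually trained
```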

Is it possible to release the code for this interesting paper?

Interesting research! Thanks for publishing it. The hypothesis that higher-rank perturbations are associated with harder problems is particularly interesting to me.
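
As a rough way to probe that hypothesis on your own checkpoints, here is a sketch (not the paper's spectral analysis; the 90% energy threshold is an arbitrary choice) that estimates how many singular values of the full-finetuning perturbation ΔW = W_finetuned − W_base carry most of its mass:

```python
import torch

def effective_rank(w_base: torch.Tensor, w_finetuned: torch.Tensor, energy: float = 0.90) -> int:
    """Number of singular values of (W_finetuned - W_base) needed to capture
    `energy` of the total squared singular-value mass -- a crude rank proxy."""
    delta = (w_finetuned - w_base).float()
    s = torch.linalg.svdvals(delta)                      # singular values, descending order
    cumulative = torch.cumsum(s ** 2, dim=0) / (s ** 2).sum()
    return int((cumulative < energy).sum().item()) + 1
```

Comparing this number for a fully finetuned weight matrix against the r used in a LoRA run gives a feel for the 10-100X gap the abstract reports.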

I may be missing something, but have you considered benchmarking LoRA vs. full finetuning against compute cost / time rather than epochs / number of tokens? I think that might give a fairer reflection of the performance differences for some typical use cases (e.g., where sample efficiency is less important and the finetuning is intended to guide something like response style). It would also be interesting to see whether the greater sample efficiency of full finetuning is some function of the total number of parameters updated, or whether low-rank weight updates are genuinely unable to be sample efficient because the problem requires higher-rank updates.
