VB-LoRA: Extreme Parameter Efficient Fine-Tuning with Vector Banks
Overview
VB-LoRA is a parameter-efficient fine-tuning technique that extends LoRA by learning a fine-grained parameter-sharing scheme at the sub-vector level, achieving significantly higher parameter efficiency. This makes VB-LoRA especially useful in scenarios where storage and transmission costs are critical. It works by decomposing low-rank matrices—from different layers and modules such as K, Q, V, and FFN—into sub-vectors, which are then globally shared through a vector bank.
The abstract from the paper is:
As the adoption of large language models increases and the need for per-user or per-task model customization grows, the parameter-efficient fine-tuning (PEFT) methods, such as low-rank adaptation (LoRA) and its variants, incur substantial storage and transmission costs. To further reduce stored parameters, we introduce a “divide-and-share” paradigm that breaks the barriers of low-rank decomposition across matrix dimensions, modules and layers by sharing parameters globally via a vector bank. As an instantiation of the paradigm to LoRA, our proposed VB-LoRA composites all the low-rank matrices of LoRA from a shared vector bank with a differentiable top-k admixture module. VB-LoRA achieves extreme parameter efficiency while maintaining comparable or better performance compared to state-of-the-art PEFT methods. Extensive experiments demonstrate the effectiveness of VB-LoRA on natural language understanding, natural language generation, and instruction tuning tasks. When fine-tuning the Llama2-13B model, VB-LoRA only uses 0.4% of LoRA’s stored parameters, yet achieves superior results.
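The top-k admixture mentioned in the abstract can be illustrated with a small, dependency-free sketch. It assumes that each sub-vector is formed as a softmax-weighted sum of the k bank vectors with the largest logits; the function name topk_admixture and the toy sizes are hypothetical and are not taken from the PEFT implementation.

```python
import math
import random


def topk_admixture(logits, vector_bank, k=2):
    """Compose one sub-vector as a softmax-weighted sum of the k bank
    vectors with the largest logits (illustrative only, not PEFT's code)."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the selected logits, so the weights sum to 1.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    length = len(vector_bank[0])
    sub_vector = [
        sum(w * vector_bank[i][j] for w, i in zip(weights, top))
        for j in range(length)
    ]
    return sub_vector, top, weights


random.seed(0)
num_vectors, vector_length = 8, 4  # toy sizes; real banks are far larger
bank = [
    [random.uniform(-0.02, 0.02) for _ in range(vector_length)]
    for _ in range(num_vectors)
]

# One row of length 12 is the concatenation of 12 / 4 = 3 sub-vectors,
# each with its own logits over the same shared bank.
row = []
for _ in range(12 // vector_length):
    logits = [random.gauss(0.0, 0.1) for _ in range(num_vectors)]
    sub_vector, indices, weights = topk_admixture(logits, bank, k=2)
    row.extend(sub_vector)
```

Because every sub-vector draws from the same bank, only the bank itself plus a few indices and weights per sub-vector need to be kept once training is finished.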
Usage Tips
- VB-LoRA utilizes a sparse top-k module to learn the sharing mechanism. When saving adapter parameters, you can either save only the top-k weights and their indices by setting save_only_topk_weights = True in VBLoRAConfig, or save all the trainable logits by setting it to False. Enabling save_only_topk_weights = True significantly reduces storage space; for instance, in Llama2-7B, the storage file size decreases from 308MB to 2.5MB. Note that models saved with save_only_topk_weights = True are intended for merging or inference only and cannot be used to resume training.
- VB-LoRA has two sets of training parameters: vector bank parameters and logit parameters. In practice, we found that logit parameters require a higher learning rate, while vector bank parameters require a lower learning rate. When using the AdamW optimizer, typical learning rates are 0.01 for logits and 0.001 for vector bank parameters.
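The two learning rates suggested above can be passed to an optimizer as separate parameter groups. The following is a minimal sketch, assuming that logit parameters carry "vblora_logits" in their names and the bank carries "vblora_vector_bank"; check model.named_parameters() in your PEFT version before relying on these substrings. The helper build_param_groups and its other_lr argument are hypothetical, and stand-in objects replace real tensors so the snippet runs without loading a model.

```python
from types import SimpleNamespace


def build_param_groups(named_parameters, logits_lr=1e-2, bank_lr=1e-3, other_lr=1e-3):
    """Split trainable parameters into logit and vector-bank groups by name."""
    logits, bank, other = [], [], []
    for name, param in named_parameters:
        if not getattr(param, "requires_grad", True):
            continue  # skip frozen base-model weights
        if "vblora_logits" in name:        # assumed naming convention
            logits.append(param)
        elif "vblora_vector_bank" in name:  # assumed naming convention
            bank.append(param)
        else:
            other.append(param)  # e.g. a head listed in modules_to_save
    groups = [
        {"params": logits, "lr": logits_lr},
        {"params": bank, "lr": bank_lr},
    ]
    if other:
        groups.append({"params": other, "lr": other_lr})
    return groups


# Stand-in objects so the snippet runs without loading a model.
fake_params = [
    ("layers.0.q_proj.vblora_logits_A.default", SimpleNamespace(requires_grad=True)),
    ("vblora_vector_bank.default", SimpleNamespace(requires_grad=True)),
    ("layers.0.q_proj.weight", SimpleNamespace(requires_grad=False)),
]
groups = build_param_groups(fake_params)

# With a real PEFT model this would become, for example:
#   optimizer = torch.optim.AdamW(build_param_groups(model.named_parameters()))
```

Grouping by name keeps the optimizer setup independent of which modules were targeted; the learning rate for any remaining trainable parameters is a free choice and is not prescribed by the tips above.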
VBLoRAConfig
class peft.VBLoRAConfig
( task_type: typing.Union[str, peft.utils.peft_types.TaskType, NoneType] = None, peft_type: typing.Union[str, peft.utils.peft_types.PeftType, NoneType] = None, auto_mapping: typing.Optional[dict] = None, base_model_name_or_path: typing.Optional[str] = None, revision: typing.Optional[str] = None, inference_mode: bool = False, r: int = 4, num_vectors: int = 256, vector_length: int = 256, topk: int = 2, target_modules: Optional[Union[list[str], str]] = None, exclude_modules: Optional[Union[list[str], str]] = None, save_only_topk_weights: bool = False, vblora_dropout: float = 0.0, fan_in_fan_out: bool = False, bias: str = 'none', modules_to_save: Optional[list[str]] = None, init_vector_bank_bound: float = 0.02, init_logits_std: float = 0.1, layers_to_transform: Optional[Union[list[int], int]] = None, layers_pattern: Optional[Union[list[str], str]] = None )
Parameters
- r (int) — The rank of incremental matrices.
- num_vectors (int) — Number of vectors in the vector bank. Use higher values when the model size increases.
- vector_length (int) — The length of the vectors in the vector bank. The hidden dimension of the model should be divisible by the vector length.
- topk (int) — The K value for top-K selection. A larger value of K increases the size of the saved model. In practice, setting K=2 typically provides the best performance and parameter efficiency. For more details, refer to the discussion in the paper.
- target_modules (Union[List[str], str]) — The names of the modules to apply the adapter to. If this is specified, only the modules with the specified names will be replaced. When passing a string, a regex match will be performed. When passing a list of strings, either an exact match will be performed or it is checked if the name of the module ends with any of the passed strings. If this is specified as ‘all-linear’, then all linear/Conv1D modules are chosen, excluding the output layer. If this is not specified, modules will be chosen according to the model architecture. If the architecture is not known, an error will be raised — in this case, you should specify the target modules manually.
- exclude_modules (Optional[Union[List[str], str]]) — The names of the modules to not apply the adapter to. When passing a string, a regex match will be performed. When passing a list of strings, either an exact match will be performed or it is checked if the name of the module ends with any of the passed strings.
- save_only_topk_weights (bool) — Whether to only save the topk weights. Setting save_only_topk_weights = True significantly reduces storage space. However, models saved in this mode can be used for merging or inference only, not for resuming training.
- vblora_dropout (float) — The dropout probability for VBLoRA layers.
- fan_in_fan_out (bool) — Set this to True if the layer to replace stores weight like (fan_in, fan_out). For example, gpt-2 uses Conv1D which stores weights like (fan_in, fan_out) and hence this should be set to True.
- bias (str) — Bias type for VBLoRA. Can be ‘none’, ‘all’ or ‘vblora_only’. If ‘all’ or ‘vblora_only’, the corresponding biases will be updated during training. Be aware that this means that, even when disabling the adapters, the model will not produce the same output as the base model would have without adaptation.
- modules_to_save (List[str]) — List of modules apart from VBLoRA layers to be set as trainable and saved in the final checkpoint.
- init_vector_bank_bound (float) — The vector bank is initialized with a uniform distribution between -init_vector_bank_bound and init_vector_bank_bound. Avoid initializing the vector bank with all zeros to prevent zero gradients. A small value, such as 0.02, is typically effective. Initializing with a large value may cause training instability.
- init_logits_std (float) — The logits are initialized with a normal distribution with a standard deviation of init_logits_std. Default is 0.1.
- layers_to_transform (Union[List[int], int]) — The layer indices to transform. If a list of ints is passed, it will apply the adapter to the layer indices that are specified in this list. If a single integer is passed, it will apply the transformations on the layer at this index.
- layers_pattern (Optional[Union[List[str], str]]) — The layer pattern name, used only if layers_to_transform is different from None. This should target the nn.ModuleList of the model, which is often called 'layers' or 'h'.
This is the configuration class to store the configuration of a VBLoRAModel.
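To see how r, num_vectors and vector_length interact, the sketch below counts sub-vectors and trainable parameters for one configuration. It assumes that both the input and output dimension of each adapted layer must be divisible by vector_length, and that every sub-vector has one logit per bank vector. The helper count_sub_vectors and the layer shapes (hidden size 768, FFN size 3072, 12 layers) are illustrative assumptions, not values read from a real model.

```python
def count_sub_vectors(in_features, out_features, r, vector_length):
    """Number of bank-sized sub-vectors in one adapted layer's A and B factors."""
    for name, dim in (("in_features", in_features), ("out_features", out_features)):
        if dim % vector_length != 0:
            raise ValueError(
                f"{name}={dim} is not divisible by vector_length={vector_length}"
            )
    # A has shape (r, in_features); B has shape (out_features, r).
    return r * (in_features + out_features) // vector_length


# Assumed shapes for a 12-layer model with hidden size 768 and FFN size 3072.
hidden, ffn, num_layers = 768, 3072, 12
r, num_vectors, vector_length = 4, 60, 256

per_layer = (
    4 * count_sub_vectors(hidden, hidden, r, vector_length)  # q, k, v, out projections
    + count_sub_vectors(hidden, ffn, r, vector_length)       # fc1
    + count_sub_vectors(ffn, hidden, r, vector_length)       # fc2
)
total_sub_vectors = num_layers * per_layer
bank_params = num_vectors * vector_length       # shared across all layers
logit_params = total_sub_vectors * num_vectors  # one logit per (sub-vector, bank vector)

print(total_sub_vectors, bank_params, logit_params)
```

In this setting the logits account for far more trainable parameters than the bank, which is consistent with save_only_topk_weights giving much smaller checkpoints: only the selected entries per sub-vector are kept instead of a full row of logits.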
VBLoRAModel
class peft.VBLoRAModel
( model, config, adapter_name, low_cpu_mem_usage: bool = False ) → torch.nn.Module
Parameters
- model (PreTrainedModel) — The model to be adapted.
- config (VBLoRAConfig) — The configuration of the VBLoRA model.
- adapter_name (str) — The name of the adapter, defaults to "default".
- low_cpu_mem_usage (bool, optional, defaults to False) — Create empty adapter weights on meta device. Useful to speed up the loading process.
Returns
torch.nn.Module
The VBLoRA model.
Creates VBLoRA model from a pretrained transformers model.
The method is described in detail in https://arxiv.org/abs/2405.15179.
Example:
>>> from transformers import AutoModelForCausalLM
>>> from peft import VBLoRAConfig, get_peft_model
>>> base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
>>> config = VBLoRAConfig(
...     task_type="CAUSAL_LM",  # matches the AutoModelForCausalLM base model
...     r=4,
...     target_modules=["fc1", "fc2", "k_proj", "out_proj", "q_proj", "v_proj"],
...     num_vectors=60,
...     vector_length=256,  # OPT-125m layer sizes (768 and 3072) are divisible by 256
...     save_only_topk_weights=True,
... )
>>> model = get_peft_model(base_model, config)
Attributes:
- model (PreTrainedModel) — The model to be adapted.
- peft_config (VBLoRAConfig) — The configuration of the VBLoRA model.
delete_adapter
( adapter_name: str )
Deletes an existing adapter.
disable_adapter_layers
Disable all adapters.
When disabling all adapters, the model output corresponds to the output of the base model.
enable_adapter_layers
Enable all adapters.
Call this if you have previously disabled all adapters and want to re-enable them.
get_nb_savable_parameters
Returns the number of savable VB-LoRA parameters and other savable parameters.
merge_and_unload
( progressbar: bool = False, safe_merge: bool = False, adapter_names: Optional[list[str]] = None )
Parameters
- progressbar (bool) — Whether to show a progress bar indicating the unload and merge process.
- safe_merge (bool) — Whether to activate the safe merging check, which checks for potential NaN values in the adapter weights.
- adapter_names (list[str], optional) — The list of adapter names that should be merged. If None, all active adapters will be merged. Defaults to None.
This method merges the VBLoRA layers into the base model. This is needed if someone wants to use the base model as a standalone model.
Example:
>>> from transformers import AutoModelForCausalLM
>>> from peft import PeftModel
>>> base_model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-40b")
>>> peft_model_id = "smangrul/falcon-40B-int4-peft-lora-sfttrainer-sample"
>>> model = PeftModel.from_pretrained(base_model, peft_model_id)
>>> merged_model = model.merge_and_unload()
print_savable_parameters
Prints the number of savable VB-LoRA parameters and total savable parameters.
set_adapter
( adapter_name: str | list[str] )
Set the active adapter(s).
Additionally, this function will set the specified adapters to trainable (i.e., requires_grad=True). If this is not desired, set requires_grad=False on the relevant parameters after calling it.
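One way to undo the trainable flag after set_adapter is to loop over the named parameters and switch off requires_grad for the ones you want frozen. The sketch below assumes a name-based filter (here the substring "vblora_", which is an assumption about parameter naming); the helper freeze_matching is hypothetical, and stand-in objects are used so the snippet runs without a model.

```python
from types import SimpleNamespace


def freeze_matching(named_parameters, substring):
    """Set requires_grad=False on every parameter whose name contains substring."""
    frozen = []
    for name, param in named_parameters:
        if substring in name:
            param.requires_grad = False
            frozen.append(name)
    return frozen


# Stand-in parameters; with a real model, pass model.named_parameters() instead.
params = [
    ("layers.0.q_proj.vblora_logits_A.default", SimpleNamespace(requires_grad=True)),
    ("layers.0.q_proj.vblora_logits_B.default", SimpleNamespace(requires_grad=True)),
    ("score.weight", SimpleNamespace(requires_grad=True)),
]
frozen_names = freeze_matching(params, "vblora_")
```

Returning the list of frozen names makes it easy to confirm that the filter matched the parameters you intended and nothing else.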
unload
Gets back the base model by removing all the VBLoRA modules without merging. This gives back the original base model.