PEFT documentation

LoRA

PEFT

You are viewing v0.14.0 version. A newer version v0.18.0 is available.

Join the Hugging Face community

and get access to the augmented documentation experience

Collaborate on models, datasets and Spaces

Faster examples with accelerated inference

Switch between documentation themes

to get started

LoRA

Low-Rank Adaptation (LoRA) is a PEFT method that decomposes a large matrix into two smaller low-rank matrices in the attention layers. This drastically reduces the number of parameters that need to be fine-tuned.

The abstract from the paper is:

We propose a neural language modeling system based on low-rank adaptation (LoRA) for speech recognition output rescoring. Although pretrained language models (LMs) like BERT have shown superior performance in second-pass rescoring, the high computational cost of scaling up the pretraining stage and adapting the pretrained models to specific domains limit their practical use in rescoring. Here we present a method based on low-rank decomposition to train a rescoring BERT model and adapt it to new domains using only a fraction (0.08%) of the pretrained parameters. These inserted matrices are optimized through a discriminative training objective along with a correlation-based regularization loss. The proposed low-rank adaptation Rescore-BERT (LoRB) architecture is evaluated on LibriSpeech and internal datasets with decreased training times by factors between 5.4 and 3.6..

LoraConfig

class peft.LoraConfig

( task_type: typing.Union[str, peft.utils.peft_types.TaskType, NoneType] = None peft_type: typing.Union[str, peft.utils.peft_types.PeftType, NoneType] = None auto_mapping: typing.Optional[dict] = None base_model_name_or_path: typing.Optional[str] = None revision: typing.Optional[str] = None inference_mode: bool = False r: int = 8 target_modules: Optional[Union[list[str], str]] = None exclude_modules: Optional[Union[list[str], str]] = None lora_alpha: int = 8 lora_dropout: float = 0.0 fan_in_fan_out: bool = False bias: Literal['none', 'all', 'lora_only'] = 'none' use_rslora: bool = False modules_to_save: Optional[list[str]] = None init_lora_weights: bool | Literal['gaussian', 'eva', 'olora', 'pissa', 'pissa_niter_[number of iters]', 'loftq'] = True layers_to_transform: Optional[Union[list[int], int]] = None layers_pattern: Optional[Union[list[str], str]] = None rank_pattern: Optional[dict] = <factory> alpha_pattern: Optional[dict] = <factory> megatron_config: Optional[dict] = None megatron_core: Optional[str] = 'megatron.core' loftq_config: Union[LoftQConfig, dict] = <factory> eva_config: Optional[EvaConfig] = None use_dora: bool = False layer_replication: Optional[list[tuple[int, int]]] = None runtime_config: LoraRuntimeConfig = <factory> lora_bias: bool = False )

Parameters

r (int) — Lora attention dimension (the “rank”).
target_modules (Optional[Union[List[str], str]]) — The names of the modules to apply the adapter to. If this is specified, only the modules with the specified names will be replaced. When passing a string, a regex match will be performed. When passing a list of strings, either an exact match will be performed or it is checked if the name of the module ends with any of the passed strings. If this is specified as ‘all-linear’, then all linear/Conv1D modules are chosen, excluding the output layer. If this is not specified, modules will be chosen according to the model architecture. If the architecture is not known, an error will be raised — in this case, you should specify the target modules manually.
exclude_modules (Optional[Union[List[str], str]]) — The names of the modules to not apply the adapter. When passing a string, a regex match will be performed. When passing a list of strings, either an exact match will be performed or it is checked if the name of the module ends with any of the passed strings.
lora_alpha (int) — The alpha parameter for Lora scaling.
lora_dropout (float) — The dropout probability for Lora layers.
fan_in_fan_out (bool) — Set this to True if the layer to replace stores weight like (fan_in, fan_out). For example, gpt-2 uses Conv1D which stores weights like (fan_in, fan_out) and hence this should be set to True.
bias (str) — Bias type for LoRA. Can be ‘none’, ‘all’ or ‘lora_only’. If ‘all’ or ‘lora_only’, the corresponding biases will be updated during training. Be aware that this means that, even when disabling the adapters, the model will not produce the same output as the base model would have without adaptation.
use_rslora (bool) — When set to True, uses Rank-Stabilized LoRA which sets the adapter scaling factor to lora_alpha/math.sqrt(r), since it was proven to work better. Otherwise, it will use the original default value of lora_alpha/r.
modules_to_save (List[str]) — List of modules apart from adapter layers to be set as trainable and saved in the final checkpoint.
init_lora_weights (bool | Literal["gaussian", "eva", "olora", "pissa", "pissa_niter_[number of iters]", "loftq"]) — How to initialize the weights of the adapter layers. Passing True (default) results in the default initialization from the reference implementation from Microsoft. Passing ‘gaussian’ results in Gaussian initialization scaled by the LoRA rank for linear and layers. Setting the initialization to False leads to completely random initialization and is discouraged. Pass 'loftq' to use LoftQ initialization. Passing 'eva' results in a data-driven initialization of https://arxiv.org/abs/2410.07170’ >Explained Variance Adaptation. EVA initalizes LoRA based on the SVD of layer input activations and achieves SOTA performance due to its ability to adapt to the finetuning data. Pass 'olora' to use OLoRA initialization. Passing 'pissa' results in the initialization of https://arxiv.org/abs/2404.02948’ >Principal Singular values and Singular vectors Adaptation (PiSSA), which converges more rapidly than LoRA and ultimately achieves superior performance. Moreover, PiSSA reduces the quantization error compared to QLoRA, leading to further enhancements. Passing 'pissa_niter_[number of iters]' initiates Fast-SVD-based PiSSA initialization, where [number of iters] indicates the number of subspace iterations to perform FSVD, and must be a nonnegative integer. When [number of iters] is set to 16, it can complete the initialization of a 7B model within seconds, and the training effect is approximately equivalent to using SVD.
layers_to_transform (Union[List[int], int]) — The layer indices to transform. If a list of ints is passed, it will apply the adapter to the layer indices that are specified in this list. If a single integer is passed, it will apply the transformations on the layer at this index.
layers_pattern (Optional[Union[List[str], str]]) — The layer pattern name, used only if layers_to_transform is different from None. This should target the nn.ModuleList of the model, which is often called 'layers' or 'h'.
rank_pattern (dict) — The mapping from layer names or regexp expression to ranks which are different from the default rank specified by r.
alpha_pattern (dict) — The mapping from layer names or regexp expression to alphas which are different from the default alpha specified by lora_alpha.
megatron_config (Optional[dict]) — The TransformerConfig arguments for Megatron. It is used to create LoRA’s parallel linear layer. You can get it like this, core_transformer_config_from_args(get_args()), these two functions being from Megatron. The arguments will be used to initialize the TransformerConfig of Megatron. You need to specify this parameter when you want to apply LoRA to the ColumnParallelLinear and RowParallelLinear layers of megatron.
megatron_core (Optional[str]) — The core module from Megatron to use, defaults to "megatron.core".
loftq_config (Optional[LoftQConfig]) — The configuration of LoftQ. If this is not None, then LoftQ will be used to quantize the backbone weights and initialize Lora layers. Also pass init_lora_weights='loftq'. Note that you should not pass a quantized model in this case, as LoftQ will quantize the model itself.
eva_config (Optional[EvaConfig]) — The configuration of EVA. At a minimum the dataset argument needs to be set (use the same dataset as for finetuning).
use_dora (bool) — Enable ‘Weight-Decomposed Low-Rank Adaptation’ (DoRA). This technique decomposes the updates of the weights into two parts, magnitude and direction. Direction is handled by normal LoRA, whereas the magnitude is handled by a separate learnable parameter. This can improve the performance of LoRA especially at low ranks. Right now, DoRA only supports linear and Conv2D layers. DoRA introduces a bigger overhead than pure LoRA, so it is recommended to merge weights for inference. For more information, see https://arxiv.org/abs/2402.09353.
layer_replication (List[Tuple[int, int]]) — Build a new stack of layers by stacking the original model layers according to the ranges specified. This allows expanding (or shrinking) the model without duplicating the base model weights. The new layers will all have separate LoRA adapters attached to them.
runtime_config (LoraRuntimeConfig) — Runtime configurations (which are not saved or restored).
lora_bias (bool) — Defaults to False. Whether to enable the bias term for the LoRA B parameter. Typically, this should be disabled. The main use case for this is when the LoRA weights were extracted from fully fine-tuned parameters so the bias of those parameters can be taken into account.

This is the configuration class to store the configuration of a LoraModel.

to_dict

( )

Returns the configuration for your adapter model as a dictionary. Removes runtime configurations.

LoraModel

class peft.LoraModel

( model config adapter_name low_cpu_mem_usage: bool = False ) → torch.nn.Module

Parameters

model (torch.nn.Module) — The model to be adapted.
config (LoraConfig) — The configuration of the Lora model.
adapter_name (str) — The name of the adapter, defaults to "default".
low_cpu_mem_usage (bool, optional, defaults to False) — Create empty adapter weights on meta device. Useful to speed up the loading process.

Returns

torch.nn.Module

The Lora model.

Creates Low Rank Adapter (LoRA) model from a pretrained transformers model.

The method is described in detail in https://arxiv.org/abs/2106.09685.

Example:

>>> from transformers import AutoModelForSeq2SeqLM
>>> from peft import LoraModel, LoraConfig

>>> config = LoraConfig(
...     task_type="SEQ_2_SEQ_LM",
...     r=8,
...     lora_alpha=32,
...     target_modules=["q", "v"],
...     lora_dropout=0.01,
... )

>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
>>> lora_model = LoraModel(model, config, "default")

>>> import torch
>>> import transformers
>>> from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training

>>> rank = ...
>>> target_modules = ["q_proj", "k_proj", "v_proj", "out_proj", "fc_in", "fc_out", "wte"]
>>> config = LoraConfig(
...     r=4, lora_alpha=16, target_modules=target_modules, lora_dropout=0.1, bias="none", task_type="CAUSAL_LM"
... )
>>> quantization_config = transformers.BitsAndBytesConfig(load_in_8bit=True)

>>> tokenizer = transformers.AutoTokenizer.from_pretrained(
...     "kakaobrain/kogpt",
...     revision="KoGPT6B-ryan1.5b-float16",  # or float32 version: revision=KoGPT6B-ryan1.5b
...     bos_token="[BOS]",
...     eos_token="[EOS]",
...     unk_token="[UNK]",
...     pad_token="[PAD]",
...     mask_token="[MASK]",
... )
>>> model = transformers.GPTJForCausalLM.from_pretrained(
...     "kakaobrain/kogpt",
...     revision="KoGPT6B-ryan1.5b-float16",  # or float32 version: revision=KoGPT6B-ryan1.5b
...     pad_token_id=tokenizer.eos_token_id,
...     use_cache=False,
...     device_map={"": rank},
...     torch_dtype=torch.float16,
...     quantization_config=quantization_config,
... )
>>> model = prepare_model_for_kbit_training(model)
>>> lora_model = get_peft_model(model, config)

Attributes:

model (PreTrainedModel) — The model to be adapted.
peft_config (LoraConfig): The configuration of the Lora model.

add_weighted_adapter

( adapters: list[str] weights: list[float] adapter_name: str combination_type: str = 'svd' svd_rank: int | None = None svd_clamp: int | None = None svd_full_matrices: bool = True svd_driver: str | None = None density: float | None = None majority_sign_method: Literal['total', 'frequency'] = 'total' )

Parameters

adapters (list) — List of adapter names to be merged.
weights (list) — List of weights for each adapter.
adapter_name (str) — Name of the new adapter.
combination_type (str) — The merging type can be one of [svd, linear, cat, ties, ties_svd, dare_ties, dare_linear, dare_ties_svd, dare_linear_svd, magnitude_prune, magnitude_prune_svd]. When using the cat combination_type, the rank of the resulting adapter is equal to the sum of all adapters ranks (the mixed adapter may be too big and result in OOM errors).
svd_rank (int, optional) — Rank of output adapter for svd. If None provided, will use max rank of merging adapters.
svd_clamp (float, optional) — A quantile threshold for clamping SVD decomposition output. If None is provided, do not perform clamping. Defaults to None.
svd_full_matrices (bool, optional) — Controls whether to compute the full or reduced SVD, and consequently, the shape of the returned tensors U and Vh. Defaults to True.
svd_driver (str, optional) — Name of the cuSOLVER method to be used. This keyword argument only works when merging on CUDA. Can be one of [None, gesvd, gesvdj, gesvda]. For more info please refer to torch.linalg.svd documentation. Defaults to None.
density (float, optional) — Value between 0 and 1. 0 means all values are pruned and 1 means no values are pruned. Should be used with [ties, ties_svd, dare_ties, dare_linear, dare_ties_svd, dare_linear_svd, magnintude_prune, magnitude_prune_svd]
majority_sign_method (str) — The method, should be one of [“total”, “frequency”], to use to get the magnitude of the sign values. Should be used with [ties, ties_svd, dare_ties, dare_ties_svd]

This method adds a new adapter by merging the given adapters with the given weights.

When using the cat combination_type you should be aware that rank of the resulting adapter will be equal to the sum of all adapters ranks. So it’s possible that the mixed adapter may become too big and result in OOM errors.

delete_adapter

( adapter_name: str )

Parameters

adapter_name (str) — Name of the adapter to be deleted.

Deletes an existing adapter.

disable_adapter_layers

( )

Disable all adapters.

When disabling all adapters, the model output corresponds to the output of the base model.

enable_adapter_layers

( )

Enable all adapters.

Call this if you have previously disabled all adapters and want to re-enable them.

merge_and_unload

( progressbar: bool = False safe_merge: bool = False adapter_names: Optional[list[str]] = None )

Parameters

progressbar (bool) — whether to show a progressbar indicating the unload and merge process
safe_merge (bool) — whether to activate the safe merging check to check if there is any potential Nan in the adapter weights
adapter_names (List[str], optional) — The list of adapter names that should be merged. If None, all active adapters will be merged. Defaults to None.

This method merges the LoRa layers into the base model. This is needed if someone wants to use the base model as a standalone model.

Example:

>>> from transformers import AutoModelForCausalLM
>>> from peft import PeftModel

>>> base_model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-40b")
>>> peft_model_id = "smangrul/falcon-40B-int4-peft-lora-sfttrainer-sample"
>>> model = PeftModel.from_pretrained(base_model, peft_model_id)
>>> merged_model = model.merge_and_unload()

set_adapter

( adapter_name: str | list[str] )

Parameters

adapter_name (str or list[str]) — Name of the adapter(s) to be activated.

Set the active adapter(s).

Additionally, this function will set the specified adapters to trainable (i.e., requires_grad=True). If this is not desired, use the following code.

>>> for name, param in model_peft.named_parameters():
...     if ...:  # some check on name (ex. if 'lora' in name)
...         param.requires_grad = False

subtract_mutated_init

( output_state_dict: dict[str, torch.Tensor] adapter_name: str kwargs = None )

This function can calculate the updates of the [PiSSA | OLoRA] by comparing the parameters of the [PiSSA | OLoRA] adapter in output_state_dict with the initial values of [PiSSA | OLoRA] in adapter_name, thus converting [PiSSA | OLoRA] to LoRA.

unload

( )

Gets back the base model by removing all the lora modules without merging. This gives back the original base model.

Utility

LoftQ

peft.replace_lora_weights_loftq

( peft_model model_path: Optional[str] = None adapter_name: str = 'default' callback: Optional[Callable[[torch.nn.Module, str], bool]] = None )

Parameters

peft_model (PeftModel) — The model to replace the weights of. Must be a quantized PEFT model with LoRA layers.
model_path (Optional[str]) — The path to the model safetensors file. If the model is a Hugging Face model, this will be inferred from the model’s config. Otherwise, it must be provided.
adapter_name (str) — The name of the adapter to replace the weights of. The default adapter name is “default”.
callback (Optional[Callable[[PeftModel, str], bool]]) — A callback function that will be called after each module is replaced. The callback function should take the model and the name of the current module as input and return a boolean indicating whether the replacement should be kept. If the callback returns False, the replacement will be rolled back. This can be very useful to confirm that the LoftQ initialization actually decreases the quantization error of the model. As an example, this callback could generate logits for given input and compare it with the logits from the original, non-quanitzed model with the same input, and only return True if there is an improvement. As this is a greedy optimization, it’s possible that calling this function multiple times yields incremental improvements.

Replace the LoRA weights of a model quantized with bitsandbytes, using the LoftQ technique.

The replacement is done on the fly by loading in the non-quantized weights from a locally stored safetensors model file and initializing the LoRA weights such that the quantization error between the original and quantized weights is minimized.

As lazy loading is not possible with pickle, normal PyTorch checkpoint files cannot be supported.

Depending on the model size, calling this function may take some time to finish.

Eva

EvaConfig

class peft.EvaConfig

( rho: float = 2.0 tau: float = 0.99 use_label_mask: bool = True label_mask_value: int = -100 whiten: bool = False adjust_scaling_factors: bool = True )

Parameters

rho (float) — Rho value for EVA redistribution (>= 1.0). The maximum rank for a layer is lora_r * rho. Default is 2.0, meaning the maximum rank allowed for a layer is 2r. Increasing rho will allow for a higher degree of redistribution of ranks across layers. Some pre-trained models might be more sensitive to a rank redistribution. It can therefore be beneficial to try rho=1.0 (no redistribution) if the performance is lower than expected.
tau (float) — Cosine similarity threshold for early stopping. Compares the cosine similarity of right-singular vectors between two consecutive SVD steps. If the cosine similarity is above this threshold, the SVD iteration is stopped. Default is 0.99.
use_label_mask (bool) — Use label mask for EVA initialization. This means that positions where labels=label_mask_value are ignored for the SVD computation. Setting use_label_mask=True is preferred in most cases and can be especially beneficial for multi-turn conversations. The default value is True. Filtering out items based on the label mask can sometimes lead to a small batch size and as a result instabilities in the SVD computation. For cases where a large share of batch items would be filtered out, set use_label_mask=False.
label_mask_value (int) — If use_label_mask=True the value to look for to mask out ignored tokens. Default is -100.
whiten (bool) — Apply whitening to singular vectors. Default is False. Whitening has been shown to be beneficial for EVA in the vision domain.
adjust_scaling_factors (bool) — Adjust LoRA scaling factors after the rank redistribution. Setting this to True means the scaling factors are adjusted so that all LoRA gradients have the same scale regardless of their rank. Default is True.

This is the sub-configuration class to store the configuration for a data-driven initialization via EVA. EVA was introduced in Explained Variance Adaptation.

initialize_lora_eva_weights

peft.initialize_lora_eva_weights

( model: Module dataloader: typing.Optional[typing.Iterable] = None eva_state_dict: typing.Optional[dict] = None forward_fn: typing.Optional[<built-in function callable>] = <function forward_fn_dict at 0x7fd36a5f5f30> prepare_model_inputs_fn: typing.Optional[<built-in function callable>] = <function prepare_model_inputs_fn_language_modeling at 0x7fd36a5f5e10> prepare_layer_inputs_fn: typing.Union[<built-in function callable>, typing.Dict[str, <built-in function callable>], NoneType] = <function prepare_layer_inputs_fn_language_modeling at 0x7fd36a5f5ea0> adapter_name: str = 'default' gather_distributed_inputs: bool = True show_progress_bar: bool = True ) → model (torch.nn.Module)

Parameters

model (PeftModel) — The peft model to compute the SVD for.
dataloader (Optional[Iterable]) — The dataloader to use for the forward pass. If None, eva_state_dict needs to be provided.
eva_state_dict (Optional[dict]) — The state_dict to load into the model. If None, a dataloader needs to be provided and the state_dict will be computed using get_eva_state_dict.
forward_fn (callable) — The forward function to use for the forward pass. Takes two arguments: model and inputs. Default behavior is return model(**inputs)
prepare_model_inputs_fn (Optional[callable]) — This function receives the model inputs and the peft_config and passes the output to prepare_layer_inputs_fn. Can be used to modify the input to the SVD computation based on the original model inputs. For example for language modeling the attention mask is used to determine which indices are padding tokens and should not be used for SVD. Any function defined here expects two arguments: model_input and peft_config. peft.tuners.lora.eva.prepare_model_inputs_fn_language_modeling is used by default.
prepare_layer_inputs_fn (Union[callable, Dict[str, callable], None]) — This function receives the layer inputs, the model inputs (potentially modified by prepare_model_inputs_fn) and the name of the layer and returns the inputs that should be used for SVD for that particular layer. Any custom function defined here expects three arguments: layer_input, model_input, and layer_name and should return a 2d tensor. The default logic can be found in peft.tuners.lora.eva.prepare_layer_inputs_fn_language_modeling and works for language modeling. In this case model_inputs is the mask used to determine which indices should be used for SVD (created by prepare_model_inputs_fn_language_modeling).
adapter_name (str) — The name of the adapter to initialize the weights for.
gather_distributed_inputs (bool) — Whether to gather the layer inputs from all ranks. Default is True meaning in a distributed setting the layer inputs will be gathered from all ranks for the SVD computation. For non-distributed settings this argument is ignored. Set to False if you are using a non-distributed dataloader in a distributed setting.
show_progress_bar (bool) — Whether to show a progress bar. Default is True.

Returns

model (torch.nn.Module)

The model with the initialized LoRA weights.

Initialize the weights of the LoRA layers using the EVA method.

This function initializes the weights of the LoRA layers using the EVA method. It computes the SVD for each adapter layer and updates the weights accordingly.

get_eva_state_dict

peft.get_eva_state_dict

( model: Module dataloader: typing.Iterable peft_config: typing.Optional[peft.tuners.lora.config.LoraConfig] = None forward_fn: typing.Optional[<built-in function callable>] = <function forward_fn_dict at 0x7fd36a5f5f30> prepare_model_inputs_fn: typing.Optional[<built-in function callable>] = <function prepare_model_inputs_fn_language_modeling at 0x7fd36a5f5e10> prepare_layer_inputs_fn: typing.Union[<built-in function callable>, typing.Dict[str, <built-in function callable>], NoneType] = <function prepare_layer_inputs_fn_language_modeling at 0x7fd36a5f5ea0> adapter_name: str = 'default' gather_distributed_inputs: bool = True show_progress_bar: bool = True ) → eva_state_dict (dict)

Parameters

model (torch.nn.Module) — The model to compute the SVD for. Does not need to be a PeftModel.
dataloader (Iterable) — The dataloader to use for the forward pass.
peft_config (Optional[LoraConfig]) — The configuration for the LoRA layers. Only required if model is not a PeftModel.
forward_fn (callable) — The forward function to use for the forward pass. Takes two arguments: model and inputs. Default behavior is return model(**inputs)
prepare_model_inputs_fn (Optional[callable]) — This function receives the model inputs and the peft_config and passes the output to prepare_layer_inputs_fn. Can be used to modify the input to the SVD computation based on the original model inputs. For example for language modeling the attention mask is used to determine which indices are padding tokens and should not be used for SVD. Any function defined here expects two arguments: model_input and peft_config. peft.tuners.lora.eva.prepare_model_inputs_fn_language_modeling is used by default.
prepare_layer_inputs_fn (Union[callable, Dict[str, callable], None]) — This function receives the layer inputs, the model inputs (potentially modified by prepare_model_inputs_fn) and the name of the layer and returns the inputs that should be used for SVD for that particular layer. Any custom function defined here expects three arguments: layer_input, model_input, and layer_name and should return a 2d tensor. The default logic can be found in peft.tuners.lora.eva.prepare_layer_inputs_fn_language_modeling and works for language modeling. In this case model_inputs is the mask used to determine which indices should be used for SVD (created by prepare_model_inputs_fn_language_modeling).
adapter_name (str) — The name of the adapter to compute the SVD for.
gather_distributed_inputs (bool) — Whether to gather the layer inputs from all ranks. Default is True meaning in a distributed setting the layer inputs will be gathered from all ranks for the SVD computation. For non-distributed settings this argument is ignored. Set to False if you are using a non-distributed dataloader in a distributed setting.
show_progress_bar (bool) — Whether to show a progress bar. Default is True.

Returns

eva_state_dict (dict)

The state dictionary containing the SVD components for each layer.

Compute the SVD for each layer in the model.

This function computes the Singular Value Decomposition (SVD) for each layer in the model. It uses the incremental PCA method to compute the SVD components. The function also checks for convergence of the computed components using cosine similarity. The rank distribution for each layer is determined based on the explained variance ratio.

< > Update on GitHub

←LoKr X-LoRA→