Optimization

The .optimization module provides:

  • an optimizer with the weight decay fix that can be used to fine-tune models,
  • several schedules in the form of schedule objects that inherit from _LRSchedule, and
  • a gradient accumulation class to accumulate the gradients of multiple batches.

AdamW (PyTorch)

class transformers.AdamW

( params: typing.Iterable[torch.nn.parameter.Parameter], lr: float = 0.001, betas: typing.Tuple[float, float] = (0.9, 0.999), eps: float = 1e-06, weight_decay: float = 0.0, correct_bias: bool = True )

Implements the Adam algorithm with the weight decay fix introduced in Decoupled Weight Decay Regularization.

step

( closure: typing.Callable = None )

Performs a single optimization step.
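
As a sketch of typical usage (model and batch are assumed to be defined, with batch containing labels; the two-group split that exempts biases and LayerNorm weights from decay is a common fine-tuning convention, not a requirement of the API):

from transformers import AdamW

# Common convention: no weight decay for biases and LayerNorm weights.
no_decay = ["bias", "LayerNorm.weight"]
grouped_parameters = [
    {"params": [p for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)],
     "weight_decay": 0.01},
    {"params": [p for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0},
]
optimizer = AdamW(grouped_parameters, lr=5e-5)

loss = model(**batch).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()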

AdaFactor (PyTorch)

class transformers.Adafactor

( params, lr = None, eps = (1e-30, 0.001), clip_threshold = 1.0, decay_rate = -0.8, beta1 = None, weight_decay = 0.0, scale_parameter = True, relative_step = True, warmup_init = False )

This PyTorch implementation of AdaFactor can be used as a drop-in replacement for Adam. Original fairseq code: https://github.com/pytorch/fairseq/blob/master/fairseq/optim/adafactor.py

Paper: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost (https://arxiv.org/abs/1804.04235)

Note that this optimizer internally adjusts the learning rate depending on the scale_parameter, relative_step and warmup_init options. To use a manual (external) learning rate schedule, you should set scale_parameter=False and relative_step=False.

This implementation handles low-precision (FP16, bfloat16) values, but we have not thoroughly tested it.

Recommended T5 finetuning settings (https://discuss.huggingface.co/t/t5-finetuning-tips/684/3):

  • Training without LR warmup or clip_threshold is not recommended.

  • Disable relative updates.

  • Use scale_parameter=False.

  • Additional optimizer operations like gradient clipping should not be used alongside Adafactor.

Example:

optimizer = Adafactor(model.parameters(), scale_parameter=False, relative_step=False, warmup_init=False, lr=1e-3)

Others reported the following combination to work well:

optimizer = Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)

When using lr=None with Trainer you will most likely need to use the AdafactorSchedule scheduler, as follows:

from transformers import Trainer
from transformers.optimization import Adafactor, AdafactorSchedule

optimizer = Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)
lr_scheduler = AdafactorSchedule(optimizer)
trainer = Trainer(..., optimizers=(optimizer, lr_scheduler))

Usage:

# replace AdamW with Adafactor
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    eps=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.0,
    relative_step=False,
    scale_parameter=False,
    warmup_init=False
)
step

( closure = None )

Performs a single optimization step.

AdamWeightDecay (TensorFlow)

class transformers.AdamWeightDecay

( learning_rate: typing.Union[float, keras.optimizer_v2.learning_rate_schedule.LearningRateSchedule] = 0.001, beta_1: float = 0.9, beta_2: float = 0.999, epsilon: float = 1e-07, amsgrad: bool = False, weight_decay_rate: float = 0.0, include_in_weight_decay: typing.Optional[typing.List[str]] = None, exclude_from_weight_decay: typing.Optional[typing.List[str]] = None, name: str = 'AdamWeightDecay', **kwargs )

Adam enables L2 weight decay and clip_by_global_norm on gradients. Just adding the square of the weights to the loss function is not the correct way of using L2 regularization/weight decay with Adam, since that will interact with the m and v parameters in strange ways as shown in Decoupled Weight Decay Regularization.

Instead we want to decay the weights in a manner that doesn’t interact with the m/v parameters. This is equivalent to adding the square of the weights to the loss with plain (non-momentum) SGD.
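
For example, a minimal construction (the exclusion patterns below are illustrative and should be adjusted to your model's parameter names):

from transformers import AdamWeightDecay

optimizer = AdamWeightDecay(
    learning_rate=2e-5,
    weight_decay_rate=0.01,
    # skip decay for normalization and bias parameters (illustrative patterns)
    exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"],
)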

from_config

( config )

Creates an optimizer from its config with the WarmUp custom object.

transformers.create_optimizer

( init_lr: float, num_train_steps: int, num_warmup_steps: int, min_lr_ratio: float = 0.0, adam_beta1: float = 0.9, adam_beta2: float = 0.999, adam_epsilon: float = 1e-08, weight_decay_rate: float = 0.0, power: float = 1.0, include_in_weight_decay: typing.Optional[typing.List[str]] = None )

Creates an optimizer with a learning rate schedule using a warmup phase followed by a linear decay.
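
A minimal sketch, assuming a 1,000-step run with 10% warmup (the step counts are placeholders); note that create_optimizer returns both the optimizer and the schedule:

from transformers import create_optimizer

optimizer, lr_schedule = create_optimizer(
    init_lr=2e-5,
    num_train_steps=1000,
    num_warmup_steps=100,
    weight_decay_rate=0.01,
)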

Schedules

Learning Rate Schedules (Pytorch)

class transformers.SchedulerType

( value, names = None, module = None, qualname = None, type = None, start = 1 )

An enumeration of the scheduler types supported by get_scheduler.

transformers.get_scheduler

( name: typing.Union[str, transformers.trainer_utils.SchedulerType], optimizer: Optimizer, num_warmup_steps: typing.Optional[int] = None, num_training_steps: typing.Optional[int] = None )

Unified API to get any scheduler from its name.
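
For example, assuming optimizer is an existing torch.optim.Optimizer and the step counts are placeholders:

from transformers import get_scheduler

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=100,
    num_training_steps=1000,
)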

transformers.get_constant_schedule

( optimizer: Optimizer, last_epoch: int = -1 )

Create a schedule with a constant learning rate, using the learning rate set in optimizer.

transformers.get_constant_schedule_with_warmup

( optimizer: Optimizer, num_warmup_steps: int, last_epoch: int = -1 )

Create a schedule with a constant learning rate preceded by a warmup period during which the learning rate increases linearly between 0 and the initial lr set in the optimizer.

transformers.get_cosine_schedule_with_warmup

( optimizer: Optimizer, num_warmup_steps: int, num_training_steps: int, num_cycles: float = 0.5, last_epoch: int = -1 )

Create a schedule with a learning rate that decreases following the values of the cosine function from the initial lr set in the optimizer to 0, after a warmup period during which it increases linearly between 0 and the initial lr set in the optimizer.

transformers.get_cosine_with_hard_restarts_schedule_with_warmup

( optimizer: Optimizer, num_warmup_steps: int, num_training_steps: int, num_cycles: int = 1, last_epoch: int = -1 )

Create a schedule with a learning rate that decreases following the values of the cosine function from the initial lr set in the optimizer to 0, with several hard restarts, after a warmup period during which it increases linearly between 0 and the initial lr set in the optimizer.

transformers.get_linear_schedule_with_warmup

( optimizer, num_warmup_steps, num_training_steps, last_epoch = -1 )

Create a schedule with a learning rate that decreases linearly from the initial lr set in the optimizer to 0, after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.
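
As a sketch of how a schedule is stepped inside a training loop (model and dataloader are assumed to be defined, and the step counts are placeholders); note that the scheduler advances once per optimizer step, not once per epoch:

from transformers import AdamW, get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000
)

for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the LR schedule once per optimizer step
    optimizer.zero_grad()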

transformers.get_polynomial_decay_schedule_with_warmup

( optimizer, num_warmup_steps, num_training_steps, lr_end = 1e-07, power = 1.0, last_epoch = -1 )

Create a schedule with a learning rate that decreases as a polynomial decay from the initial lr set in the optimizer to the end lr defined by lr_end, after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.

Note: power defaults to 1.0 as in the fairseq implementation, which in turn is based on the original BERT implementation at https://github.com/google-research/bert/blob/f39e881b169b9d53bea03d2d341b31707a6c052b/optimization.py#L37

Warmup (TensorFlow)

class transformers.WarmUp

( initial_learning_rate: float, decay_schedule_fn: typing.Callable, warmup_steps: int, power: float = 1.0, name: str = None )

Applies a warmup schedule on a given learning rate decay schedule.
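
For instance, a sketch wrapping a Keras cosine decay (the learning rate and step counts are placeholders):

import tensorflow as tf
from transformers import WarmUp

# decay over 900 steps once the 100 warmup steps are done
decay_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=2e-5, decay_steps=900
)
lr_schedule = WarmUp(
    initial_learning_rate=2e-5,
    decay_schedule_fn=decay_schedule,
    warmup_steps=100,
)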

Gradient Strategies

GradientAccumulator (TensorFlow)

class transformers.GradientAccumulator

( )

Gradient accumulation utility. When used with a distribution strategy, the accumulator should be called in a replica context. Gradients will be accumulated locally on each replica and without synchronization. Users should then call .gradients, scale the gradients if required, and pass the result to apply_gradients.

reset

( )

Resets the accumulated gradients on the current replica.
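
A minimal single-replica sketch (model, loss_fn, optimizer, and batches are assumed to be defined; averaging by the accumulated step count is one scaling choice, not mandated by the API):

import tensorflow as tf
from transformers import GradientAccumulator

accumulator = GradientAccumulator()

for features, labels in batches:
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(features, training=True))
    # accumulate locally on this replica, without synchronization
    accumulator(tape.gradient(loss, model.trainable_variables))

# average over the accumulated batches, then apply and reset
grads = [g / tf.cast(accumulator.step, g.dtype) for g in accumulator.gradients]
optimizer.apply_gradients(zip(grads, model.trainable_variables))
accumulator.reset()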