Optimizer

The .optimization module provides:

  • an optimizer with the weight decay fix that can be used to fine-tune models,

  • several schedules in the form of schedule objects that inherit from _LRSchedule, and

  • a gradient accumulation class to accumulate the gradients of multiple batches.

AdamW

class transformers.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0.0, correct_bias=True)[source]

Implements the Adam algorithm with the weight decay fix.

Parameters
  • lr (float) – Learning rate. Default: 1e-3

  • betas (tuple of 2 floats) – Adam's beta parameters (b1, b2). Default: (0.9, 0.999)

  • eps (float) – Adam's epsilon. Default: 1e-6

  • weight_decay (float) – Weight decay. Default: 0.0

  • correct_bias (bool) – Can be set to False to avoid correcting bias in Adam (e.g. as in the BERT TF repository). Default: True

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.
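To make the "weight decay fix" concrete, here is a minimal scalar sketch of one update step. It assumes the usual Adam moment estimates and a decay term applied directly to the parameter instead of being folded into the gradient. The helper name adamw_scalar_step and its (p, m, v) return value are hypothetical and exist only for illustration; this is not the library's implementation.

```python
import math

def adamw_scalar_step(p, grad, m, v, step, lr=1e-3, betas=(0.9, 0.999),
                      eps=1e-6, weight_decay=0.0, correct_bias=True):
    """One illustrative AdamW update on a single float parameter.

    Hypothetical helper: returns the new (p, m, v). `step` counts from 1.
    """
    beta1, beta2 = betas
    # Exponential moving averages of the gradient and of its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    step_size = lr
    if correct_bias:
        # Bias correction compensates for m and v starting at zero.
        step_size *= math.sqrt(1.0 - beta2 ** step) / (1.0 - beta1 ** step)
    p = p - step_size * m / (math.sqrt(v) + eps)
    # Decoupled decay: shrink the weight directly, bypassing m and v.
    p = p - lr * weight_decay * p
    return p, m, v
```

With a zero gradient the moment estimates stay at zero, so only the decay term moves the parameter, which is what "decoupled" means in this sketch.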

AdamWeightDecay

class transformers.AdamWeightDecay(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, weight_decay_rate=0.0, include_in_weight_decay=None, exclude_from_weight_decay=None, name='AdamWeightDecay', **kwargs)[source]

Adam enables L2 weight decay and clip_by_global_norm on gradients.

Just adding the square of the weights to the loss function is not the correct way of using L2 regularization/weight decay with Adam, since that will interact with the m and v parameters in strange ways.

Instead we want to decay the weights in a manner that doesn’t interact with the m/v parameters. This is equivalent to adding the square of the weights to the loss with plain (non-momentum) SGD.
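The equivalence with plain SGD can be checked numerically. The sketch below assumes the penalty added to the loss is (decay / 2) * w**2, so that its gradient is decay * w; the function names are hypothetical and chosen for illustration.

```python
def sgd_step_l2_in_loss(w, grad, lr, decay):
    # Gradient of loss + (decay / 2) * w**2 is grad + decay * w.
    return w - lr * (grad + decay * w)

def sgd_step_decoupled(w, grad, lr, decay):
    # Take the plain gradient step, then shrink the weight directly.
    return w - lr * grad - lr * decay * w

w, grad, lr, decay = 2.0, 0.5, 0.1, 0.01
a = sgd_step_l2_in_loss(w, grad, lr, decay)
b = sgd_step_decoupled(w, grad, lr, decay)
print(abs(a - b) < 1e-12)
```

With Adam the two forms differ, because the penalty gradient would be rescaled by the m and v estimates while the decoupled term is not.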

apply_gradients(grads_and_vars, clip_norm, name=None)[source]

Apply gradients to variables.

This is the second part of minimize(). It returns an Operation that applies gradients.

Parameters
  • grads_and_vars – List of (gradient, variable) pairs.

  • name – Optional name for the returned operation. Defaults to the name passed to the Optimizer constructor.

Returns

An Operation that applies the specified gradients. The iterations will be automatically increased by 1.

Raises
  • TypeError – If grads_and_vars is malformed.

  • ValueError – If none of the variables have gradients.

classmethod from_config(config)[source]

Creates an optimizer from its config with WarmUp custom object.

get_config()[source]

Returns the config of the optimizer.

An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.

Returns

Python dictionary.

transformers.create_optimizer(init_lr, num_train_steps, num_warmup_steps)[source]

Creates an optimizer with learning rate schedule.

Schedules

Learning Rate Schedules

transformers.get_constant_schedule(optimizer, last_epoch=-1)[source]

Create a schedule with a constant learning rate.

transformers.get_constant_schedule_with_warmup(optimizer, num_warmup_steps, last_epoch=-1)[source]

Create a schedule with a constant learning rate preceded by a warmup period during which the learning rate increases linearly from 0 to the initial learning rate set in the optimizer.
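A minimal sketch of the multiplier such a schedule could apply to the optimizer's initial learning rate at each step. The function name is hypothetical, and the sketch assumes the schedule is expressed as a factor between 0 and 1.

```python
def constant_with_warmup_factor(current_step, num_warmup_steps):
    """Hypothetical helper: learning rate multiplier at `current_step`."""
    if current_step < num_warmup_steps:
        # Linear ramp from 0 to 1 over the warmup period.
        return float(current_step) / float(max(1, num_warmup_steps))
    # After warmup the multiplier stays at 1.
    return 1.0
```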

transformers.get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, num_cycles=0.5, last_epoch=-1)[source]

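Create a schedule with a learning rate that decreases following the values of the cosine function, from the initial learning rate set in the optimizer to 0 with the default num_cycles=0.5, after a warmup period during which it increases linearly from 0 to the initial learning rate set in the optimizer.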
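The shape of the cosine schedule can be sketched as a multiplier on the initial learning rate. The function name is hypothetical; the sketch assumes that num_cycles counts full cosine waves, so the default of 0.5 covers half a wave and ends at 0.

```python
import math

def cosine_with_warmup_factor(current_step, num_warmup_steps,
                              num_training_steps, num_cycles=0.5):
    """Hypothetical helper: learning rate multiplier at `current_step`."""
    if current_step < num_warmup_steps:
        # Linear ramp from 0 to 1 over the warmup period.
        return float(current_step) / float(max(1, num_warmup_steps))
    # Fraction of the post-warmup phase that has elapsed.
    progress = float(current_step - num_warmup_steps) / float(
        max(1, num_training_steps - num_warmup_steps))
    # Cosine argument runs from 0 to 2 * pi * num_cycles.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * float(num_cycles) * 2.0 * progress)))
```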

transformers.get_cosine_with_hard_restarts_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, num_cycles=1.0, last_epoch=-1)[source]

Create a schedule with a learning rate that decreases following the values of the cosine function with several hard restarts, after a warmup period during which it increases linearly from 0 to the initial learning rate set in the optimizer.
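A sketch of how hard restarts could be expressed as a multiplier on the initial learning rate. The function name is hypothetical; the sketch assumes each of the num_cycles cycles decays from 1 to 0 along half a cosine wave and then jumps back to 1.

```python
import math

def cosine_hard_restarts_factor(current_step, num_warmup_steps,
                                num_training_steps, num_cycles=1.0):
    """Hypothetical helper: learning rate multiplier at `current_step`."""
    if current_step < num_warmup_steps:
        # Linear ramp from 0 to 1 over the warmup period.
        return float(current_step) / float(max(1, num_warmup_steps))
    progress = float(current_step - num_warmup_steps) / float(
        max(1, num_training_steps - num_warmup_steps))
    if progress >= 1.0:
        return 0.0
    # The modulo resets the position within the cycle, producing the restart.
    cycle_position = (float(num_cycles) * progress) % 1.0
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycle_position)))
```

With num_cycles=2 and no warmup, the multiplier returns to 1 at the halfway point of training before decaying again.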

transformers.get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, last_epoch=-1)[source]

Create a schedule with a learning rate that decreases linearly after linearly increasing during a warmup period.
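As a rough illustration, the linear schedule can be written as a multiplier on the initial learning rate. The function name is hypothetical, and the sketch assumes the decay ends at 0 when num_training_steps is reached.

```python
def linear_with_warmup_factor(current_step, num_warmup_steps, num_training_steps):
    """Hypothetical helper: learning rate multiplier at `current_step`."""
    if current_step < num_warmup_steps:
        # Linear ramp from 0 to 1 over the warmup period.
        return float(current_step) / float(max(1, num_warmup_steps))
    # Linear decay from 1 to 0 over the remaining steps, clamped at 0.
    return max(0.0, float(num_training_steps - current_step)
               / float(max(1, num_training_steps - num_warmup_steps)))
```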

Warmup

class transformers.WarmUp(initial_learning_rate, decay_schedule_fn, warmup_steps, power=1.0, name=None)[source]

Applies a warmup schedule on a given learning rate decay schedule.
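One way to read the constructor arguments is sketched below in plain Python, without TensorFlow. The function name is hypothetical. The sketch assumes a polynomial ramp controlled by power during warmup and assumes the wrapped decay schedule receives the raw step afterwards; the library may differ in both respects.

```python
def warmup_learning_rate(step, initial_learning_rate, decay_schedule_fn,
                         warmup_steps, power=1.0):
    """Hypothetical helper: learning rate at `step` under a warmup wrapper."""
    if step < warmup_steps:
        # Polynomial ramp: fraction of warmup completed, raised to `power`.
        warmup_fraction = float(step) / float(warmup_steps)
        return initial_learning_rate * warmup_fraction ** power
    # After warmup, defer to the wrapped decay schedule (assumed to take the raw step).
    return decay_schedule_fn(step)

# Example with a constant "decay" schedule standing in for a real one.
constant_decay = lambda step: 0.001
rates = [warmup_learning_rate(s, 0.001, constant_decay, warmup_steps=10)
         for s in (0, 5, 10)]
```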

Gradient Strategies

GradientAccumulator

class transformers.GradientAccumulator[source]

Distribution strategies-aware gradient accumulation utility.
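The idea of accumulating gradients over several batches before applying them can be sketched in plain Python. The class below is hypothetical and is not the library's GradientAccumulator: it works on lists of floats, ignores distribution strategies, and its method names are assumptions made for illustration.

```python
class SimpleGradientAccumulator:
    """Hypothetical plain-Python accumulator, for illustration only."""

    def __init__(self):
        self._gradients = []
        self._step = 0

    @property
    def step(self):
        # Number of batches accumulated since the last reset.
        return self._step

    @property
    def gradients(self):
        # Return a copy so callers cannot mutate the internal state.
        return list(self._gradients)

    def __call__(self, gradients):
        if not self._gradients:
            self._gradients = [0.0] * len(gradients)
        if len(gradients) != len(self._gradients):
            raise ValueError("Expected %d gradients, got %d"
                             % (len(self._gradients), len(gradients)))
        # Element-wise sum with the running totals.
        self._gradients = [a + g for a, g in zip(self._gradients, gradients)]
        self._step += 1

    def reset(self):
        self._gradients = []
        self._step = 0


accumulator = SimpleGradientAccumulator()
accumulator([1.0, 2.0])   # gradients from the first batch
accumulator([3.0, 4.0])   # gradients from the second batch
averaged = [g / accumulator.step for g in accumulator.gradients]
```

Averaging by the step count before the optimizer update approximates a single step on a batch twice as large; whether to average or sum is a design choice left to the training loop.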