Optimization

This page contains the API reference documentation for the optimizers included in timm.

Optimizers

Factory functions

timm.optim.create_optimizer

( args model filter_bias_and_bn = True )

Legacy optimizer factory for backwards compatibility. NOTE: Use create_optimizer_v2 for new code.

timm.optim.create_optimizer_v2

( model_or_params opt: str = 'sgd' lr: Optional = None weight_decay: float = 0.0 momentum: float = 0.9 foreach: Optional = None filter_bias_and_bn: bool = True layer_decay: Optional = None param_group_fn: Optional = None **kwargs )

Parameters

model_or_params (nn.Module) — model containing parameters to optimize
opt — name of optimizer to create
lr — initial learning rate
weight_decay — weight decay to apply in optimizer
momentum — momentum for momentum based optimizers (others may use betas via kwargs)
foreach — Enable / disable foreach (multi-tensor) operation if True / False. Choose safe default if None
filter_bias_and_bn — filter out bias, bn and other 1d params from weight decay
**kwargs — extra optimizer specific kwargs to pass through

Create an optimizer.

TODO: currently the model is passed in and all parameters are selected for optimization. More general use needs an interface that allows selection of parameters to optimize and lr groups, one of:

a filter fn interface that further breaks params into groups in a weight_decay compatible fashion
expose the parameters interface and leave it up to caller
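
The filter_bias_and_bn behaviour and the "filter fn" option above can be pictured as a grouping function. The following is a minimal sketch, not timm's implementation: it assumes parameters are described by (name, shape) pairs, and it assumes that "bias, bn and other 1d params" means names ending in .bias or shapes with at most one dimension. The helper name split_for_weight_decay is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

NamedShape = Tuple[str, Tuple[int, ...]]


def split_for_weight_decay(
    named_shapes: Sequence[NamedShape],
    weight_decay: float,
) -> List[Dict[str, object]]:
    """Hypothetical helper: build two param groups, one without weight decay."""
    decay: List[str] = []
    no_decay: List[str] = []
    for name, shape in named_shapes:
        # Assumption: biases and 1d tensors (e.g. norm scales) skip weight decay.
        if len(shape) <= 1 or name.endswith(".bias"):
            no_decay.append(name)
        else:
            decay.append(name)
    return [
        {"params": no_decay, "weight_decay": 0.0},
        {"params": decay, "weight_decay": weight_decay},
    ]


groups = split_for_weight_decay(
    [
        ("conv.weight", (16, 3, 3, 3)),
        ("bn.weight", (16,)),
        ("bn.bias", (16,)),
        ("fc.weight", (10, 16)),
        ("fc.bias", (10,)),
    ],
    weight_decay=0.05,
)
```

Here the two norm parameters and the fc bias land in the group with weight_decay 0.0, while the conv and fc weights keep the requested decay. A list of dicts in this shape is what the "params" arguments below describe as "dicts defining parameter groups".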

Optimizer Classes

class timm.optim.AdaBelief

( params lr = 0.001 betas = (0.9, 0.999) eps = 1e-16 weight_decay = 0 amsgrad = False decoupled_decay = True fixed_decay = False rectify = True degenerated_to_sgd = True )

Parameters

params (iterable) — iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) — learning rate (default: 1e-3)
betas (Tuple[float, float], optional) — coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional) — term added to the denominator to improve numerical stability (default: 1e-16)
weight_decay (float, optional) — weight decay (L2 penalty) (default: 0)
amsgrad (boolean, optional) — whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)
decoupled_decay (boolean, optional) — (default: True) If set as True, then the optimizer uses decoupled weight decay as in AdamW
fixed_decay (boolean, optional) — (default: False) This is used when decoupled_decay is set as True. When fixed_decay == True, the weight decay is performed as $W_{new} = W_{old} - W_{old} \times decay$. When fixed_decay == False, the weight decay is performed as $W_{new} = W_{old} - W_{old} \times decay \times lr$. Note that in this case, the weight decay ratio decreases with learning rate (lr).
rectify (boolean, optional) — (default: True) If set as True, then perform the rectified update similar to RAdam
degenerated_to_sgd (boolean, optional) — (default: True) If set as True, then perform SGD update when variance of gradient is high
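
The two fixed_decay formulas can be checked with a small worked example. This is an illustrative sketch on a single scalar weight, covering only the decay term and not the gradient-based part of the update; the function name decoupled_decay_step is hypothetical.

```python
import math


def decoupled_decay_step(w: float, decay: float, lr: float, fixed_decay: bool) -> float:
    """Apply only the decoupled weight-decay part of an update to a scalar weight."""
    if fixed_decay:
        # W_new = W_old - W_old * decay
        return w - w * decay
    # W_new = W_old - W_old * decay * lr
    return w - w * decay * lr


w_fixed = decoupled_decay_step(2.0, decay=0.01, lr=0.1, fixed_decay=True)
w_scaled = decoupled_decay_step(2.0, decay=0.01, lr=0.1, fixed_decay=False)
```

With decay = 0.01 and lr = 0.1, the fixed variant removes 1% of the weight per step, while the default variant removes 0.1%, so the effective decay shrinks as the learning rate is annealed.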

Implements AdaBelief algorithm. Modified from Adam in PyTorch

reference: AdaBelief Optimizer, adapting stepsizes by the belief in observed gradients, NeurIPS 2020

For a complete table of recommended hyperparameters, see https://github.com/juntang-zhuang/Adabelief-Optimizer

For example train/args for EfficientNet, see these gists:

link to train_script: https://gist.github.com/juntang-zhuang/0a501dd51c02278d952cf159bc233037
link to args.yaml: https://gist.github.com/juntang-zhuang/517ce3c27022b908bb93f78e4f786dc3

step

( closure = None )

Parameters

closure (callable, optional) — A closure that reevaluates the model and returns the loss.

Performs a single optimization step.
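
The closure contract shared by the step methods on this page can be sketched with a toy optimizer. ToySGD below is a hypothetical stand-in written in plain Python, not a timm or PyTorch class; it only assumes what the parameter description states, namely that the closure reevaluates the model and returns the loss.

```python
from typing import Callable, List, Optional


class ToySGD:
    """Hypothetical stand-in optimizer for a single scalar weight (not a timm class)."""

    def __init__(self, lr: float) -> None:
        self.lr = lr
        self.weight = 0.0
        self.grad = 0.0

    def step(self, closure: Optional[Callable[[], float]] = None) -> Optional[float]:
        loss = None
        if closure is not None:
            # The closure recomputes the loss (and gradient) before the update.
            loss = closure()
        self.weight -= self.lr * self.grad
        return loss


opt = ToySGD(lr=0.1)
target = 3.0


def closure() -> float:
    # Loss is (w - target)^2, so its gradient is 2 * (w - target).
    opt.grad = 2.0 * (opt.weight - target)
    return (opt.weight - target) ** 2


losses: List[Optional[float]] = [opt.step(closure) for _ in range(3)]
```

Each call returns the loss computed by the closure before the weight update; calling step() without a closure returns None and reuses whatever gradient is already stored.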

class timm.optim.Adafactor

( params lr = None eps = 1e-30 eps_scale = 0.001 clip_threshold = 1.0 decay_rate = -0.8 betas = None weight_decay = 0.0 scale_parameter = True warmup_init = False )

Parameters

params (iterable) — iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) — external learning rate (default: None)
eps (float) — regularization constant for square gradient (default: 1e-30)
eps_scale (float) — regularization constant for parameter scale (default: 1e-3)
clip_threshold (float) — threshold of root mean square of final gradient update (default: 1.0)
decay_rate (float) — coefficient used to compute running averages of square gradient (default: -0.8)
betas (optional) — supplies beta1, the coefficient used for computing running averages of gradient (default: None)
weight_decay (float, optional) — weight decay (L2 penalty) (default: 0)
scale_parameter (bool) — if True, learning rate is scaled by root mean square of parameter (default: True)
warmup_init (bool) — time-dependent learning rate computation depends on whether warm-up initialization is being used (default: False)

Implements Adafactor algorithm. This implementation is based on: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost (see https://arxiv.org/abs/1804.04235)

Note that this optimizer internally adjusts the learning rate depending on the scale_parameter, relative_step and warmup_init options.

To use a manual (external) learning rate schedule you should set scale_parameter=False and relative_step=False.
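
The internal learning-rate adjustment can be sketched as a small function. This is a hedged illustration in the style of the Adafactor paper and common reference implementations, not a copy of timm's code: the constants 1e-6 and 1e-2, the function name relative_step_lr, and the param_rms argument (root mean square of the parameter tensor) are assumptions made for the example.

```python
import math


def relative_step_lr(
    step: int,
    param_rms: float,
    scale_parameter: bool = True,
    warmup_init: bool = False,
    eps_scale: float = 1e-3,
) -> float:
    """Sketch of a relative-step learning rate for one parameter tensor."""
    # Assumption: with warm-up the cap ramps linearly, otherwise it is fixed.
    min_step = 1e-6 * step if warmup_init else 1e-2
    lr_t = min(min_step, 1.0 / math.sqrt(step))
    if scale_parameter:
        # Scale by the parameter RMS, floored at eps_scale.
        lr_t *= max(eps_scale, param_rms)
    return lr_t


lr_first = relative_step_lr(step=1, param_rms=0.5)
lr_late = relative_step_lr(step=40000, param_rms=1.0, scale_parameter=False)
lr_warm = relative_step_lr(step=100, param_rms=1.0, warmup_init=True)
```

The sketch shows why the options interact: scale_parameter multiplies the step-dependent rate by the parameter scale, and warmup_init replaces the fixed cap with a ramp.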

step

( closure = None )

Parameters

closure (callable, optional) — A closure that reevaluates the model and returns the loss.

Performs a single optimization step.

class timm.optim.Adahessian

( params lr = 0.1 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0.0 hessian_power = 1.0 update_each = 1 n_samples = 1 avg_conv_kernel = False )

Parameters

params (iterable) — iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) — learning rate (default: 0.1)
betas ((float, float), optional) — coefficients used for computing running averages of gradient and the squared hessian trace (default: (0.9, 0.999))
eps (float, optional) — term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) — weight decay (L2 penalty) (default: 0.0)
hessian_power (float, optional) — exponent of the hessian trace (default: 1.0)
update_each (int, optional) — compute the hessian trace approximation only after this number of steps (to save time) (default: 1)
n_samples (int, optional) — how many times to sample z for the approximation of the hessian trace (default: 1)

Implements the AdaHessian algorithm from “ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning”

get_params

( )

Gets all parameters in all param_groups with gradients

set_hessian

( )

Computes the Hutchinson approximation of the hessian trace and accumulates it for each trainable parameter.
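
The idea behind the Hutchinson approximation can be illustrated without autograd. The sketch below assumes an explicit Hessian matrix, which a real optimizer would never materialize (it would use Hessian-vector products instead), and estimates the per-parameter diagonal entries as the average of z * (H z) over random vectors z with entries +1 or -1. The function name hutchinson_diag is hypothetical; n_samples plays the same role as the constructor argument of that name.

```python
import random
from typing import List


def hutchinson_diag(hessian: List[List[float]], n_samples: int, seed: int = 0) -> List[float]:
    """Estimate diag(H) as the average of z * (H z) over random +/-1 vectors z."""
    rng = random.Random(seed)
    n = len(hessian)
    acc = [0.0] * n
    for _ in range(n_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        # Hessian-vector product, written out explicitly for this toy case.
        hz = [sum(hessian[i][j] * z[j] for j in range(n)) for i in range(n)]
        for i in range(n):
            acc[i] += z[i] * hz[i]
    return [a / n_samples for a in acc]


# A diagonal Hessian is recovered exactly from a single sample.
exact = hutchinson_diag([[2.0, 0.0], [0.0, 3.0]], n_samples=1)
# Off-diagonal terms add noise that averages out over many samples.
approx = hutchinson_diag([[2.0, 1.0], [1.0, 3.0]], n_samples=10000)
```

More samples reduce the noise contributed by off-diagonal entries, at the cost of extra Hessian-vector products per step.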

step

( closure = None )

Parameters

closure (callable, optional) — a closure that reevaluates the model and returns the loss (default: None)

Performs a single optimization step.

zero_hessian

( )

Zeros out the accumulated hessian traces.

class timm.optim.AdamP

( params lr = 0.001 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0 delta = 0.1 wd_ratio = 0.1 nesterov = False )

class timm.optim.AdamW

( params lr = 0.001 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0.01 amsgrad = False )

Parameters

params (iterable) — iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) — learning rate (default: 1e-3)
betas (Tuple[float, float], optional) — coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional) — term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) — weight decay coefficient (default: 1e-2)
amsgrad (boolean, optional) — whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)

Implements AdamW algorithm.

The original Adam algorithm was proposed in Adam: A Method for Stochastic Optimization. The AdamW variant was proposed in Decoupled Weight Decay Regularization.

Adam: A Method for Stochastic Optimization: https://arxiv.org/abs/1412.6980
Decoupled Weight Decay Regularization: https://arxiv.org/abs/1711.05101
On the Convergence of Adam and Beyond: https://openreview.net/forum?id=ryQu7f-RZ
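
A single AdamW update on one scalar parameter makes the "decoupled" part concrete. This is a minimal sketch following the update rule in the Decoupled Weight Decay Regularization paper, without the AMSGrad variant; the function name adamw_scalar_step is hypothetical and the defaults mirror the signature above.

```python
import math
from typing import Tuple


def adamw_scalar_step(
    p: float,
    g: float,
    m: float,
    v: float,
    step: int,
    lr: float = 1e-3,
    betas: Tuple[float, float] = (0.9, 0.999),
    eps: float = 1e-8,
    weight_decay: float = 1e-2,
) -> Tuple[float, float, float]:
    """One AdamW step for a scalar; returns the new (param, exp_avg, exp_avg_sq)."""
    beta1, beta2 = betas
    # Decoupled weight decay: shrink the weight directly, not via the gradient.
    p = p * (1.0 - lr * weight_decay)
    # Running averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    # Bias correction for the zero-initialized averages.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v


p1, m1, v1 = adamw_scalar_step(p=1.0, g=1.0, m=0.0, v=0.0, step=1)
```

Because the decay multiplies the weight directly, it is not rescaled by the adaptive denominator, which is the difference from adding an L2 penalty to the gradient.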

step

( closure = None )

Parameters

closure (callable, optional) — A closure that reevaluates the model and returns the loss.

Performs a single optimization step.

class timm.optim.Lamb

( params lr = 0.001 bias_correction = True betas = (0.9, 0.999) eps = 1e-06 weight_decay = 0.01 grad_averaging = True max_grad_norm = 1.0 trust_clip = False always_adapt = False )

Parameters

params (iterable) — iterable of parameters to optimize or dicts defining parameter groups.
lr (float, optional) — learning rate. (default: 1e-3)
betas (Tuple[float, float], optional) — coefficients used for computing running averages of gradient and its norm. (default: (0.9, 0.999))
eps (float, optional) — term added to the denominator to improve numerical stability. (default: 1e-6)
weight_decay (float, optional) — weight decay (L2 penalty) (default: 0.01)
grad_averaging (bool, optional) — whether to apply (1-beta2) to grad when calculating running averages of gradient. (default: True)
max_grad_norm (float, optional) — value used to clip global grad norm (default: 1.0)
trust_clip (bool) — enable LAMBC trust ratio clipping (default: False)
always_adapt (boolean, optional) — Apply adaptive learning rate to 0.0 weight decay parameter (default: False)

Implements a pure PyTorch variant of the FusedLAMB (NvLamb variant) optimizer from apex.optimizers.FusedLAMB. Reference: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/Transformer-XL/pytorch/lamb.py

LAMB was proposed in Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.

Large Batch Optimization for Deep Learning - Training BERT in 76 minutes: https://arxiv.org/abs/1904.00962
On the Convergence of Adam and Beyond: https://openreview.net/forum?id=ryQu7f-RZ
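
The layer-wise trust ratio at the heart of LAMB, and the effect of trust_clip, can be sketched in plain Python. This is an illustration based on the LAMB paper rather than the exact code path of this class: the function names are hypothetical, and the fallback to 1.0 for zero norms is an assumption of the sketch.

```python
import math
from typing import Sequence


def l2_norm(values: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in values))


def lamb_trust_ratio(
    weights: Sequence[float],
    update: Sequence[float],
    trust_clip: bool = False,
) -> float:
    """Ratio of weight norm to update norm for one layer."""
    w_norm = l2_norm(weights)
    u_norm = l2_norm(update)
    # Assumption: fall back to 1.0 when either norm is zero.
    ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    if trust_clip:
        # Clipping caps the ratio at 1.0.
        ratio = min(ratio, 1.0)
    return ratio


unclipped = lamb_trust_ratio([3.0, 4.0], [0.6, 0.8])
clipped = lamb_trust_ratio([3.0, 4.0], [0.6, 0.8], trust_clip=True)
degenerate = lamb_trust_ratio([3.0, 4.0], [0.0, 0.0])
```

The Adam-style update for the layer is multiplied by this ratio, so layers whose weights are large relative to their update take proportionally larger steps unless trust_clip is enabled.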

step

( closure = None )

Parameters

closure (callable, optional) — A closure that reevaluates the model and returns the loss.

Performs a single optimization step.

class timm.optim.Lars

( params lr = 1.0 momentum = 0 dampening = 0 weight_decay = 0 nesterov = False trust_coeff = 0.001 eps = 1e-08 trust_clip = False always_adapt = False )

Parameters

params (iterable) — iterable of parameters to optimize or dicts defining parameter groups.
lr (float, optional) — learning rate (default: 1.0).
momentum (float, optional) — momentum factor (default: 0)
weight_decay (float, optional) — weight decay (L2 penalty) (default: 0)
dampening (float, optional) — dampening for momentum (default: 0)
nesterov (bool, optional) — enables Nesterov momentum (default: False)
trust_coeff (float) — trust coefficient for computing adaptive lr / trust_ratio (default: 0.001)
eps (float) — eps for division denominator (default: 1e-8)
trust_clip (bool) — enable LARC trust ratio clipping (default: False)
always_adapt (bool) — always apply LARS LR adapt, otherwise only when group weight_decay != 0 (default: False)

LARS for PyTorch

Paper: Large batch training of Convolutional Networks - https://arxiv.org/pdf/1708.03888.pdf
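
The roles of trust_coeff, eps and trust_clip can be sketched with the local learning-rate formula from the LARS paper. This is a hedged illustration, not this class's code: the function names are hypothetical, and the LARC-style clipping shown (dividing by lr and capping at 1.0) and the fallback to 1.0 for zero norms are assumptions of the sketch.

```python
import math
from typing import Sequence


def l2_norm(values: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in values))


def lars_trust_ratio(
    weights: Sequence[float],
    grad: Sequence[float],
    lr: float = 1.0,
    weight_decay: float = 0.0,
    trust_coeff: float = 0.001,
    eps: float = 1e-8,
    trust_clip: bool = False,
) -> float:
    """Per-layer scaling factor applied to the gradient before the SGD update."""
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grad)
    if w_norm > 0 and g_norm > 0:
        ratio = trust_coeff * w_norm / (g_norm + w_norm * weight_decay + eps)
    else:
        # Assumption: leave the gradient unscaled when a norm is zero.
        ratio = 1.0
    if trust_clip:
        # Assumed LARC-style clipping: the effective rate lr * ratio never exceeds lr.
        ratio = min(ratio / lr, 1.0)
    return ratio


plain = lars_trust_ratio([3.0, 4.0], [0.3, 0.4])
clipped = lars_trust_ratio([3.0, 4.0], [0.3, 0.4], lr=0.005, trust_clip=True)
zero_grad = lars_trust_ratio([3.0, 4.0], [0.0, 0.0])
```

With a weight norm of 5 and a gradient norm of 0.5, the unclipped ratio is roughly trust_coeff * 10.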

step

( closure = None )

Parameters

closure (callable, optional) — A closure that reevaluates the model and returns the loss.

Performs a single optimization step.

class timm.optim.Lookahead

( base_optimizer alpha = 0.5 k = 6 )
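
In the Lookahead algorithm as published, alpha is the step size of the slow weights and k is the number of fast-optimizer steps between synchronizations. The sketch below illustrates that rule on a single scalar, assuming plain gradient descent as the inner optimizer; it does not wrap a base_optimizer as this class does, and the function name lookahead_sgd is hypothetical.

```python
from typing import Callable, Tuple


def lookahead_sgd(
    grad_fn: Callable[[float], float],
    w0: float,
    lr: float,
    steps: int,
    alpha: float = 0.5,
    k: int = 6,
) -> Tuple[float, float]:
    """Run `steps` fast SGD steps with a Lookahead sync every k steps."""
    fast = w0
    slow = w0
    for step in range(1, steps + 1):
        # Fast weights: ordinary inner-optimizer step.
        fast -= lr * grad_fn(fast)
        if step % k == 0:
            # Slow weights move a fraction alpha toward the fast weights,
            # then the fast weights restart from the slow weights.
            slow += alpha * (fast - slow)
            fast = slow
    return fast, slow


# Minimize w^2 (gradient 2w) starting from w = 1.
synced = lookahead_sgd(lambda w: 2.0 * w, w0=1.0, lr=0.25, steps=2, alpha=0.5, k=2)
unsynced = lookahead_sgd(lambda w: 2.0 * w, w0=1.0, lr=0.25, steps=1, alpha=0.5, k=2)
```

After two fast steps the fast weight reaches 0.25, and the sync pulls the slow weight halfway from 1.0 toward it.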

class timm.optim.MADGRAD

( params: Any lr: float = 0.01 momentum: float = 0.9 weight_decay: float = 0 eps: float = 1e-06 decoupled_decay: bool = False )

Parameters

params (iterable) — Iterable of parameters to optimize or dicts defining parameter groups.
lr (float) — Learning rate (default: 1e-2).
momentum (float) — Momentum value in the range [0,1) (default: 0.9).
weight_decay (float) — Weight decay, i.e. a L2 penalty (default: 0).
eps (float) — Term added to the denominator outside of the root operation to improve numerical stability. (default: 1e-6).

MADGRAD: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization.

MADGRAD: https://arxiv.org/abs/2101.11075

MADGRAD is a general purpose optimizer that can be used in place of SGD or Adam and may converge faster and generalize better. Currently GPU-only. Typically, the same learning rate schedule that is used for SGD or Adam may be used. The overall learning rate is not comparable to either method and should be determined by a hyper-parameter sweep.

MADGRAD requires less weight decay than other methods, often as little as zero. Momentum values used for SGD or Adam’s beta1 should work here also.

On sparse problems both weight_decay and momentum should be set to 0.

step

( closure: Optional = None )

Parameters

closure (callable, optional) — A closure that reevaluates the model and returns the loss.

Performs a single optimization step.

class timm.optim.Nadam

( params lr = 0.002 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0 schedule_decay = 0.004 )

Parameters

params (iterable) — iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) — learning rate (default: 2e-3)
betas (Tuple[float, float], optional) — coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional) — term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) — weight decay (L2 penalty) (default: 0)
schedule_decay (float, optional) — momentum schedule decay (default: 4e-3)

Implements Nadam algorithm (a variant of Adam based on Nesterov momentum).

It has been proposed in Incorporating Nesterov Momentum into Adam.

http://cs229.stanford.edu/proj2015/054_report.pdf
http://www.cs.toronto.edu/~fritz/absps/momentum.pdf

Originally taken from: https://github.com/pytorch/pytorch/pull/1408

NOTE: Has potential issues but does work well on some problems.

step

( closure = None )

Parameters

closure (callable, optional) — A closure that reevaluates the model and returns the loss.

Performs a single optimization step.

class timm.optim.NvNovoGrad

( params lr = 0.001 betas = (0.95, 0.98) eps = 1e-08 weight_decay = 0 grad_averaging = False amsgrad = False )

Parameters

params (iterable) — iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) — learning rate (default: 1e-3)
betas (Tuple[float, float], optional) — coefficients used for computing running averages of gradient and its square (default: (0.95, 0.98))
eps (float, optional) — term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) — weight decay (L2 penalty) (default: 0)
grad_averaging (bool, optional) — gradient averaging (default: False)
amsgrad (boolean, optional) — whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)

Implements Novograd algorithm.

step

( closure = None )

Parameters

closure (callable, optional) — A closure that reevaluates the model and returns the loss.

Performs a single optimization step.

class timm.optim.RAdam

( params lr = 0.001 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0 )

class timm.optim.RMSpropTF

( params lr = 0.01 alpha = 0.9 eps = 1e-10 weight_decay = 0 momentum = 0.0 centered = False decoupled_decay = False lr_in_momentum = True )

Parameters

params (iterable) — iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) — learning rate (default: 1e-2)
momentum (float, optional) — momentum factor (default: 0)
alpha (float, optional) — smoothing (decay) constant (default: 0.9)
eps (float, optional) — term added to the denominator to improve numerical stability (default: 1e-10)
centered (bool, optional) — if True, compute the centered RMSProp, the gradient is normalized by an estimation of its variance
weight_decay (float, optional) — weight decay (L2 penalty) (default: 0)
decoupled_decay (bool, optional) — decoupled weight decay as per https://arxiv.org/abs/1711.05101
lr_in_momentum (bool, optional) — learning rate scaling is included in the momentum buffer update as per defaults in TensorFlow

Implements RMSprop algorithm (TensorFlow style epsilon)

NOTE: This is a direct cut-and-paste of PyTorch RMSprop with eps applied before sqrt and a few other modifications to more closely match TensorFlow for matching hyper-params.

Noteworthy changes include:

Epsilon applied inside square-root
square_avg initialized to ones
LR scaling of update accumulated in momentum buffer
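
The first two changes in the list can be illustrated on a single scalar gradient. This is a minimal sketch, not this class's code: it ignores momentum, centering, weight decay and the learning rate, and the function name rmsprop_update and the tf_style flag are hypothetical.

```python
import math
from typing import Tuple


def rmsprop_update(
    grad: float,
    square_avg: float,
    alpha: float = 0.9,
    eps: float = 1e-10,
    tf_style: bool = True,
) -> Tuple[float, float]:
    """Return the normalized gradient and the new running average of grad^2."""
    square_avg = alpha * square_avg + (1.0 - alpha) * grad * grad
    if tf_style:
        # Epsilon applied inside the square root.
        denom = math.sqrt(square_avg + eps)
    else:
        # Epsilon added after the square root.
        denom = math.sqrt(square_avg) + eps
    return grad / denom, square_avg


# TensorFlow style: square_avg starts at one.
tf_update, tf_avg = rmsprop_update(0.1, square_avg=1.0, tf_style=True)
# Contrast case: square_avg starts at zero, epsilon outside the root.
pt_update, pt_avg = rmsprop_update(0.1, square_avg=0.0, tf_style=False)
```

Starting the running average at one keeps the first normalized updates small, whereas a zero start divides by a very small number and produces a much larger first step for the same gradient.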

Proposed by G. Hinton in his course.

The centered version first appears in Generating Sequences With Recurrent Neural Networks.

step

( closure = None )

Parameters

closure (callable, optional) — A closure that reevaluates the model and returns the loss.

Performs a single optimization step.

class timm.optim.SGDP

( params lr = <required parameter> momentum = 0 dampening = 0 weight_decay = 0 nesterov = False eps = 1e-08 delta = 0.1 wd_ratio = 0.1 )