Bitsandbytes documentation

AdamW

You are viewing main version, which requires installation from source. If you'd like regular pip install, checkout the latest stable version (v0.45.0).
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

AdamW

AdamW is a variant of the Adam optimizer that separates weight decay from the gradient update based on the observation that the weight decay formulation is different when applied to SGD and Adam.

bitsandbytes also supports paged optimizers which take advantage of CUDAs unified memory to transfer memory from the GPU to the CPU when GPU memory is exhausted.

AdamW

class bitsandbytes.optim.AdamW

< >

( params lr = 0.001 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0.01 amsgrad = False optim_bits = 32 args = None min_8bit_size = 4096 percentile_clipping = 100 block_wise = True is_paged = False )

__init__

< >

( params lr = 0.001 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0.01 amsgrad = False optim_bits = 32 args = None min_8bit_size = 4096 percentile_clipping = 100 block_wise = True is_paged = False )

Parameters

  • params (torch.tensor) — The input parameters to optimize.
  • lr (float, defaults to 1e-3) — The learning rate.
  • betas (tuple(float, float), defaults to (0.9, 0.999)) — The beta values are the decay rates of the first and second-order moment of the optimizer.
  • eps (float, defaults to 1e-8) — The epsilon value prevents division by zero in the optimizer.
  • weight_decay (float, defaults to 1e-2) — The weight decay value for the optimizer.
  • amsgrad (bool, defaults to False) — Whether to use the AMSGrad variant of Adam that uses the maximum of past squared gradients instead.
  • optim_bits (int, defaults to 32) — The number of bits of the optimizer state.
  • args (object, defaults to None) — An object with additional arguments.
  • min_8bit_size (int, defaults to 4096) — The minimum number of elements of the parameter tensors for 8-bit optimization.
  • percentile_clipping (int, defaults to 100) — Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
  • block_wise (bool, defaults to True) — Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
  • is_paged (bool, defaults to False) — Whether the optimizer is a paged optimizer or not.

Base AdamW optimizer.

AdamW8bit

class bitsandbytes.optim.AdamW8bit

< >

( params lr = 0.001 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0.01 amsgrad = False optim_bits = 32 args = None min_8bit_size = 4096 percentile_clipping = 100 block_wise = True is_paged = False )

__init__

< >

( params lr = 0.001 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0.01 amsgrad = False optim_bits = 32 args = None min_8bit_size = 4096 percentile_clipping = 100 block_wise = True is_paged = False )

Parameters

  • params (torch.tensor) — The input parameters to optimize.
  • lr (float, defaults to 1e-3) — The learning rate.
  • betas (tuple(float, float), defaults to (0.9, 0.999)) — The beta values are the decay rates of the first and second-order moment of the optimizer.
  • eps (float, defaults to 1e-8) — The epsilon value prevents division by zero in the optimizer.
  • weight_decay (float, defaults to 1e-2) — The weight decay value for the optimizer.
  • amsgrad (bool, defaults to False) — Whether to use the AMSGrad variant of Adam that uses the maximum of past squared gradients instead.
  • optim_bits (int, defaults to 32) — The number of bits of the optimizer state.
  • args (object, defaults to None) — An object with additional arguments.
  • min_8bit_size (int, defaults to 4096) — The minimum number of elements of the parameter tensors for 8-bit optimization.
  • percentile_clipping (int, defaults to 100) — Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
  • block_wise (bool, defaults to True) — Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
  • is_paged (bool, defaults to False) — Whether the optimizer is a paged optimizer or not.

8-bit AdamW optimizer.

AdamW32bit

class bitsandbytes.optim.AdamW32bit

< >

( params lr = 0.001 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0.01 amsgrad = False optim_bits = 32 args = None min_8bit_size = 4096 percentile_clipping = 100 block_wise = True is_paged = False )

__init__

< >

( params lr = 0.001 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0.01 amsgrad = False optim_bits = 32 args = None min_8bit_size = 4096 percentile_clipping = 100 block_wise = True is_paged = False )

Parameters

  • params (torch.tensor) — The input parameters to optimize.
  • lr (float, defaults to 1e-3) — The learning rate.
  • betas (tuple(float, float), defaults to (0.9, 0.999)) — The beta values are the decay rates of the first and second-order moment of the optimizer.
  • eps (float, defaults to 1e-8) — The epsilon value prevents division by zero in the optimizer.
  • weight_decay (float, defaults to 1e-2) — The weight decay value for the optimizer.
  • amsgrad (bool, defaults to False) — Whether to use the AMSGrad variant of Adam that uses the maximum of past squared gradients instead.
  • optim_bits (int, defaults to 32) — The number of bits of the optimizer state.
  • args (object, defaults to None) — An object with additional arguments.
  • min_8bit_size (int, defaults to 4096) — The minimum number of elements of the parameter tensors for 8-bit optimization.
  • percentile_clipping (int, defaults to 100) — Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
  • block_wise (bool, defaults to True) — Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
  • is_paged (bool, defaults to False) — Whether the optimizer is a paged optimizer or not.

32-bit AdamW optimizer.

PagedAdamW

class bitsandbytes.optim.PagedAdamW

< >

( params lr = 0.001 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0.01 amsgrad = False optim_bits = 32 args = None min_8bit_size = 4096 percentile_clipping = 100 block_wise = True )

__init__

< >

( params lr = 0.001 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0.01 amsgrad = False optim_bits = 32 args = None min_8bit_size = 4096 percentile_clipping = 100 block_wise = True )

Parameters

  • params (torch.tensor) — The input parameters to optimize.
  • lr (float, defaults to 1e-3) — The learning rate.
  • betas (tuple(float, float), defaults to (0.9, 0.999)) — The beta values are the decay rates of the first and second-order moment of the optimizer.
  • eps (float, defaults to 1e-8) — The epsilon value prevents division by zero in the optimizer.
  • weight_decay (float, defaults to 1e-2) — The weight decay value for the optimizer.
  • amsgrad (bool, defaults to False) — Whether to use the AMSGrad variant of Adam that uses the maximum of past squared gradients instead.
  • optim_bits (int, defaults to 32) — The number of bits of the optimizer state.
  • args (object, defaults to None) — An object with additional arguments.
  • min_8bit_size (int, defaults to 4096) — The minimum number of elements of the parameter tensors for 8-bit optimization.
  • percentile_clipping (int, defaults to 100) — Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
  • block_wise (bool, defaults to True) — Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
  • is_paged (bool, defaults to False) — Whether the optimizer is a paged optimizer or not.

Paged AdamW optimizer.

PagedAdamW8bit

class bitsandbytes.optim.PagedAdamW8bit

< >

( params lr = 0.001 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0.01 amsgrad = False optim_bits = 32 args = None min_8bit_size = 4096 percentile_clipping = 100 block_wise = True )

__init__

< >

( params lr = 0.001 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0.01 amsgrad = False optim_bits = 32 args = None min_8bit_size = 4096 percentile_clipping = 100 block_wise = True )

Parameters

  • params (torch.tensor) — The input parameters to optimize.
  • lr (float, defaults to 1e-3) — The learning rate.
  • betas (tuple(float, float), defaults to (0.9, 0.999)) — The beta values are the decay rates of the first and second-order moment of the optimizer.
  • eps (float, defaults to 1e-8) — The epsilon value prevents division by zero in the optimizer.
  • weight_decay (float, defaults to 1e-2) — The weight decay value for the optimizer.
  • amsgrad (bool, defaults to False) — Whether to use the AMSGrad variant of Adam that uses the maximum of past squared gradients instead.
  • optim_bits (int, defaults to 32) — The number of bits of the optimizer state.
  • args (object, defaults to None) — An object with additional arguments.
  • min_8bit_size (int, defaults to 4096) — The minimum number of elements of the parameter tensors for 8-bit optimization.
  • percentile_clipping (int, defaults to 100) — Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
  • block_wise (bool, defaults to True) — Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
  • is_paged (bool, defaults to False) — Whether the optimizer is a paged optimizer or not.

Paged 8-bit AdamW optimizer.

PagedAdamW32bit

class bitsandbytes.optim.PagedAdamW32bit

< >

( params lr = 0.001 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0.01 amsgrad = False optim_bits = 32 args = None min_8bit_size = 4096 percentile_clipping = 100 block_wise = True )

__init__

< >

( params lr = 0.001 betas = (0.9, 0.999) eps = 1e-08 weight_decay = 0.01 amsgrad = False optim_bits = 32 args = None min_8bit_size = 4096 percentile_clipping = 100 block_wise = True )

Parameters

  • params (torch.tensor) — The input parameters to optimize.
  • lr (float, defaults to 1e-3) — The learning rate.
  • betas (tuple(float, float), defaults to (0.9, 0.999)) — The beta values are the decay rates of the first and second-order moment of the optimizer.
  • eps (float, defaults to 1e-8) — The epsilon value prevents division by zero in the optimizer.
  • weight_decay (float, defaults to 1e-2) — The weight decay value for the optimizer.
  • amsgrad (bool, defaults to False) — Whether to use the AMSGrad variant of Adam that uses the maximum of past squared gradients instead.
  • optim_bits (int, defaults to 32) — The number of bits of the optimizer state.
  • args (object, defaults to None) — An object with additional arguments.
  • min_8bit_size (int, defaults to 4096) — The minimum number of elements of the parameter tensors for 8-bit optimization.
  • percentile_clipping (int, defaults to 100) — Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
  • block_wise (bool, defaults to True) — Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
  • is_paged (bool, defaults to False) — Whether the optimizer is a paged optimizer or not.

Paged 32-bit AdamW optimizer.

< > Update on GitHub