Adam

Adam (Adaptive Moment Estimation) is an adaptive learning rate optimizer. It combines ideas from SGD with momentum and RMSprop to scale the learning rate automatically, using two running statistics:

a weighted average of the past gradients to provide direction (first moment)
a weighted average of the squared past gradients to adapt the learning rate to each parameter (second moment)
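
As a rough illustration of how the two moments combine, here is a minimal pure-Python sketch of one Adam update for a single scalar parameter. The function name `adam_step` and its arguments are illustrative and not part of bitsandbytes; weight decay and AMSGrad are left out for brevity.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One textbook Adam update for a scalar parameter (illustrative only).

    m and v are the running first and second moments; t is the 1-based step.
    """
    beta1, beta2 = betas
    # First moment: weighted average of past gradients (direction).
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: weighted average of past squared gradients (scale).
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter step: direction from m_hat, scaled by sqrt(v_hat).
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to 1 in magnitude, so the parameter moves by roughly `lr` regardless of the gradient's size.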

bitsandbytes also supports paged optimizers, which take advantage of CUDA's unified memory to move optimizer state from the GPU to the CPU when GPU memory is exhausted.

Adam

class bitsandbytes.optim.Adam

( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True, is_paged = False )

__init__

Parameters

params (torch.tensor) — The input parameters to optimize.
lr (float, defaults to 1e-3) — The learning rate.
betas (tuple(float, float), defaults to (0.9, 0.999)) — The exponential decay rates for the first- and second-moment estimates.
eps (float, defaults to 1e-8) — The epsilon value prevents division by zero in the optimizer.
weight_decay (float, defaults to 0.0) — The weight decay value for the optimizer.
amsgrad (bool, defaults to False) — Whether to use the AMSGrad variant of Adam, which normalizes by the maximum of the past second-moment estimates instead of the current estimate.
optim_bits (int, defaults to 32) — The number of bits of the optimizer state.
args (dict, defaults to None) — A dictionary with additional arguments.
min_8bit_size (int, defaults to 4096) — The minimum number of elements of the parameter tensors for 8-bit optimization.
percentile_clipping (int, defaults to 100) — Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
block_wise (bool, defaults to True) — Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
is_paged (bool, defaults to False) — Whether the optimizer is a paged optimizer or not.

Base Adam optimizer.
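
To make the `percentile_clipping` description more concrete, the following is a simplified pure-Python sketch of percentile-based gradient clipping. The class `PercentileClipper`, the nearest-rank percentile rule, and the use of plain lists are assumptions made for illustration; the library's internal implementation may differ in its details.

```python
import math
from collections import deque


class PercentileClipper:
    """Hypothetical helper: rescale gradients whose norm exceeds a percentile
    of the most recent gradient norms."""

    def __init__(self, percentile=100, history=100):
        self.percentile = percentile
        # Only the last `history` gradient norms are remembered.
        self.norms = deque(maxlen=history)

    def scale(self, grad):
        """Return the factor to multiply `grad` by (1.0 means no clipping)."""
        norm = math.sqrt(sum(g * g for g in grad))
        self.norms.append(norm)
        ranked = sorted(self.norms)
        # Nearest-rank percentile of the tracked norms.
        idx = max(0, math.ceil(self.percentile / 100 * len(ranked)) - 1)
        threshold = ranked[idx]
        if norm > threshold:
            return threshold / norm
        return 1.0


clipper = PercentileClipper(percentile=50)
first = clipper.scale([1.0])   # nothing to compare against yet
second = clipper.scale([2.0])  # larger than the tracked 50th percentile
```

With `percentile=100` the threshold is the largest tracked norm, which always includes the current one, so in this sketch the default value never rescales the gradient.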

Adam8bit

class bitsandbytes.optim.Adam8bit

( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True, is_paged = False )

__init__

Parameters

params (torch.tensor) — The input parameters to optimize.
lr (float, defaults to 1e-3) — The learning rate.
betas (tuple(float, float), defaults to (0.9, 0.999)) — The exponential decay rates for the first- and second-moment estimates.
eps (float, defaults to 1e-8) — The epsilon value prevents division by zero in the optimizer.
weight_decay (float, defaults to 0.0) — The weight decay value for the optimizer.
amsgrad (bool, defaults to False) — Whether to use the AMSGrad variant of Adam, which normalizes by the maximum of the past second-moment estimates instead of the current estimate.
optim_bits (int, defaults to 32) — The number of bits of the optimizer state.
args (dict, defaults to None) — A dictionary with additional arguments.
min_8bit_size (int, defaults to 4096) — The minimum number of elements of the parameter tensors for 8-bit optimization.
percentile_clipping (int, defaults to 100) — Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
block_wise (bool, defaults to True) — Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
is_paged (bool, defaults to False) — Whether the optimizer is a paged optimizer or not.

8-bit Adam optimizer.
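
The `block_wise` option quantizes each block of a tensor independently. The sketch below shows why that helps, using a deliberately simple linear absmax scheme in pure Python. The function names, the tiny block size, and the linear mapping are assumptions chosen for readability; the quantization scheme and block size actually used by the library are not described here and may differ.

```python
def quantize_blockwise(values, block_size=4):
    """Map floats to integer codes in [-127, 127], one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # Fall back to 1.0 for an all-zero block to avoid dividing by zero.
        absmax = max(abs(v) for v in block) or 1.0
        scales.append(absmax)
        codes.extend(round(v / absmax * 127) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Invert quantize_blockwise up to rounding error."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]


# The second half contains an outlier that dwarfs the first half.
values = [0.01, -0.02, 0.005, 0.0, 100.0, -50.0, 25.0, 0.0]

codes, scales = quantize_blockwise(values, block_size=4)
restored = dequantize_blockwise(codes, scales, block_size=4)

# With a single block, the outlier's scale is shared by every element.
global_codes, global_scales = quantize_blockwise(values, block_size=8)
```

With two blocks the small values keep their own scale and survive the round trip; with one shared block they all collapse to code 0 because the outlier sets the scale.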

Adam32bit

class bitsandbytes.optim.Adam32bit

( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True, is_paged = False )

__init__

Parameters

params (torch.tensor) — The input parameters to optimize.
lr (float, defaults to 1e-3) — The learning rate.
betas (tuple(float, float), defaults to (0.9, 0.999)) — The exponential decay rates for the first- and second-moment estimates.
eps (float, defaults to 1e-8) — The epsilon value prevents division by zero in the optimizer.
weight_decay (float, defaults to 0.0) — The weight decay value for the optimizer.
amsgrad (bool, defaults to False) — Whether to use the AMSGrad variant of Adam, which normalizes by the maximum of the past second-moment estimates instead of the current estimate.
optim_bits (int, defaults to 32) — The number of bits of the optimizer state.
args (dict, defaults to None) — A dictionary with additional arguments.
min_8bit_size (int, defaults to 4096) — The minimum number of elements of the parameter tensors for 8-bit optimization.
percentile_clipping (int, defaults to 100) — Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
block_wise (bool, defaults to True) — Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
is_paged (bool, defaults to False) — Whether the optimizer is a paged optimizer or not.

32-bit Adam optimizer.
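
For a sense of what `optim_bits` and `min_8bit_size` mean for memory, here is a back-of-the-envelope estimate in Python. It assumes Adam keeps two state values per parameter (the first and second moments) and that tensors with fewer than `min_8bit_size` elements keep 32-bit state even when 8-bit state is requested; the helper `adam_state_bytes` is hypothetical and ignores any per-block quantization metadata.

```python
def adam_state_bytes(tensor_sizes, optim_bits=32, min_8bit_size=4096):
    """Estimate optimizer-state memory in bytes for a list of tensor sizes."""
    total = 0
    for numel in tensor_sizes:
        # Assumption: small tensors fall back to 32-bit state.
        too_small = optim_bits == 8 and numel < min_8bit_size
        bits = 32 if too_small else optim_bits
        # Two state values (first and second moment) per parameter.
        total += 2 * numel * bits // 8
    return total


sizes = [1_000_000, 1_000]  # one large weight matrix, one small bias
bytes_32bit = adam_state_bytes(sizes, optim_bits=32)
bytes_8bit = adam_state_bytes(sizes, optim_bits=8)
```

Under these assumptions the large tensor's state shrinks by a factor of four, while the small tensor costs the same in both settings.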

PagedAdam

class bitsandbytes.optim.PagedAdam

( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True, is_paged = False )

__init__

Parameters

params (torch.tensor) — The input parameters to optimize.
lr (float, defaults to 1e-3) — The learning rate.
betas (tuple(float, float), defaults to (0.9, 0.999)) — The exponential decay rates for the first- and second-moment estimates.
eps (float, defaults to 1e-8) — The epsilon value prevents division by zero in the optimizer.
weight_decay (float, defaults to 0.0) — The weight decay value for the optimizer.
amsgrad (bool, defaults to False) — Whether to use the AMSGrad variant of Adam, which normalizes by the maximum of the past second-moment estimates instead of the current estimate.
optim_bits (int, defaults to 32) — The number of bits of the optimizer state.
args (dict, defaults to None) — A dictionary with additional arguments.
min_8bit_size (int, defaults to 4096) — The minimum number of elements of the parameter tensors for 8-bit optimization.
percentile_clipping (int, defaults to 100) — Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
block_wise (bool, defaults to True) — Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
is_paged (bool, defaults to False) — Whether the optimizer is a paged optimizer or not.

Paged Adam optimizer.

PagedAdam8bit

class bitsandbytes.optim.PagedAdam8bit

( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True, is_paged = False )

__init__

Parameters

params (torch.tensor) — The input parameters to optimize.
lr (float, defaults to 1e-3) — The learning rate.
betas (tuple(float, float), defaults to (0.9, 0.999)) — The exponential decay rates for the first- and second-moment estimates.
eps (float, defaults to 1e-8) — The epsilon value prevents division by zero in the optimizer.
weight_decay (float, defaults to 0.0) — The weight decay value for the optimizer.
amsgrad (bool, defaults to False) — Whether to use the AMSGrad variant of Adam, which normalizes by the maximum of the past second-moment estimates instead of the current estimate.
optim_bits (int, defaults to 32) — The number of bits of the optimizer state.
args (dict, defaults to None) — A dictionary with additional arguments.
min_8bit_size (int, defaults to 4096) — The minimum number of elements of the parameter tensors for 8-bit optimization.
percentile_clipping (int, defaults to 100) — Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
block_wise (bool, defaults to True) — Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
is_paged (bool, defaults to False) — Whether the optimizer is a paged optimizer or not.

8-bit paged Adam optimizer.

PagedAdam32bit

class bitsandbytes.optim.PagedAdam32bit

( params, lr = 0.001, betas = (0.9, 0.999), eps = 1e-08, weight_decay = 0, amsgrad = False, optim_bits = 32, args = None, min_8bit_size = 4096, percentile_clipping = 100, block_wise = True, is_paged = False )

__init__

Parameters

params (torch.tensor) — The input parameters to optimize.
lr (float, defaults to 1e-3) — The learning rate.
betas (tuple(float, float), defaults to (0.9, 0.999)) — The exponential decay rates for the first- and second-moment estimates.
eps (float, defaults to 1e-8) — The epsilon value prevents division by zero in the optimizer.
weight_decay (float, defaults to 0.0) — The weight decay value for the optimizer.
amsgrad (bool, defaults to False) — Whether to use the AMSGrad variant of Adam, which normalizes by the maximum of the past second-moment estimates instead of the current estimate.
optim_bits (int, defaults to 32) — The number of bits of the optimizer state.
args (dict, defaults to None) — A dictionary with additional arguments.
min_8bit_size (int, defaults to 4096) — The minimum number of elements of the parameter tensors for 8-bit optimization.
percentile_clipping (int, defaults to 100) — Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
block_wise (bool, defaults to True) — Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
is_paged (bool, defaults to False) — Whether the optimizer is a paged optimizer or not.

Paged 32-bit Adam optimizer.
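
Since the six classes share one signature and differ only in state precision and paging, choosing between them can be reduced to a small lookup. The sketch below is one possible way to do that; `adam_class_name` and `build_adam` are hypothetical helpers, not part of bitsandbytes, and `build_adam` imports the library lazily so the name lookup can be used without it installed.

```python
def adam_class_name(bits=None, paged=False):
    """Return the name of the documented class for the requested variant.

    bits=None selects the base class (Adam or PagedAdam).
    """
    if bits not in (None, 8, 32):
        raise ValueError("bits must be None, 8, or 32")
    prefix = "Paged" if paged else ""
    suffix = "" if bits is None else f"{bits}bit"
    return f"{prefix}Adam{suffix}"


def build_adam(params, bits=None, paged=False, **kwargs):
    """Instantiate the chosen optimizer; requires bitsandbytes to be installed."""
    import bitsandbytes as bnb  # lazy import, only needed when building

    optimizer_cls = getattr(bnb.optim, adam_class_name(bits, paged))
    return optimizer_cls(params, **kwargs)


name = adam_class_name(bits=8, paged=True)
```

A call such as `build_adam(model.parameters(), bits=8, paged=True, lr=1e-3)` would then construct a `PagedAdam8bit` instance, where `model` stands for any PyTorch module you are training.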
