Megatron-LM utilities
MegatronLMPlugin
class accelerate.utils.MegatronLMPlugin
< source >( tp_degree: int = None pp_degree: int = None num_micro_batches: int = None gradient_clipping: float = None sequence_parallelism: bool = None recompute_activations: bool = None use_distributed_optimizer: bool = None pipeline_model_parallel_split_rank: int = None num_layers_per_virtual_pipeline_stage: int = None is_train_batch_min: str = True train_iters: int = None train_samples: int = None weight_decay_incr_style: str = 'constant' start_weight_decay: float = None end_weight_decay: float = None lr_decay_style: str = 'linear' lr_decay_iters: int = None lr_decay_samples: int = None lr_warmup_iters: int = None lr_warmup_samples: int = None lr_warmup_fraction: float = None min_lr: float = 0 consumed_samples: typing.List[int] = None no_wd_decay_cond: typing.Optional[typing.Callable] = None scale_lr_cond: typing.Optional[typing.Callable] = None lr_mult: float = 1.0 megatron_dataset_flag: bool = False seq_length: int = None encoder_seq_length: int = None decoder_seq_length: int = None tensorboard_dir: str = None set_all_logging_options: bool = False eval_iters: int = 100 eval_interval: int = 1000 return_logits: bool = False custom_train_step_class: typing.Optional[typing.Any] = None custom_train_step_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None custom_model_provider_function: typing.Optional[typing.Callable] = None custom_prepare_model_function: typing.Optional[typing.Callable] = None custom_megatron_datasets_provider_function: typing.Optional[typing.Callable] = None custom_get_batch_function: typing.Optional[typing.Callable] = None custom_loss_function: typing.Optional[typing.Callable] = None other_megatron_args: typing.Optional[typing.Dict[str, typing.Any]] = None )
Parameters
- tp_degree (`int`, defaults to `None`) — Tensor parallelism degree.
- pp_degree (`int`, defaults to `None`) — Pipeline parallelism degree.
- num_micro_batches (`int`, defaults to `None`) — Number of micro-batches.
- gradient_clipping (`float`, defaults to `None`) — Gradient clipping value based on global L2 norm (0 to disable).
- sequence_parallelism (`bool`, defaults to `None`) — Enable sequence parallelism.
- recompute_activations (`bool`, defaults to `None`) — Enable selective activation recomputation.
- use_distributed_optimizer (`bool`, defaults to `None`) — Enable distributed optimizer.
- pipeline_model_parallel_split_rank (`int`, defaults to `None`) — Rank where encoder and decoder should be split.
- num_layers_per_virtual_pipeline_stage (`int`, defaults to `None`) — Number of layers per virtual pipeline stage.
- is_train_batch_min (`str`, defaults to `True`) — If both train & eval dataloaders are specified, this decides the `micro_batch_size`.
- train_iters (`int`, defaults to `None`) — Total number of iterations to train over all training runs. Note that either `train_iters` or `train_samples` should be provided when using `MegatronLMDummyScheduler`.
- train_samples (`int`, defaults to `None`) — Total number of samples to train over all training runs. Note that either `train_iters` or `train_samples` should be provided when using `MegatronLMDummyScheduler`.
- weight_decay_incr_style (`str`, defaults to `'constant'`) — Weight decay increment function. Choices: `"constant"`, `"linear"`, `"cosine"`.
- start_weight_decay (`float`, defaults to `None`) — Initial weight decay coefficient for L2 regularization.
- end_weight_decay (`float`, defaults to `None`) — End-of-run weight decay coefficient for L2 regularization.
- lr_decay_style (`str`, defaults to `'linear'`) — Learning rate decay function. Choices: `"constant"`, `"linear"`, `"cosine"`.
- lr_decay_iters (`int`, defaults to `None`) — Number of iterations for learning rate decay. If `None`, defaults to `train_iters`.
- lr_decay_samples (`int`, defaults to `None`) — Number of samples for learning rate decay. If `None`, defaults to `train_samples`.
- lr_warmup_iters (`int`, defaults to `None`) — Number of iterations to linearly warm up the learning rate over.
- lr_warmup_samples (`int`, defaults to `None`) — Number of samples to linearly warm up the learning rate over.
- lr_warmup_fraction (`float`, defaults to `None`) — Fraction of `lr_warmup_(iters/samples)` to linearly warm up the learning rate over.
- min_lr (`float`, defaults to `0`) — Minimum value for the learning rate. The scheduler clips values below this threshold.
- consumed_samples (`List[int]`, defaults to `None`) — Number of samples consumed, in the same order as the dataloaders passed to the `accelerator.prepare` call.
- no_wd_decay_cond (`Optional[Callable]`, defaults to `None`) — Condition to disable weight decay.
- scale_lr_cond (`Optional[Callable]`, defaults to `None`) — Condition to scale the learning rate.
- lr_mult (`float`, defaults to `1.0`) — Learning rate multiplier.
- megatron_dataset_flag (`bool`, defaults to `False`) — Whether the dataset follows the Megatron-LM Indexed/Cached/MemoryMapped format.
- seq_length (`int`, defaults to `None`) — Maximum sequence length to process.
- encoder_seq_length (`int`, defaults to `None`) — Maximum sequence length to process for the encoder.
- decoder_seq_length (`int`, defaults to `None`) — Maximum sequence length to process for the decoder.
- tensorboard_dir (`str`, defaults to `None`) — Path to save TensorBoard logs.
- set_all_logging_options (`bool`, defaults to `False`) — Whether to set all logging options.
- eval_iters (`int`, defaults to `100`) — Number of iterations to run evaluation on the validation/test sets for.
- eval_interval (`int`, defaults to `1000`) — Interval between evaluation runs on the validation set.
- return_logits (`bool`, defaults to `False`) — Whether to return logits from the model.
- custom_train_step_class (`Optional[Any]`, defaults to `None`) — Custom train step class.
- custom_train_step_kwargs (`Optional[Dict[str, Any]]`, defaults to `None`) — Custom train step kwargs.
- custom_model_provider_function (`Optional[Callable]`, defaults to `None`) — Custom model provider function.
- custom_prepare_model_function (`Optional[Callable]`, defaults to `None`) — Custom prepare model function.
- custom_megatron_datasets_provider_function (`Optional[Callable]`, defaults to `None`) — Custom Megatron `train_valid_test` datasets provider function.
- custom_get_batch_function (`Optional[Callable]`, defaults to `None`) — Custom get batch function.
- custom_loss_function (`Optional[Callable]`, defaults to `None`) — Custom loss function.
- other_megatron_args (`Optional[Dict[str, Any]]`, defaults to `None`) — Other Megatron-LM arguments. Please refer to the Megatron-LM documentation.
Plugin for Megatron-LM to enable tensor, pipeline, sequence, and data parallelism. Also enables selective activation recomputation and optimized fused kernels.
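A minimal construction sketch; the degrees and flags below are illustrative, and in practice these values are often set through `accelerate config` instead:

```python
from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin

# Illustrative settings: 2-way tensor and pipeline parallelism,
# 2 micro-batches, gradient clipping at 1.0, plus sequence parallelism,
# selective activation recomputation and the distributed optimizer.
megatron_lm_plugin = MegatronLMPlugin(
    tp_degree=2,
    pp_degree=2,
    num_micro_batches=2,
    gradient_clipping=1.0,
    sequence_parallelism=True,
    recompute_activations=True,
    use_distributed_optimizer=True,
)
accelerator = Accelerator(megatron_lm_plugin=megatron_lm_plugin)
```

Any Megatron-LM flag not exposed as a dedicated field can be forwarded through the `other_megatron_args` dict.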
MegatronLMDummyScheduler
class accelerate.utils.MegatronLMDummyScheduler
< source >( optimizer total_num_steps = None warmup_num_steps = 0 **kwargs )
Dummy scheduler that stands in for a real learning-rate scheduler. It is primarily used to keep the conventional training-loop structure when the actual scheduler is created and managed internally by the Megatron-LM integration.
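For example, assuming an existing `optimizer` and an `args` namespace carrying the step counts (both placeholders here):

```python
from accelerate.utils import MegatronLMDummyScheduler

# `optimizer`, `args.max_train_steps` and `args.num_warmup_steps` are
# placeholders defined elsewhere in the training script.
lr_scheduler = MegatronLMDummyScheduler(
    optimizer=optimizer,
    total_num_steps=args.max_train_steps,
    warmup_num_steps=args.num_warmup_steps,
)
```

The dummy scheduler is then passed to `accelerator.prepare` alongside the model and optimizer, just like a regular scheduler.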
MegatronLMDummyDataLoader
class accelerate.utils.MegatronLMDummyDataLoader
< source >( **dataset_kwargs )
Dummy dataloader that stands in for the real dataloaders. It is primarily used to keep the conventional training-loop structure when the actual datasets and dataloaders are built by Megatron-LM from `dataset_kwargs`.
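A usage sketch, assuming a plugin created with `megatron_dataset_flag=True` and placeholder `args` fields describing the preprocessed dataset; the same dummy dataloader is passed once per train/validation/test split:

```python
from accelerate import Accelerator
from accelerate.utils import MegatronLMDummyDataLoader, MegatronLMPlugin

# Placeholder `args` fields: path to the preprocessed Megatron-LM dataset,
# the train/valid/test splits string, sequence length, micro-batch size.
megatron_dataloader = MegatronLMDummyDataLoader(
    data_path=args.data_path,
    splits_string=args.splits_string,
    seq_length=args.block_size,
    micro_batch_size=args.per_device_train_batch_size,
)

accelerator = Accelerator(megatron_lm_plugin=MegatronLMPlugin(megatron_dataset_flag=True))
# One dummy dataloader per split; Megatron-LM builds the real ones internally.
model, optimizer, lr_scheduler, train_dl, eval_dl, test_dl = accelerator.prepare(
    model, optimizer, lr_scheduler, megatron_dataloader, megatron_dataloader, megatron_dataloader
)
```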
AbstractTrainStep
class accelerate.utils.AbstractTrainStep
< source >( name )
Abstract class for batching, forward pass, and loss handling.
GPTTrainStep
class accelerate.utils.GPTTrainStep
< source >( accelerator args )
GPT train step class.
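Train step classes like this one can be subclassed and handed to `MegatronLMPlugin` through `custom_train_step_class` and `custom_train_step_kwargs`, for instance to customize the loss. A sketch, assuming the `(accelerator, args)` constructor shown above and a `(loss_mask, output_tensor)` loss signature mirroring the default GPT loss; verify both against your installed Accelerate version:

```python
import torch

from accelerate.utils import GPTTrainStep, MegatronLMPlugin, avg_losses_across_data_parallel_group


class GPTTrainStepWithCustomLoss(GPTTrainStep):
    def __init__(self, accelerator, args, **kwargs):
        super().__init__(accelerator, args)
        self.kwargs = kwargs  # forwarded from custom_train_step_kwargs

    def get_loss_func(self):
        def loss_func(loss_mask, output_tensor):
            # Masked mean over per-token losses, mirroring the default GPT loss.
            losses = output_tensor.float()
            loss_mask = loss_mask.view(-1).float()
            loss = torch.sum(losses.view(-1) * loss_mask) / loss_mask.sum()
            # Average across the data-parallel group so logging is consistent.
            averaged_loss = avg_losses_across_data_parallel_group([loss])
            return loss, {"lm loss": averaged_loss[0]}

        return loss_func


megatron_lm_plugin = MegatronLMPlugin(
    custom_train_step_class=GPTTrainStepWithCustomLoss,
    custom_train_step_kwargs={},  # any extra kwargs for the subclass
)
```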
BertTrainStep
class accelerate.utils.BertTrainStep
< source >( accelerator args )
Bert train step class.
T5TrainStep
class accelerate.utils.T5TrainStep
< source >( accelerator args )
T5 train step class.
avg_losses_across_data_parallel_group
accelerate.utils.avg_losses_across_data_parallel_group
< source >( losses )
Average losses across data parallel group.
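Typical use inside a custom loss function, as in the `GPTTrainStepWithCustomLoss` sketch above (`loss` is a placeholder for a scalar loss tensor computed on each data-parallel rank):

```python
from accelerate.utils import avg_losses_across_data_parallel_group

# `loss` is computed independently on each data-parallel rank; average it
# across the group so every rank reports the same value.
averaged_loss = avg_losses_across_data_parallel_group([loss])
```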