Megatron-LM utilities
MegatronLMPlugin
class accelerate.utils.MegatronLMPlugin
< source >( tp_degree: int = None pp_degree: int = None num_micro_batches: int = None gradient_clipping: float = None sequence_parallelism: bool = None recompute_activations: bool = None use_distributed_optimizer: bool = None pipeline_model_parallel_split_rank: int = None num_layers_per_virtual_pipeline_stage: int = None is_train_batch_min: str = True train_iters: int = None train_samples: int = None weight_decay_incr_style: str = 'constant' start_weight_decay: float = None end_weight_decay: float = None lr_decay_style: str = 'linear' lr_decay_iters: int = None lr_decay_samples: int = None lr_warmup_iters: int = None lr_warmup_samples: int = None lr_warmup_fraction: float = None min_lr: float = 0 consumed_samples: List = None no_wd_decay_cond: Optional = None scale_lr_cond: Optional = None lr_mult: float = 1.0 megatron_dataset_flag: bool = False seq_length: int = None encoder_seq_length: int = None decoder_seq_length: int = None tensorboard_dir: str = None set_all_logging_options: bool = False eval_iters: int = 100 eval_interval: int = 1000 return_logits: bool = False custom_train_step_class: Optional = None custom_train_step_kwargs: Optional = None custom_model_provider_function: Optional = None custom_prepare_model_function: Optional = None custom_megatron_datasets_provider_function: Optional = None custom_get_batch_function: Optional = None custom_loss_function: Optional = None other_megatron_args: Optional = None )
Parameters
- tp_degree (int, defaults to None) — Tensor parallelism degree.
- pp_degree (int, defaults to None) — Pipeline parallelism degree.
- num_micro_batches (int, defaults to None) — Number of micro-batches.
- gradient_clipping (float, defaults to None) — Gradient clipping value based on the global L2 norm (0 to disable).
- sequence_parallelism (bool, defaults to None) — Enable sequence parallelism.
- recompute_activations (bool, defaults to None) — Enable selective activation recomputation.
- use_distributed_optimizer (bool, defaults to None) — Enable the distributed optimizer.
- pipeline_model_parallel_split_rank (int, defaults to None) — Rank where the encoder and decoder should be split.
- num_layers_per_virtual_pipeline_stage (int, defaults to None) — Number of layers per virtual pipeline stage.
- is_train_batch_min (str, defaults to True) — If both train & eval dataloaders are specified, this decides the micro_batch_size.
- train_iters (int, defaults to None) — Total number of iterations to train over all training runs. Note that either train_iters or train_samples should be provided when using MegatronLMDummyScheduler.
- train_samples (int, defaults to None) — Total number of samples to train over all training runs. Note that either train_iters or train_samples should be provided when using MegatronLMDummyScheduler.
- weight_decay_incr_style (str, defaults to 'constant') — Weight decay increment function. choices=["constant", "linear", "cosine"].
- start_weight_decay (float, defaults to None) — Initial weight decay coefficient for L2 regularization.
- end_weight_decay (float, defaults to None) — End-of-run weight decay coefficient for L2 regularization.
- lr_decay_style (str, defaults to 'linear') — Learning rate decay function. choices=["constant", "linear", "cosine"].
- lr_decay_iters (int, defaults to None) — Number of iterations over which to decay the learning rate. If None, defaults to train_iters.
- lr_decay_samples (int, defaults to None) — Number of samples over which to decay the learning rate. If None, defaults to train_samples.
- lr_warmup_iters (int, defaults to None) — Number of iterations over which to linearly warm up the learning rate.
- lr_warmup_samples (int, defaults to None) — Number of samples over which to linearly warm up the learning rate.
- lr_warmup_fraction (float, defaults to None) — Fraction of lr-warmup-(iters/samples) over which to linearly warm up the learning rate.
- min_lr (float, defaults to 0) — Minimum value for the learning rate; the scheduler clips values below this threshold.
- consumed_samples (List, defaults to None) — Number of samples consumed, given in the same order as the dataloaders passed to the accelerator.prepare call.
- no_wd_decay_cond (Optional, defaults to None) — Condition to disable weight decay.
- scale_lr_cond (Optional, defaults to None) — Condition to scale the learning rate.
- lr_mult (float, defaults to 1.0) — Learning rate multiplier.
- megatron_dataset_flag (bool, defaults to False) — Whether the dataset follows the Megatron-LM Indexed/Cached/MemoryMapped format.
- seq_length (int, defaults to None) — Maximum sequence length to process.
- encoder_seq_length (int, defaults to None) — Maximum sequence length to process for the encoder.
- decoder_seq_length (int, defaults to None) — Maximum sequence length to process for the decoder.
- tensorboard_dir (str, defaults to None) — Path to save TensorBoard logs.
- set_all_logging_options (bool, defaults to False) — Whether to set all logging options.
- eval_iters (int, defaults to 100) — Number of iterations to run evaluation for on the validation/test sets.
- eval_interval (int, defaults to 1000) — Interval between evaluation runs on the validation set.
- return_logits (bool, defaults to False) — Whether to return logits from the model.
- custom_train_step_class (Optional, defaults to None) — Custom train step class.
- custom_train_step_kwargs (Optional, defaults to None) — Custom train step kwargs.
- custom_model_provider_function (Optional, defaults to None) — Custom model provider function.
- custom_prepare_model_function (Optional, defaults to None) — Custom prepare model function.
- custom_megatron_datasets_provider_function (Optional, defaults to None) — Custom Megatron train/valid/test datasets provider function.
- custom_get_batch_function (Optional, defaults to None) — Custom get batch function.
- custom_loss_function (Optional, defaults to None) — Custom loss function.
- other_megatron_args (Optional, defaults to None) — Other Megatron-LM arguments. Please refer to the Megatron-LM documentation.
Plugin for Megatron-LM to enable tensor, pipeline, sequence, and data parallelism, as well as selective activation recomputation and optimized fused kernels.
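A minimal sketch of configuring the plugin and handing it to the Accelerator; the parallelism degrees and flags below are illustrative values, not recommendations. Note that tp_degree * pp_degree must divide the total number of GPUs, with the remainder forming the data parallel group.

```python
from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin

# Illustrative settings for a small multi-GPU job.
megatron_lm_plugin = MegatronLMPlugin(
    tp_degree=2,                 # tensor parallelism across 2 GPUs
    pp_degree=2,                 # pipeline parallelism across 2 stages
    num_micro_batches=2,         # micro-batches per pipeline step
    gradient_clipping=1.0,       # global L2-norm clipping
    sequence_parallelism=True,
    recompute_activations=True,  # selective activation recomputation
    use_distributed_optimizer=True,
)

accelerator = Accelerator(megatron_lm_plugin=megatron_lm_plugin)
```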
MegatronLMDummyScheduler
class accelerate.utils.MegatronLMDummyScheduler
< source >( optimizer total_num_steps = None warmup_num_steps = 0 **kwargs )
Dummy scheduler that acts as a stand-in for a real learning rate scheduler; it is primarily used to preserve the conventional training loop when the actual schedule is created and driven by Megatron-LM from the plugin's settings.
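A short sketch of creating the dummy scheduler before accelerator.prepare; the model and optimizer here are placeholders, and the step counts are illustrative.

```python
import torch

from accelerate.utils import MegatronLMDummyScheduler

model = torch.nn.Linear(8, 8)                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters())  # placeholder optimizer

# Stand-in scheduler: the real schedule is built by Megatron-LM from the
# plugin's lr_* settings when this object goes through accelerator.prepare.
lr_scheduler = MegatronLMDummyScheduler(
    optimizer=optimizer,
    total_num_steps=1000,
    warmup_num_steps=100,
)
```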
MegatronLMDummyDataLoader
class accelerate.utils.MegatronLMDummyDataLoader
< source >( **dataset_kwargs )
Dummy dataloader that acts as a stand-in for a real dataloader; it is primarily used to preserve the conventional training loop while the actual dataloaders are built by Megatron-LM from the provided dataset arguments.
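A sketch of building the dummy dataloader, following the pattern in the Accelerate Megatron-LM guide; the dataset prefix, split string, and sizes are illustrative, and accelerator is assumed to have been created with a MegatronLMPlugin as above.

```python
from accelerate.utils import MegatronLMDummyDataLoader

# Keys mirror Megatron-LM's data arguments; values below are illustrative.
megatron_dataloader_config = {
    "data_path": ["my-gpt2_text_document"],  # prefix of a preprocessed Megatron-LM dataset
    "splits_string": "949,50,1",             # train/validation/test split proportions
    "seq_length": 1024,
    "micro_batch_size": 4,
}
megatron_dataloader = MegatronLMDummyDataLoader(**megatron_dataloader_config)

# Tell the plugin the data follows the Megatron-LM indexed format.
accelerator.state.megatron_lm_plugin.megatron_dataset_flag = True
```

The same dummy dataloader is then passed to accelerator.prepare once per split (train, validation, test), and real Megatron-LM dataloaders are returned in its place.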
AbstractTrainStep
class accelerate.utils.AbstractTrainStep
< source >( name )
Abstract base class covering batching, the forward pass, and loss handling for a Megatron-LM train step.
GPTTrainStep
class accelerate.utils.GPTTrainStep
< source >( accelerator args )
GPT train step class.
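The train step hooks can be overridden for custom behavior. Below is a sketch adapted from the custom loss example in the Accelerate Megatron-LM guide, assuming output_tensor holds per-token losses; hook signatures can differ across versions, so treat this as illustrative rather than definitive.

```python
from accelerate.utils import (
    GPTTrainStep,
    MegatronLMPlugin,
    avg_losses_across_data_parallel_group,
)


class GPTTrainStepWithCustomLoss(GPTTrainStep):
    def __init__(self, accelerator, args, **kwargs):
        super().__init__(accelerator, args)
        self.kwargs = kwargs  # extra options from custom_train_step_kwargs

    def get_loss_func(self):
        def loss_func(inputs, loss_mask, output_tensor):
            batch_size, seq_length = output_tensor.shape
            losses = output_tensor.float()
            loss_mask = loss_mask.view(-1).float()
            loss = losses.view(-1) * loss_mask

            # Average per sample instead of over all tokens in the batch.
            loss_per_sample = loss.view(batch_size, seq_length).sum(dim=1)
            tokens_per_sample = loss_mask.view(batch_size, seq_length).sum(dim=1)
            loss = (loss_per_sample / tokens_per_sample).mean()

            # Reduce across the data parallel group so every rank logs the same value.
            averaged_loss = avg_losses_across_data_parallel_group([loss])
            return loss, {"lm loss": averaged_loss[0]}

        return loss_func


# Hand the custom class to the plugin; the kwargs are forwarded to its __init__.
megatron_lm_plugin = MegatronLMPlugin(
    custom_train_step_class=GPTTrainStepWithCustomLoss,
    custom_train_step_kwargs={},
)
```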
BertTrainStep
class accelerate.utils.BertTrainStep
< source >( accelerator args )
BERT train step class.
T5TrainStep
class accelerate.utils.T5TrainStep
< source >( accelerator args )
T5 train step class.
avg_losses_across_data_parallel_group
accelerate.utils.avg_losses_across_data_parallel_group
< source >( losses )
Average losses across the data parallel group.
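A minimal usage sketch; this only works inside a running Megatron-LM job where the data parallel group has been initialized, and the loss value here is a hypothetical per-rank result.

```python
import torch

from accelerate.utils import avg_losses_across_data_parallel_group

# Hypothetical per-rank loss; in practice this comes out of a loss function
# like the one in the custom train step above.
loss = torch.tensor(2.0, device="cuda")

# Takes a list of losses and returns them averaged over all data parallel ranks.
averaged_loss = avg_losses_across_data_parallel_group([loss])
print(averaged_loss[0])  # identical on every data parallel rank
```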