## Accelerate configuration
Below is an example YAML for BF16 mixed-precision training using Megatron-LM with 2x data parallelism, 2x pipeline parallelism, and 2x tensor parallelism on 8 GPUs. It also uses sequence parallelism, selective activation checkpointing, and a sharded optimizer. The lines marked with `+` are the Megatron-LM-specific additions to a standard Accelerate config.
```diff
 compute_environment: LOCAL_MACHINE
 deepspeed_config: {}
+distributed_type: MEGATRON_LM
 downcast_bf16: 'no'
 dynamo_backend: 'NO'
 fsdp_config: {}
 machine_rank: 0
 main_training_function: main
+megatron_lm_config:
+  megatron_lm_gradient_clipping: 1.0
+  megatron_lm_num_micro_batches: 2
+  megatron_lm_pp_degree: 2
+  megatron_lm_recompute_activations: true
+  megatron_lm_sequence_parallelism: true
+  megatron_lm_tp_degree: 2
+  megatron_lm_use_distributed_optimizer: true
 mixed_precision: bf16
 num_machines: 1
 num_processes: 8
 rdzv_backend: static
 same_network: true
 use_cpu: false
```
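As a quick sanity check on the layout above (plain arithmetic, not an Accelerate API), the data-parallel degree is whatever remains after tensor and pipeline parallelism divide the 8 processes, and Megatron-LM's global batch size is the per-device micro batch size scaled by the number of micro batches and the data-parallel degree. In the sketch below, `micro_batch_size` is a placeholder for your dataloader's per-device batch size:

```python
# Plain arithmetic mirroring the config above; no Accelerate/Megatron-LM imports needed.
num_processes = 8        # num_processes in the YAML
tp_degree = 2            # megatron_lm_tp_degree
pp_degree = 2            # megatron_lm_pp_degree
num_micro_batches = 2    # megatron_lm_num_micro_batches
micro_batch_size = 4     # placeholder: your per-device batch size

dp_degree = num_processes // (tp_degree * pp_degree)                  # -> 2-way data parallelism
global_batch_size = micro_batch_size * num_micro_batches * dp_degree  # -> 16 in this sketch

assert tp_degree * pp_degree * dp_degree == num_processes
```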
## Code changes
The required code changes relative to a standard Accelerate training loop are shown below; lines prefixed with `+` are additions and lines prefixed with `-` are removals.

```diff
 from accelerate import Accelerator
+from accelerate.utils import MegatronLMDummyScheduler

 accelerator = Accelerator()
 ...
-lr_scheduler = get_scheduler(
-    name=args.lr_scheduler_type,
-    ...
-)
+lr_scheduler = MegatronLMDummyScheduler(
+    optimizer=optimizer,
+    num_warmup_steps=...,
+    num_training_steps=...,
+)

 model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
     model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
 )

 total_batch_size = (
-    args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
+    accelerator.state.megatron_lm_plugin.global_batch_size
 )

 # in evaluation loop
 for step, batch in enumerate(eval_dataloader):
     with torch.no_grad():
         outputs = model(**batch)
     loss = outputs.loss
-    losses.append(accelerator.gather_for_metrics(loss.repeat(args.per_device_eval_batch_size)))
+    losses.append(loss)  # for Megatron-LM, losses are already averaged across the data parallel group

-losses = torch.cat(losses)
+losses = torch.tensor(losses)
```
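After the evaluation loop, the collected losses can be reduced to an evaluation loss and perplexity in the usual way. A minimal sketch, assuming the `losses` list built in the loop above and a causal language modeling objective:

```python
import math

import torch

# Each entry in `losses` was already averaged over the data parallel group by Megatron-LM,
# so a plain mean over the collected values is all that is needed.
eval_loss = torch.mean(torch.tensor(losses))
try:
    perplexity = math.exp(eval_loss.item())
except OverflowError:
    perplexity = float("inf")
```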
## Launching the training
| If the YAML was generated through the `accelerate config` command: | |
| ``` | |
| accelerate launch {script_name.py} {--arg1} {--arg2} ... | |
| ``` | |
| If the YAML is saved to a `~/config.yaml` file: | |
| ``` | |
| accelerate launch --config_file ~/config.yaml {script_name.py} {--arg1} {--arg2} ... | |
| ``` | |
Or you can pass the right configuration parameters directly to `accelerate launch` and skip the `config.yaml` file entirely:
| ``` | |
| accelerate launch \ | |
| --use_megatron_lm \ | |
| --num_processes=8 \ | |
| --mixed_precision=bf16 \ | |
| --megatron_lm_tp_degree=2 \ | |
| --megatron_lm_pp_degree=2 \ | |
| --megatron_lm_num_micro_batches=2 \ | |
| --megatron_lm_sequence_parallelism=true \ | |
| --megatron_lm_recompute_activations=true \ | |
| --megatron_lm_use_distributed_optimizer=true \ | |
| {script_name.py} {--arg1} {--arg2} ... | |
| ``` | |
## Supported models and required changes
For Megatron-LM, the supported models are the Transformers GPT2, Megatron-BERT, and T5 models, covering the decoder-only, encoder-only, and encoder-decoder model classes. Given the complexity of Megatron-LM's features, four changes are required to get started:
1. Use `accelerate.utils.MegatronLMDummyScheduler`: Megatron-LM uses its own optimizer implementation, so a scheduler compatible with it is required.
2. Computing the total batch size now needs to account for the tensor and pipeline parallel degrees (use `accelerator.state.megatron_lm_plugin.global_batch_size`).
3. Losses are already averaged across the data parallel group, so no additional gathering is needed for metrics.
4. Save the model using `accelerator.save_state` instead of the Transformers `save_pretrained` method (see the sketch after this list).
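For point 4, a minimal sketch of what checkpointing can look like; `ckpt_dir` is a placeholder path, and resuming uses the matching `accelerator.load_state` call:

```python
# Point 4: checkpoint through Accelerate rather than model.save_pretrained();
# "ckpt_dir" is a placeholder path, not defined elsewhere in this guide.
accelerator.save_state("ckpt_dir")

# ...later, to resume training from that checkpoint:
accelerator.load_state("ckpt_dir")
```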
The Accelerate Megatron-LM integration supports many advanced features, such as:
- Leveraging custom training steps
- Using Megatron-LM indexed datasets
- Checkpoint reshaping and interoperability utilities
- Using `megatron_generate` for text generation with tensor and pipeline parallelism
- Support for RoPE/ALiBi positional embeddings and Multi-Query Attention
However, each of these requires more changes to your source code than what is presented here.
## Related documentation
To learn more, check out the related documentation:
- [How to use Megatron-LM](https://huggingface.co/docs/accelerate/usage_guides/megatron_lm)
- [Examples showcasing the Megatron-LM integration of Accelerate](https://github.com/pacman100/accelerate-megatron-test)