## Accelerate configuration
Below is an example YAML for BF16 mixed-precision training using Megatron-LM with 2x data parallelism, 2x pipeline parallelism, and 2x tensor parallelism on 8 GPUs. It also uses sequence parallelism, selective activation checkpointing, and a sharded optimizer. The lines marked with `+` are the Megatron-LM-specific additions to a standard Accelerate config.
```diff
 compute_environment: LOCAL_MACHINE
 deepspeed_config: {}
+distributed_type: MEGATRON_LM
 downcast_bf16: 'no'
 dynamo_backend: 'NO'
 fsdp_config: {}
 machine_rank: 0
 main_training_function: main
+megatron_lm_config:
+  megatron_lm_gradient_clipping: 1.0
+  megatron_lm_num_micro_batches: 2
+  megatron_lm_pp_degree: 2
+  megatron_lm_recompute_activations: true
+  megatron_lm_sequence_parallelism: true
+  megatron_lm_tp_degree: 2
+  megatron_lm_use_distributed_optimizer: true
 mixed_precision: bf16
 num_machines: 1
 num_processes: 8
 rdzv_backend: static
 same_network: true
 use_cpu: false
```
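As a quick sanity check on the layout above (plain arithmetic, not an Accelerate API), the data-parallel degree is whatever remains after tensor and pipeline parallelism divide the 8 processes, and Megatron-LM's global batch size is the per-device micro batch size scaled by the number of micro batches and the data-parallel degree. In the sketch below, `micro_batch_size` is a placeholder for your dataloader's per-device batch size:

```python
# Plain arithmetic mirroring the config above; no Accelerate/Megatron-LM imports needed.
num_processes = 8        # num_processes in the YAML
tp_degree = 2            # megatron_lm_tp_degree
pp_degree = 2            # megatron_lm_pp_degree
num_micro_batches = 2    # megatron_lm_num_micro_batches
micro_batch_size = 4     # placeholder: your per-device batch size

dp_degree = num_processes // (tp_degree * pp_degree)                  # -> 2-way data parallelism
global_batch_size = micro_batch_size * num_micro_batches * dp_degree  # -> 16 in this sketch

assert tp_degree * pp_degree * dp_degree == num_processes
```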
## Code changes
The required code changes relative to a standard Accelerate training loop are shown below; lines prefixed with `+` are additions and lines prefixed with `-` are removals.

```diff
 from accelerate import Accelerator
+from accelerate.utils import MegatronLMDummyScheduler

 accelerator = Accelerator()
 ...
-lr_scheduler = get_scheduler(
-    name=args.lr_scheduler_type,
-    ...
-)
+lr_scheduler = MegatronLMDummyScheduler(
+    optimizer=optimizer,
+    num_warmup_steps=...,
+    num_training_steps=...,
+)

 model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
     model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
 )

 total_batch_size = (
-    args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
+    accelerator.state.megatron_lm_plugin.global_batch_size
 )

 # in evaluation loop
 for step, batch in enumerate(eval_dataloader):
     with torch.no_grad():
         outputs = model(**batch)
     loss = outputs.loss
-    losses.append(accelerator.gather_for_metrics(loss.repeat(args.per_device_eval_batch_size)))
+    losses.append(loss)  # for Megatron-LM, losses are already averaged across the data parallel group

-losses = torch.cat(losses)
+losses = torch.tensor(losses)
```
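After the evaluation loop, the collected losses can be reduced to an evaluation loss and perplexity in the usual way. A minimal sketch, assuming the `losses` list built in the loop above and a causal language modeling objective:

```python
import math

import torch

# Each entry in `losses` was already averaged over the data parallel group by Megatron-LM,
# so a plain mean over the collected values is all that is needed.
eval_loss = torch.mean(torch.tensor(losses))
try:
    perplexity = math.exp(eval_loss.item())
except OverflowError:
    perplexity = float("inf")
```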
## Launching the training
| If the YAML was generated through the `accelerate config` command: | |
| ``` | |
| accelerate launch {script_name.py} {--arg1} {--arg2} ... | |
| ``` | |
| If the YAML is saved to a `~/config.yaml` file: | |
| ``` | |
| accelerate launch --config_file ~/config.yaml {script_name.py} {--arg1} {--arg2} ... | |
| ``` | |
Or you can pass the right configuration parameters directly to `accelerate launch` and skip the `config.yaml` file entirely:
| ``` | |
| accelerate launch \ | |
| --use_megatron_lm \ | |
| --num_processes=8 \ | |
| --mixed_precision=bf16 \ | |
| --megatron_lm_tp_degree=2 \ | |
| --megatron_lm_pp_degree=2 \ | |
| --megatron_lm_num_micro_batches=2 \ | |
| --megatron_lm_sequence_parallelism=true \ | |
| --megatron_lm_recompute_activations=true \ | |
| --megatron_lm_use_distributed_optimizer=true \ | |
| {script_name.py} {--arg1} {--arg2} ... | |
| ``` | |
## Supported models and required changes
For Megatron-LM, the supported models are the Transformers GPT2, Megatron-BERT, and T5 models, covering the decoder-only, encoder-only, and encoder-decoder model classes. Given the complexity of Megatron-LM's features, four changes are required to get started:
1. Use `accelerate.utils.MegatronLMDummyScheduler`: Megatron-LM uses its own optimizer implementation, so a scheduler compatible with it is required.
2. Computing the total batch size now needs to account for the tensor and pipeline parallel degrees (use `accelerator.state.megatron_lm_plugin.global_batch_size`).
3. Losses are already averaged across the data parallel group, so no additional gathering is needed for metrics.
4. Save the model using `accelerator.save_state` instead of the Transformers `save_pretrained` method (see the sketch after this list).
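For point 4, a minimal sketch of what checkpointing can look like; `ckpt_dir` is a placeholder path, and resuming uses the matching `accelerator.load_state` call:

```python
# Point 4: checkpoint through Accelerate rather than model.save_pretrained();
# "ckpt_dir" is a placeholder path, not defined elsewhere in this guide.
accelerator.save_state("ckpt_dir")

# ...later, to resume training from that checkpoint:
accelerator.load_state("ckpt_dir")
```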
The Accelerate Megatron-LM integration supports many advanced features, such as:
- Leveraging custom training steps
- Using Megatron-LM indexed datasets
- Checkpoint reshaping and interoperability utilities
- Using `megatron_generate` for text generation with tensor and pipeline parallelism
- Support for RoPE/ALiBi positional embeddings and Multi-Query Attention
However, each of these requires more changes to your source code than what is presented here.
## Related documentation
To learn more, check out the related documentation:
- [How to use Megatron-LM](https://huggingface.co/docs/accelerate/usage_guides/megatron_lm)
- [Examples showcasing the Megatron-LM integration of Accelerate](https://github.com/pacman100/accelerate-megatron-test)