Multi-node Training
Using several Gaudi servers to perform multi-node training can be done easily. This guide shows how to:
- set up several Gaudi instances
- set up your computing environment
- launch a multi-node run
Setting up several Gaudi instances
Two types of configurations are possible:
- scale-out using Gaudi NICs or Host NICs (on-premises)
- scale-out using Intel® Tiber™ AI Cloud instances
On premises
To set up your servers on premises, check out the installation and distributed training pages of Intel® Gaudi® AI Accelerator’s documentation.
Intel Tiber AI Cloud instances
Follow the steps described in the creating an account and getting an instance pages of Intel® Gaudi® AI Accelerator’s documentation.
Launching a Multi-node Run
Once your Intel Gaudi instances are ready, follow the steps described in the setting up a multi-server environment page of Intel® Gaudi® AI Accelerator’s documentation.
Finally, there are two possible ways to run your training script on several nodes:
- With the `gaudi_spawn.py` script, you can run the following command (a sample hostfile is shown after this list):

  ```bash
  python gaudi_spawn.py \
      --hostfile path_to_my_hostfile --use_deepspeed \
      path_to_my_script.py --args1 --args2 ... --argsN \
      --deepspeed path_to_my_deepspeed_config
  ```

  where `--argsX` is an argument of the script to run.
- With the `DistributedRunner`, you can add this code snippet to a script (see the launch sketch after this list):

  ```python
  from optimum.habana.distributed import DistributedRunner

  distributed_runner = DistributedRunner(
      command_list=["path_to_my_script.py --args1 --args2 ... --argsN"],
      hostfile=path_to_my_hostfile,
      use_deepspeed=True,
  )
  ```
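To start the job from the `DistributedRunner` snippet, you then invoke the runner. A minimal sketch, assuming the runner exposes a `run()` method as in recent Optimum Habana releases:

```python
# Launch the distributed job described by the runner defined above.
# Assumes DistributedRunner provides a run() method (recent optimum-habana releases).
ret_code = distributed_runner.run()
```

Both approaches read a DeepSpeed-style hostfile listing the nodes to use. A minimal sketch with hypothetical host names, assuming the usual 8 HPUs per Gaudi server:

```
# path_to_my_hostfile -- hypothetical host names, 8 HPUs ("slots") per node
gaudi-node-1 slots=8
gaudi-node-2 slots=8
```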
Environment Variables
If you need to set environment variables for all nodes, you can specify them in a `.deepspeed_env` file which should be located in the local path you are executing from or in your home directory. The format is the following:

```
env_variable_1_name=value
env_variable_2_name=value
...
```
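For instance, a hypothetical `.deepspeed_env` that points every node at a shared dataset cache and reduces Transformers logging could look like this (both the variables and their values are illustrative, not requirements):

```
# Hypothetical example: shared Hugging Face datasets cache and quieter logs on all nodes
HF_DATASETS_CACHE=/shared/hf_datasets_cache
TRANSFORMERS_VERBOSITY=error
```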
Recommendations
- It is strongly recommended to use gradient checkpointing for multi-node runs to get the highest speedups. You can enable it with `--gradient_checkpointing` in these examples or with `gradient_checkpointing=True` in your `GaudiTrainingArguments` (see the sketch after this list).
- Larger batch sizes should lead to higher speedups.
- Multi-node inference is not recommended and can provide inconsistent results.
- On Intel Tiber AI Cloud instances, run your Docker containers with the `--privileged` flag so that EFA devices are visible.
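As an illustration of the first recommendation, here is a minimal sketch of enabling gradient checkpointing programmatically with `GaudiTrainingArguments`; the output directory, batch size, and DeepSpeed config path are placeholders chosen for this example:

```python
from optimum.habana import GaudiTrainingArguments

# Minimal sketch: training arguments for a multi-node DeepSpeed run
# with gradient checkpointing enabled. Paths and sizes are placeholders.
training_args = GaudiTrainingArguments(
    output_dir="/tmp/multi_node_run",
    use_habana=True,
    use_lazy_mode=True,
    gradient_checkpointing=True,  # recommended for multi-node runs
    per_device_train_batch_size=16,
    deepspeed="path_to_my_deepspeed_config",
)
```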
Example
In this example, we fine-tune a pre-trained GPT2-XL model on the WikiText dataset. We are going to use the causal language modeling example which is given in the GitHub repository.
The first step consists of training the model on several nodes with this command:
```bash
python ../gaudi_spawn.py \
    --hostfile path_to_hostfile --use_deepspeed run_clm.py \
    --model_name_or_path gpt2-xl \
    --gaudi_config_name Habana/gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --output_dir /tmp/gpt2_xl_multi_node \
    --learning_rate 4e-04 \
    --per_device_train_batch_size 16 \
    --gradient_checkpointing \
    --num_train_epochs 1 \
    --use_habana \
    --use_lazy_mode \
    --throughput_warmup_steps 3 \
    --deepspeed path_to_deepspeed_config
```
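The command above expects a DeepSpeed configuration file at `path_to_deepspeed_config`. A minimal sketch of what such a file could contain, assuming ZeRO stage 2 and bf16 as typically used with these examples (the exact values are illustrative):

```json
{
    "steps_per_print": 64,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {
        "enabled": true
    },
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": false,
        "reduce_scatter": false,
        "contiguous_gradients": false
    }
}
```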
Evaluation is not performed in the same command because we do not recommend performing multi-node inference at the moment.
Once the model is trained, we can evaluate it with the following command.
The argument `--model_name_or_path` should be equal to the argument `--output_dir` of the previous command.
```bash
python run_clm.py \
    --model_name_or_path /tmp/gpt2_xl_multi_node \
    --gaudi_config_name Habana/gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_eval \
    --output_dir /tmp/gpt2_xl_multi_node \
    --per_device_eval_batch_size 8 \
    --use_habana \
    --use_lazy_mode
```