Multi-node Training
Using several Gaudi servers to perform multi-node training can be done easily. This guide shows how to:
- set up several Gaudi instances
- set up your computing environment
- launch a multi-node run
Setting up several Gaudi instances
Two types of configurations are possible:
- scale-out using Gaudi NICs or Host NICs (on-premises)
- scale-out using Intel® Tiber™ AI Cloud instances
On premises
To set up your servers on premises, check out the installation and distributed training pages of Intel® Gaudi® AI Accelerator’s documentation.
Intel Tiber AI Cloud instances
Follow the steps described in the creating an account and getting an instance pages of Intel® Gaudi® AI Accelerator’s documentation.
Launching a Multi-node Run
Once your Intel Gaudi instances are ready, follow the steps described in the setting up a multi-server environment page of Intel® Gaudi® AI Accelerator’s documentation.
Finally, there are two possible ways to run your training script on several nodes:
- With the `gaudi_spawn.py` script, you can run the following command (a sample hostfile is shown after this list):

  ```bash
  python gaudi_spawn.py \
      --hostfile path_to_my_hostfile --use_deepspeed \
      path_to_my_script.py --args1 --args2 ... --argsN \
      --deepspeed path_to_my_deepspeed_config
  ```

  where `--argsX` is an argument of the script to run.
- With the `DistributedRunner`, you can add this code snippet to a script (see the launch sketch after this list):

  ```python
  from optimum.habana.distributed import DistributedRunner

  distributed_runner = DistributedRunner(
      command_list=["path_to_my_script.py --args1 --args2 ... --argsN"],
      hostfile=path_to_my_hostfile,
      use_deepspeed=True,
  )
  ```
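To start the job from the `DistributedRunner` snippet, you then invoke the runner. A minimal sketch, assuming the runner exposes a `run()` method as in recent Optimum Habana releases:

```python
# Launch the distributed job described by the runner defined above.
# Assumes DistributedRunner provides a run() method (recent optimum-habana releases).
ret_code = distributed_runner.run()
```

Both approaches read a DeepSpeed-style hostfile listing the nodes to use. A minimal sketch with hypothetical host names, assuming the usual 8 HPUs per Gaudi server:

```
# path_to_my_hostfile -- hypothetical host names, 8 HPUs ("slots") per node
gaudi-node-1 slots=8
gaudi-node-2 slots=8
```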
Environment Variables
If you need to set environment variables for all nodes, you can specify them in a `.deepspeed_env` file which should be located in the local path you are executing from or in your home directory. The format is the following:

```
env_variable_1_name=value
env_variable_2_name=value
...
```
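For instance, a hypothetical `.deepspeed_env` that points every node at a shared dataset cache and reduces Transformers logging could look like this (both the variables and their values are illustrative, not requirements):

```
# Hypothetical example: shared Hugging Face datasets cache and quieter logs on all nodes
HF_DATASETS_CACHE=/shared/hf_datasets_cache
TRANSFORMERS_VERBOSITY=error
```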
Recommendations
- It is strongly recommended to use gradient checkpointing for multi-node runs to get the highest speedups. You can enable it with `--gradient_checkpointing` in these examples or with `gradient_checkpointing=True` in your `GaudiTrainingArguments` (see the sketch after this list).
- Larger batch sizes should lead to higher speedups.
- Multi-node inference is not recommended and can provide inconsistent results.
- On Intel Tiber AI Cloud instances, run your Docker containers with the `--privileged` flag so that EFA devices are visible.
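As an illustration of the first recommendation, here is a minimal sketch of enabling gradient checkpointing programmatically with `GaudiTrainingArguments`; the output directory, batch size, and DeepSpeed config path are placeholders chosen for this example:

```python
from optimum.habana import GaudiTrainingArguments

# Minimal sketch: training arguments for a multi-node DeepSpeed run
# with gradient checkpointing enabled. Paths and sizes are placeholders.
training_args = GaudiTrainingArguments(
    output_dir="/tmp/multi_node_run",
    use_habana=True,
    use_lazy_mode=True,
    gradient_checkpointing=True,  # recommended for multi-node runs
    per_device_train_batch_size=16,
    deepspeed="path_to_my_deepspeed_config",
)
```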
Example
In this example, we fine-tune a pre-trained GPT2-XL model on the WikiText dataset. We are going to use the causal language modeling example which is given in the GitHub repository.
The first step consists of training the model on several nodes with this command:
```bash
python ../gaudi_spawn.py \
    --hostfile path_to_hostfile --use_deepspeed run_clm.py \
    --model_name_or_path gpt2-xl \
    --gaudi_config_name Habana/gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --output_dir /tmp/gpt2_xl_multi_node \
    --learning_rate 4e-04 \
    --per_device_train_batch_size 16 \
    --gradient_checkpointing \
    --num_train_epochs 1 \
    --use_habana \
    --use_lazy_mode \
    --throughput_warmup_steps 3 \
    --deepspeed path_to_deepspeed_config
```
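The command above expects a DeepSpeed configuration file at `path_to_deepspeed_config`. A minimal sketch of what such a file could contain, assuming ZeRO stage 2 and bf16 as typically used with these examples (the exact values are illustrative):

```json
{
    "steps_per_print": 64,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {
        "enabled": true
    },
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": false,
        "reduce_scatter": false,
        "contiguous_gradients": false
    }
}
```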
Evaluation is not performed in the same command because we do not recommend performing multi-node inference at the moment.
Once the model is trained, we can evaluate it with the following command.
The argument `--model_name_or_path` should be equal to the argument `--output_dir` of the previous command.
```bash
python run_clm.py \
    --model_name_or_path /tmp/gpt2_xl_multi_node \
    --gaudi_config_name Habana/gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_eval \
    --output_dir /tmp/gpt2_xl_multi_node \
    --per_device_eval_batch_size 8 \
    --use_habana \
    --use_lazy_mode
```