Accelerating Training

Gaudi offers several possibilities to make training faster. They are all compatible with each other and can be coupled with distributed training.

Lazy Mode

Two execution modes are proposed:

Lazy mode, where operations are accumulated in a graph whose execution is triggered in a lazy manner. This allows the graph compiler to optimize the device execution for these operations.
Eager mode, where one operation at a time is executed.

In lazy mode, the graph compiler generates optimized binary code that implements the given model topology on Gaudi. It performs operator fusion, data layout management, parallelization, pipelining and memory management, as well as graph-level optimizations.

To execute your training in lazy mode, you must provide the following training arguments:

args = GaudiTrainingArguments(
    # same arguments as in Transformers,
    use_habana=True,
    use_lazy_mode=True,
    gaudi_config_name=path_to_my_gaudi_config
)

In lazy mode, the last batch is filled with extra samples by default so that it has the same dimensions as previous batches. This enables to avoid extra graph compilations during training. You can also discard the last batch with dataloader_drop_last=True.

In lazy mode, the first two or three training iterations may be slower due to graph compilations. To not take them into account in the computation of the throughput at the end of the training, you can add the following training argument: throughput_warmup_steps=3.

Mixed-Precision Training

Mixed-precision training enables to compute some operations using lighter data types to accelerate training. Optimum Habana enables mixed precision training in a similar fashion as 🤗 Transformers:

argument --bf16 enables usage of PyTorch autocast
argument --half_precision_backend [hpu_amp, cpu_amp] is used to specify a device on which mixed precision operations should be performed

Please refer to the advanced autocast usage on Gaudi for more informations regarding:

default autocast operations
default autocast operations override

HPU Graphs

The flexibility of PyTorch comes at a price - usually the same pythonic logic is processed every training step over and over. This may lead to a situation where it takes longer for CPU to schedule the work on Habana accelerator than it is effectively computed by it. To cope with such host-bound workloads, you may want to try enabling the HPU Graphs feature, which records the computational graph once, then only triggers it for execution much faster multiple times.

To do so, specify --use_hpu_graphs_for_training True. This option will wrap the model in habana_frameworks.torch.hpu.ModuleCacher, which automatically records HPU Graphs on the model’s usage.

For multi-worker distributed training, you also need to specify --distribution_strategy fast_ddp. This option replaces the usage of torch.nn.parallel.DistributedDataParallel with much simpler and usually faster optimum.habana.distributed.all_reduce_gradients.

Use with caution: currently using HPU Graphs for training may not support all the possible cases. However, the potential performance gain could be dramatic!

Fast DDP

For distributed training on several devices, you can also specify --distribution_strategy fast_ddp. This option replaces the usage of torch.nn.parallel.DistributedDataParallel with much simpler and usually faster optimum.habana.distributed.all_reduce_gradients.

Pipelining Forward and Backward Passes

There are two stages when running models on Habana HPU: python code interpretation on CPU and HPU recipe computation. The HPU computation stage can be triggered manually or when a copy to the CPU is requested, and generally HPU computation is triggered after loss.backward() to make the CPU code interpretation and HPU recipe computation overlap as shown in the following illustration:

CPU:...forward + backward   ...optimizer  ...forward + backward   ...optimizer  ...
HPU:........................forward + backward...optimizer......forward + backward...optimizer

However, when CPU code interpretation takes longer than HPU computation, it becomes the bottleneck and HPU computation can not be triggered until CPU code interpretation is done. So one potential optimization for such cases is to trigger the HPU forward computation right after the CPU forward interpretation and before the CPU backward interpretation. You can see an example below where the CPU backward interpretation overlaps with the HPU forward computation:

CPU:...forward   ...backward   ...optimizer  ...forward   ...backward   ...optimizer   ...
HPU:.............forward.......backward......optimizer......forward.....backward.......optimizer

To enable this optimization, you can set the following training argument --pipelining_fwd_bwd True.

We recommend using it on Gaudi2 as the host will often be the bottleneck. You should be able to see a speedup on first-generation Gaudi too, but it will be less significant than on Gaudi2 because your run is more likely to be HPU-bound.

Furthermore, when training models that require large device memory, we suggest disabling this optimization because it will increase the HPU memory usage.

Use More Workers for Data Loading

If the workload of the data loader is heavy, you can increase the number of workers to make your run faster. You can enable this with the training argument --dataloader_num_workers N with N being the number of workers to use.

We recommend using it with datasets containing images. Besides, using --dataloader_num_workers 1 should help in most cases as it enables data loading in a thread different from the main one.

Non-Blocking Data Copy

This optimization is well-suited for models with a high cost of copying data from the host to the device (e.g. vision models like ViT or Swin). You can enable it with the training argument --non_blocking_data_copy True.

We recommend using it on Gaudi2 where the host can continue to execute other tasks (e.g. graph building) to get a better pipelining between the host and the device. On first-generation Gaudi, the device executing time is longer so one should not expect to get any speedup.

Custom Operators

Habana provides a few custom operators that achieve better performance than their PyTorch counterparts on Gaudi. You can also define your own custom operator for Gaudi as described here.

Fused ADAM

Habana provides a custom fused ADAM implementation. It can be used by specifying "use_fused_adam": true in the Gaudi configuration file.

The default value of epsilon is 1e-6 for the Habana fused ADAM optimizer, while it is 1e-8 for torch.optim.AdamW.

Fused Gradient Norm Clipping

Habana provides a custom gradient norm clipping implementation. It can be used by specifying "use_fused_clip_norm": true in the Gaudi configuration file.

Tracking Memory Usage

Live memory statistics are displayed every logging_steps (default is 500) steps:

memory_allocated (GB) refers to the current memory consumption in GB,
max_memory_allocated (GB) refers to the maximum memory consumption reached during the run in GB,
total_memory_available (GB) refers to the total memory available on the device in GB.

These metrics can help you to adjust the batch size of your runs.

In distributed mode, memory stats are communicated only by the main process.

You can take a look at Habana Gaudi’s official documentation for more information about the memory stats API.