
Accelerating Training

Gaudi offers several possibilities to make training faster. They are all compatible with each other and can be coupled with distributed training.

Lazy Mode

Two execution modes are proposed:

  • Lazy mode, where operations are accumulated in a graph whose execution is triggered in a lazy manner. This allows the graph compiler to optimize the device execution for these operations.
  • Eager mode, where one operation at a time is executed.

In lazy mode, the graph compiler generates optimized binary code that implements the given model topology on Gaudi. It performs operator fusion, data layout management, parallelization, pipelining and memory management, as well as graph-level optimizations.

To execute your training in lazy mode, you must provide the following training arguments:

from optimum.habana import GaudiTrainingArguments

args = GaudiTrainingArguments(
    # same arguments as in Transformers,
    use_habana=True,
    use_lazy_mode=True,
    gaudi_config_name=path_to_my_gaudi_config,
)
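
These arguments can then be passed to GaudiTrainer just as you would pass TrainingArguments to the regular Transformers Trainer. The sketch below assumes that model and train_dataset have already been defined:

from optimum.habana import GaudiTrainer

trainer = GaudiTrainer(
    model=model,                  # a Transformers model instantiated beforehand
    args=args,                    # the GaudiTrainingArguments defined above
    train_dataset=train_dataset,  # a training dataset prepared beforehand
)
trainer.train()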

In lazy mode, the last batch is padded with extra samples by default so that it has the same dimensions as the previous batches. This avoids extra graph compilations during training. You can also discard the last batch with dataloader_drop_last=True.

In lazy mode, the first couple of training iterations may be slower due to graph compilations. To exclude them from the throughput computed at the end of training, you can add the following training argument: throughput_warmup_steps=2.
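
For example, a sketch of training arguments combining these options (the values are illustrative):

from optimum.habana import GaudiTrainingArguments

args = GaudiTrainingArguments(
    # same arguments as in Transformers,
    use_habana=True,
    use_lazy_mode=True,
    gaudi_config_name=path_to_my_gaudi_config,
    dataloader_drop_last=True,    # drop the last incomplete batch instead of padding it
    throughput_warmup_steps=2,    # exclude the first 2 steps from the throughput computation
)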

Mixed-Precision Training

Mixed-precision training computes some operations using lighter data types to accelerate training. Habana Mixed Precision (HMP) mixes fp32 and bf16 operations.

Please refer to the list of supported PyTorch operators beforehand to make sure the ones you are interested in are compatible with bf16.

To apply HMP, set "use_habana_mixed_precision" to true in the Gaudi configuration file. You can then specify which operators to compute in bf16 with "hmp_bf16_ops" and which to compute in fp32 with "hmp_fp32_ops". If these are not specified, they default to the values used in the Gaudi configuration file of BERT, which is a good starting point for applying HMP:

"hmp_bf16_ops": [
    "add",
    "addmm",
    "bmm",
    "div",
    "dropout",
    "gelu",
    "iadd",
    "linear",
    "layer_norm",
    "matmul",
    "mm",
    "rsub",
    "softmax",
    "truediv"
],
"hmp_fp32_ops": [
    "embedding",
    "nll_loss",
    "log_softmax"
]
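
For illustration, such a configuration can be loaded and inspected in Python. This is a sketch: the Hub repository name Habana/bert-base-uncased and the attribute names (which mirror the JSON keys above) are assumptions:

from optimum.habana import GaudiConfig

# Load a Gaudi configuration assumed to contain the HMP keys shown above
gaudi_config = GaudiConfig.from_pretrained("Habana/bert-base-uncased")

print(gaudi_config.use_habana_mixed_precision)  # whether HMP is enabled
print(gaudi_config.hmp_bf16_ops)                # operators computed in bf16
print(gaudi_config.hmp_fp32_ops)                # operators kept in fp32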

Custom Operators

Habana provides a few custom operators that achieve better performance than their PyTorch counterparts on Gaudi. You can also define your own custom operator for Gaudi as described in Habana's documentation.

Fused ADAM

Habana provides a custom fused ADAM implementation. It can be used by specifying "use_fused_adam": true in the Gaudi configuration file.

The default value of epsilon is 1e-6 for the Habana fused ADAM optimizer, while it is 1e-8 for torch.optim.AdamW.

Fused Gradient Norm Clipping

Habana provides a custom gradient norm clipping implementation. It can be used by specifying "use_fused_clip_norm": true in the Gaudi configuration file.
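
As a sketch, both fused implementations can also be enabled programmatically, assuming GaudiConfig accepts these keys as keyword arguments and that save_pretrained writes a gaudi_config.json:

from optimum.habana import GaudiConfig

# Enable Habana's fused ADAM and fused gradient norm clipping
# (attribute names mirror the JSON keys described above; this is an assumption-based sketch)
gaudi_config = GaudiConfig(
    use_fused_adam=True,
    use_fused_clip_norm=True,
)
gaudi_config.save_pretrained("my_gaudi_config")

The resulting folder can then be passed to GaudiTrainingArguments through gaudi_config_name.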

Tracking Memory Usage

Live memory statistics are displayed every logging_steps steps (500 by default):

  • memory_allocated (GB) refers to the current memory consumption in GB,
  • max_memory_allocated (GB) refers to the maximum memory consumption reached during the run in GB,
  • total_memory_available (GB) refers to the total memory available on the device in GB.

These metrics can help you adjust the batch size of your runs.
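
For example, to display memory statistics more often, you can lower logging_steps in the training arguments (a sketch with illustrative values):

from optimum.habana import GaudiTrainingArguments

args = GaudiTrainingArguments(
    # same arguments as in Transformers,
    use_habana=True,
    use_lazy_mode=True,
    gaudi_config_name=path_to_my_gaudi_config,
    logging_steps=100,  # log live memory statistics every 100 steps instead of 500
)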

In distributed mode, memory stats are communicated only by the main process.

You can take a look at Habana Gaudi’s official documentation for more information about the memory stats API.