Gaudi offers several possibilities to make training faster. They are all compatible with each other and can be coupled with distributed training.
Two execution modes are available:
- Lazy mode, where the Habana bridge internally accumulates operations in a graph. The execution of the operations in the accumulated graph is triggered in a lazy manner. This allows the bridge to construct a graph with multiple operations, which provides the graph compiler the opportunity to optimize the device execution for these operations.
- Eager mode, where one operation at a time is executed.
In lazy mode, the graph compiler generates optimized binary code that implements the given model topology on Gaudi. It performs operator fusion, data layout management, parallelization, pipelining and memory management, as well as graph-level optimizations.
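The difference between the two modes can be pictured with a toy model. The sketch below is purely illustrative and is not the Habana bridge or any part of its API: `LazyTensor`, `pending_ops` and `materialize` are hypothetical names. It only shows the general idea of recording operations into a graph and running them later, so that the whole graph is visible before anything executes.

```python
# Toy illustration only: this is NOT the Habana bridge or its API.
class LazyTensor:
    """Records operations instead of running them immediately."""

    def __init__(self, value=None, op=None, inputs=()):
        self.value = value    # concrete value: set for leaves, or after execution
        self.op = op          # name of the pending operation, None for leaves
        self.inputs = inputs  # LazyTensor operands of the pending operation

    def __add__(self, other):
        return LazyTensor(op="add", inputs=(self, other))

    def __mul__(self, other):
        return LazyTensor(op="mul", inputs=(self, other))

    def pending_ops(self):
        """Return the accumulated, not yet executed operations in order."""
        if self.op is None or self.value is not None:
            return []
        ops = []
        for operand in self.inputs:
            ops.extend(operand.pending_ops())
        ops.append(self.op)
        return ops

    def materialize(self):
        """Trigger execution of the whole accumulated graph."""
        if self.value is None:
            a, b = (operand.materialize() for operand in self.inputs)
            self.value = a + b if self.op == "add" else a * b
        return self.value


x, y = LazyTensor(2.0), LazyTensor(3.0)
z = (x + y) * y           # nothing is computed yet, operations are only recorded
graph = z.pending_ops()   # the full graph can be inspected (and optimized) here
result = z.materialize()  # execution is triggered only at this point
```

In eager mode, by contrast, each of the two operations would run as soon as it is written, so no optimizer would ever see them together.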
To execute your training in lazy mode, you must provide the following training arguments:
    from optimum.habana import GaudiTrainingArguments

    args = GaudiTrainingArguments(
        # same arguments as in Transformers,
        use_habana=True,
        use_lazy_mode=True,
        gaudi_config_name=path_to_my_gaudi_config,
    )
In lazy mode, the first couple of training iterations may be slower due to graph compilations.
To exclude them from the throughput computed at the end of training, you can add the following training argument:
Mixed-precision training computes some operations with lighter data types to accelerate training. Habana Mixed Precision (HMP) mixes fp32 and bf16 operations.
Please refer to the list of supported PyTorch operators beforehand to make sure the ones you are interested in are compatible with bf16.
In order to apply HMP, you must set
"O1" in the Gaudi configuration file.
Then, you can specify which operators to compute in bf16 with "hmp_bf16_ops" and which operators to compute in fp32 with "hmp_fp32_ops".
If these operators are not specified, they default to the ones in the Gaudi configuration file of BERT, which is a good starting point for applying HMP:

    "hmp_bf16_ops": [
        "add", "addmm", "bmm", "div", "dropout", "gelu", "iadd",
        "linear", "layer_norm", "matmul", "mm", "rsub", "softmax", "truediv"
    ],
    "hmp_fp32_ops": [
        "embedding", "nll_loss", "log_softmax"
    ]
Habana provides a few custom operators that achieve better performance than their PyTorch counterparts on Gaudi. You can also define your own custom operator for Gaudi as described here.
Habana provides a custom fused ADAM implementation.
It can be used by specifying "use_fused_adam": true in the Gaudi configuration file.
The default value of epsilon is
1e-6 for the Habana fused ADAM optimizer, while it is
Habana provides a custom gradient norm clipping implementation.
It can be used by specifying "use_fused_clip_norm": true in the Gaudi configuration file.
Live memory statistics are displayed every logging_steps (default is 500) steps:
- memory_allocated (GB) refers to the current memory consumption in GB,
- max_memory_allocated (GB) refers to the maximum memory consumption reached during the run in GB,
- total_memory_available (GB) refers to the total memory available on the device in GB.
These metrics can help you adjust the batch size of your runs.
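One way to read these numbers is to look at how close the peak consumption came to the device limit. The sketch below assumes the statistics are available as a dictionary keyed by the metric names above. The helper `memory_headroom` is hypothetical and the values are made up for the example, not measured on a device.

```python
def memory_headroom(logs):
    """Summarize how much device memory was left at the peak of the run."""
    peak = logs["max_memory_allocated (GB)"]
    total = logs["total_memory_available (GB)"]
    return {
        "peak_fraction": peak / total,     # share of the device used at the peak
        "free_at_peak_gb": total - peak,   # memory still unused at the peak
    }


# Made-up values for illustration only.
example_logs = {
    "memory_allocated (GB)": 20.0,
    "max_memory_allocated (GB)": 24.0,
    "total_memory_available (GB)": 32.0,
}

headroom = memory_headroom(example_logs)
print(headroom)
```

A peak fraction well below 1 suggests there may be room for a larger batch size, while a value close to 1 suggests the current one is near the limit.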
In distributed mode, memory stats are communicated only by the main process.
You can take a look at Habana Gaudi’s official documentation for more information about the memory stats API.