Configuration
The IPUConfig class configures PopArt and PopTorch, which lets you control the behavior of the IPUs. It is JSON-serializable and can be loaded from and saved to a local directory or file, as well as the 🤗 Hub.
IPUConfig
class optimum.graphcore.IPUConfig
( **kwargs )
Parameters
- seed (int, optional) — Sets the seed for the random number generator on the IPU.
- auto_loss_scaling (bool, optional, defaults to False) — Whether automatic loss scaling is enabled on the IPU. When using float16/half values for activations, gradients, and weights, the loss value needs to be scaled by a constant factor to avoid underflow/overflow. This adjustment is known as loss scaling. This setting automatically sets a global loss scaling factor during training. Note: This is an experimental feature and may not behave as expected.
- executable_cache_dir (str, optional, defaults to "") — Enables caching the compiled executables to a directory.
Parameters for controlling the batch size
- replication_factor (int, optional, defaults to 1) — The number of replicas for data-parallelism during training. It depends on the size of the pipeline as well as the number of IPUs available. For example: on a Pod16, with a 4-IPU pipeline, replication_factor must be between 1 and 4.
- inference_replication_factor (int, optional, defaults to 1) — Same as replication_factor for inference.
- gradient_accumulation_steps (int, optional, defaults to 1) — Number of micro-batches to accumulate for the gradient calculation. Accumulates the gradient gradient_accumulation_steps times before updating the model using the gradient.
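The Pod16 example above can be checked with a little arithmetic. The following is a minimal sketch, not part of the library: the helper name max_replication_factor is hypothetical, and it assumes every replica occupies exactly one full pipeline of IPUs.

```python
def max_replication_factor(total_ipus: int, ipus_per_replica: int) -> int:
    """Largest replica count that fits, assuming each replica uses one full pipeline."""
    if ipus_per_replica <= 0:
        raise ValueError("ipus_per_replica must be positive")
    if total_ipus < ipus_per_replica:
        raise ValueError("not enough IPUs for a single replica")
    return total_ipus // ipus_per_replica


# Pod16 with a 4-IPU pipeline: replication_factor can range from 1 to 4.
print(max_replication_factor(total_ipus=16, ipus_per_replica=4))  # 4
```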
Parameters related to parallelism
- layers_per_ipu (List[int]) — Specifies the number of layers that will be put on each IPU for pipelined execution. For instance: [2, 3, 4, 2] specifies a 4-IPU pipeline, where the first two layers will be put on IPU0, the following three on IPU1, the next four on IPU2 and the last two on IPU3. If the default of [-1] is used, the layers will be split evenly over ipus_per_replica IPUs. The wildcard value -1 can also be used in combination with integers. For instance: [1, 2, -1, -1] specifies a 4-IPU pipeline, where the first layer is put on IPU0, the next two layers on IPU1, and the remaining layers split evenly between IPU2 and IPU3.
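To make the wildcard behavior concrete, here is a minimal sketch of how -1 entries could be resolved into layer counts. It is an illustration rather than the library's implementation: the function name expand_layers_per_ipu is hypothetical, the list is assumed to already contain one entry per IPU, and the way leftover layers are distributed when the split is uneven (here, onto the last wildcard IPUs) is an assumption.

```python
from typing import List


def expand_layers_per_ipu(layers_per_ipu: List[int], num_layers: int) -> List[int]:
    """Replace each -1 wildcard with a concrete layer count (illustrative only)."""
    if any(n < -1 for n in layers_per_ipu):
        raise ValueError("entries must be non-negative integers or -1")
    wildcard_idxs = [i for i, n in enumerate(layers_per_ipu) if n == -1]
    fixed = sum(n for n in layers_per_ipu if n != -1)
    remaining = num_layers - fixed
    if remaining < 0 or (not wildcard_idxs and remaining != 0):
        raise ValueError("layers_per_ipu does not match the number of layers")
    expanded = list(layers_per_ipu)
    if not wildcard_idxs:
        return expanded
    quotient, remainder = divmod(remaining, len(wildcard_idxs))
    for pos, idx in enumerate(wildcard_idxs):
        # Assumption: any leftover layers are placed on the last wildcard IPUs.
        extra = 1 if pos >= len(wildcard_idxs) - remainder else 0
        expanded[idx] = quotient + extra
    return expanded


# An 11-layer model: IPU0 gets 1 layer, IPU1 gets 2, IPU2 and IPU3 share the other 8.
print(expand_layers_per_ipu([1, 2, -1, -1], num_layers=11))  # [1, 2, 4, 4]
```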
Parameters for memory management
- optimizer_state_offchip (bool, optional, defaults to True) — Whether to store the optimizer state in off-chip memory rather than on-chip memory.
- replicated_tensor_sharding (bool, optional, defaults to False) — Shards the optimizer between replicas with zero-redundancy.
- matmul_proportion (List[float] or float, optional, defaults to 0.6) — Sets the amount of temporary memory made available on a per-IPU basis. Use this setting to control the amount of temporary memory available to operations such as:
  - convolution
  - matrix multiplication
  - embedding lookups
  - indexing operations
- enable_half_partials (bool, optional, defaults to True) — Whether the data type of partial results for matrix multiplication and convolution operators should be float16 or not.
- embedding_serialization_factor (int, optional, defaults to 1) — The factor to use to serialize embeddings. Nothing happens if embedding_serialization_factor = 1, and for embedding_serialization_factor > 1, the torch.nn.Embedding layer is replaced by an optimum.graphcore.modeling_utils.SerializedEmbedding layer.
- recompute_checkpoint_every_layer (bool, optional, defaults to False) — Whether to use gradient checkpointing at the end of every layer. It can help reduce memory usage.
Parameters related to host / device synchronization
- device_iterations (int, optional, defaults to 1) — Number of iterations the device should run over the data before returning to the user during training. This is equivalent to running the IPU in a loop over the specified number of iterations, with a new batch of data each time. However, increasing device_iterations is more efficient because the loop runs on the IPU directly.
- inference_device_iterations (int, optional, defaults to 1) — Same as device_iterations for inference.
- output_mode (str, optional, defaults to "final") — Specifies which data to return from a model. Allowed values:
  - all: returns a result for each batch.
  - sum: returns the sum of all batches.
  - final: returns the last batch.
  - default: all for inference, final for training.
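The output_mode values can be illustrated on a plain list of per-batch results. This is a minimal sketch of the semantics described above, not the PopTorch implementation; the function names resolve_output_mode and reduce_outputs are hypothetical, and each batch result is simplified to a single float.

```python
from typing import List, Union


def resolve_output_mode(output_mode: str, for_inference: bool) -> str:
    """Map "default" to "all" for inference and "final" for training."""
    if output_mode == "default":
        return "all" if for_inference else "final"
    return output_mode


def reduce_outputs(batch_outputs: List[float], output_mode: str) -> Union[List[float], float]:
    """Pick what is returned to the host for a run over several batches."""
    if output_mode == "all":
        return list(batch_outputs)  # one result per batch
    if output_mode == "sum":
        return sum(batch_outputs)  # sum over all batches
    if output_mode == "final":
        return batch_outputs[-1]  # only the last batch
    raise ValueError(f"unknown output_mode: {output_mode!r}")


losses = [0.5, 0.25, 0.25]  # made-up values, one per device iteration
print(reduce_outputs(losses, "all"))    # [0.5, 0.25, 0.25]
print(reduce_outputs(losses, "sum"))    # 1.0
print(reduce_outputs(losses, "final"))  # 0.25
```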
Class for PopArt and PopTorch configuration. Handles the conversion to poptorch.Options as well as specializing the configuration for a pod type.
batch_size_factor
( for_inference: bool = False, pod_type: typing.Optional[str] = None ) → int
Computes the factor to apply to the micro batch size to get the combined batch size.
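As a rough illustration of what such a factor looks like, the sketch below multiplies the settings that control batching. The exact formula is an assumption (replication, gradient accumulation and device iterations for training; replication and device iterations for inference), and the BatchSettings dataclass is a hypothetical stand-in for the corresponding IPUConfig attributes, so the library's own result should be treated as authoritative.

```python
from dataclasses import dataclass


@dataclass
class BatchSettings:
    """Hypothetical stand-in holding the batching-related IPUConfig attributes."""
    replication_factor: int = 1
    inference_replication_factor: int = 1
    gradient_accumulation_steps: int = 1
    device_iterations: int = 1
    inference_device_iterations: int = 1


def batch_size_factor(settings: BatchSettings, for_inference: bool = False) -> int:
    # Assumption: no gradient accumulation happens at inference time.
    if for_inference:
        return settings.inference_replication_factor * settings.inference_device_iterations
    return (
        settings.replication_factor
        * settings.gradient_accumulation_steps
        * settings.device_iterations
    )


settings = BatchSettings(replication_factor=4, gradient_accumulation_steps=16, device_iterations=2)
micro_batch_size = 2
print(micro_batch_size * batch_size_factor(settings))  # 256
```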
for_pod_type
( pod_type: typing.Optional[str] = None ) → IPUConfig
Creates an IPUConfig specialized for a POD type.
to_options
( for_inference: bool = False, compile_only: bool = False, pod_type: typing.Optional[str] = None ) → poptorch.Options
Parameters
- for_inference (bool, defaults to False) — If True, the resulting poptorch.Options will be adapted for inference; otherwise, it will be adapted for training.
- compile_only (bool, defaults to False) — If True, compilation will be performed offline, so no IPUs are required.
- pod_type (str, optional) — The POD type to specialize the poptorch.Options for.
Returns
poptorch.Options
The options representing the IPUConfig.
Creates a poptorch.Options from the IPUConfig.
update_from_string
( update_str: str )
Updates attributes of this class with attributes from update_str.
The expected format is: ints, floats and strings as is; booleans as true or false; and lists as [a b c d]. For example: "n_embd=10,resid_pdrop=0.2,scale_attn_weights=false,summary_type=cls_index, matmul_proportion=[0.08 0.2 0.25 0.25]".
The keys to change have to already exist in the config object.
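The string format above can be parsed with a few lines of Python. The following is a minimal sketch operating on a plain dict, not the library's implementation: the names parse_value and apply_update_string are hypothetical, and value types are inferred from the text itself, which is an assumption (the real method may instead cast based on the type of the existing attribute).

```python
from typing import Any, Dict


def parse_value(raw: str) -> Any:
    """Convert one value: true/false, [a b c] lists, ints, floats, else a string."""
    raw = raw.strip()
    if raw in ("true", "false"):
        return raw == "true"
    if raw.startswith("[") and raw.endswith("]"):
        # List items are separated by whitespace, not commas.
        return [parse_value(item) for item in raw[1:-1].split()]
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_update_string(config: Dict[str, Any], update_str: str) -> None:
    """Update existing keys of config in place from a "k1=v1,k2=v2" string."""
    for pair in update_str.split(","):
        key, sep, raw = pair.partition("=")
        key = key.strip()
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        if key not in config:
            raise KeyError(f"key {key!r} does not exist in the config")
        config[key] = parse_value(raw)


config = {"resid_pdrop": 0.1, "scale_attn_weights": True, "matmul_proportion": 0.6}
apply_update_string(
    config,
    "resid_pdrop=0.2,scale_attn_weights=false, matmul_proportion=[0.08 0.2 0.25 0.25]",
)
print(config["matmul_proportion"])  # [0.08, 0.2, 0.25, 0.25]
```

Because lists are space-separated, splitting the whole string on commas first does not break list values apart.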