Utilities for Trainer

This page lists all the utility functions used by Trainer.

Most of these are only useful if you are studying the Trainer code in the library.


class transformers.EvalPrediction(predictions: Union[numpy.ndarray, Tuple[numpy.ndarray]], label_ids: numpy.ndarray)[source]

Evaluation output (always contains labels), to be used to compute metrics.

  • predictions (np.ndarray or tuple of np.ndarray) – Predictions of the model.

  • label_ids (np.ndarray) – Targets to be matched.
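As an illustration of how these two fields are typically consumed, here is a minimal sketch of a metric function. To keep it runnable without the library installed, it uses a stand-in named tuple, EvalPredictionLike (a hypothetical name, not part of transformers), with the same two fields; compute_accuracy is also an illustrative name, and the sketch assumes the predictions are classification logits.

```python
from typing import NamedTuple

import numpy as np


class EvalPredictionLike(NamedTuple):
    """Stand-in with the same two fields as EvalPrediction (illustration only)."""
    predictions: np.ndarray
    label_ids: np.ndarray


def compute_accuracy(eval_pred):
    # Assumes predictions are logits of shape (num_samples, num_classes).
    predicted = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": float((predicted == eval_pred.label_ids).mean())}


example = EvalPredictionLike(
    predictions=np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]),
    label_ids=np.array([1, 0, 0, 0]),
)
metrics = compute_accuracy(example)
```

Three of the four predicted classes match the labels here, so the accuracy is 0.75.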

transformers.set_seed(seed: int)[source]

Helper function for reproducible behavior: sets the seed in random, numpy, torch and/or tf (if installed).


seed (int) – The seed to set.
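The idea can be sketched roughly as follows. This is not the library's implementation: set_seed_sketch is a hypothetical name, and the sketch only covers random, numpy and torch, skipping any package that is not installed.

```python
import random


def set_seed_sketch(seed):
    """Seed every random number generator that is available (illustration only)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # numpy is not installed, nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        # Safe to call without CUDA; it is silently ignored in that case.
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # torch is not installed, nothing to seed


# Seeding twice with the same value replays the same random sequence.
set_seed_sketch(42)
first = [random.random() for _ in range(3)]
set_seed_sketch(42)
second = [random.random() for _ in range(3)]
```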

transformers.torch_distributed_zero_first(local_rank: int)[source]

Context manager (used in a with block) that makes all processes in distributed training wait for each local_master to do something first.


local_rank (int) – The rank of the local process.
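The wait-for-the-local-master pattern can be simulated without a distributed setup. In the sketch below, threads stand in for processes and threading.Barrier stands in for a distributed barrier; zero_first_sketch and worker are hypothetical names, and this is an assumption about the general pattern rather than the library's code.

```python
import threading
from contextlib import contextmanager


@contextmanager
def zero_first_sketch(local_rank, barrier):
    # Every rank except 0 blocks here until rank 0 reaches the barrier.
    if local_rank != 0:
        barrier.wait()
    yield
    # Rank 0 reaches the barrier only after its body has run, releasing the others.
    if local_rank == 0:
        barrier.wait()


order = []
barrier = threading.Barrier(3)


def worker(rank):
    with zero_first_sketch(rank, barrier):
        order.append(rank)  # e.g. download or preprocess a dataset once


# Start rank 0 last to show that it still runs its body first.
threads = [threading.Thread(target=worker, args=(rank,)) for rank in (2, 1, 0)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Whatever the start order, rank 0 is always the first entry in order, because the other ranks cannot pass the barrier until rank 0 arrives at it.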

Callbacks internals

class transformers.trainer_callback.CallbackHandler(callbacks, model, optimizer, lr_scheduler)[source]

Internal class that just calls the list of callbacks in order.
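A minimal sketch of the "call each callback in order" idea, assuming callbacks are plain objects exposing event methods. ToyCallbackHandler, call_event and LogCallback are hypothetical names used for illustration only, not the library's API.

```python
class ToyCallbackHandler:
    """Dispatches an event to every registered callback, in registration order."""

    def __init__(self, callbacks):
        self.callbacks = list(callbacks)

    def call_event(self, event, *args, **kwargs):
        results = []
        for callback in self.callbacks:
            method = getattr(callback, event, None)
            if method is not None:  # skip callbacks that ignore this event
                results.append(method(*args, **kwargs))
        return results


class LogCallback:
    def __init__(self, name, log):
        self.name = name
        self.log = log

    def on_step_end(self, step):
        self.log.append((self.name, step))


log = []
handler = ToyCallbackHandler([LogCallback("first", log), LogCallback("second", log)])
handler.call_event("on_step_end", 1)
```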

Distributed Evaluation

class transformers.trainer_pt_utils.DistributedTensorGatherer(world_size, num_samples, make_multiple_of=None, padding_index=-100)[source]

A class responsible for properly gathering tensors (or nested list/tuple of tensors) on the CPU by chunks.

If our dataset has 16 samples with a batch size of 2 on 3 processes, and we gather and then transfer to the CPU at every step, our sampler will generate the following indices:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1]

so that the total length is a multiple of 3 (and each process gets the same dataset length). Processes 0, 1 and 2 will then be responsible for making predictions for the following samples:

  • P0: [0, 1, 2, 3, 4, 5]

  • P1: [6, 7, 8, 9, 10, 11]

  • P2: [12, 13, 14, 15, 0, 1]

The first batch processed on each process will be:

  • P0: [0, 1]

  • P1: [6, 7]

  • P2: [12, 13]

So if we gather at the end of the first batch, we will get a tensor (or nested list/tuple of tensors) corresponding to the following indices:

[0, 1, 6, 7, 12, 13]

If we directly concatenate our results without taking any precautions, the user will then get the predictions for the indices in this order at the end of the prediction loop:

[0, 1, 6, 7, 12, 13, 2, 3, 8, 9, 14, 15, 4, 5, 10, 11, 0, 1]

For some reason, that’s not going to float their boat: the predictions no longer line up with the dataset order. This class is there to solve that problem.
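The index bookkeeping above can be reproduced in a few lines of plain Python. This is a sketch of the arithmetic only; the helper names (padded_indices, split_across_processes, naive_gather_order) are hypothetical and not part of the library.

```python
def padded_indices(num_samples, world_size):
    # Round the length up to a multiple of world_size by repeating leading indices.
    total = -(-num_samples // world_size) * world_size
    indices = list(range(num_samples))
    return indices + indices[: total - num_samples]


def split_across_processes(indices, world_size):
    # Each process gets one contiguous shard of equal length.
    per_process = len(indices) // world_size
    return [indices[rank * per_process:(rank + 1) * per_process]
            for rank in range(world_size)]


def naive_gather_order(shards, batch_size):
    # Concatenate what a gather returns at every step, with no reordering.
    order = []
    for start in range(0, len(shards[0]), batch_size):
        for shard in shards:
            order.extend(shard[start:start + batch_size])
    return order


indices = padded_indices(num_samples=16, world_size=3)
shards = split_across_processes(indices, world_size=3)
naive = naive_gather_order(shards, batch_size=2)
```

With these inputs, indices, shards and naive match the three index lists shown above.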

  • world_size (int) – The number of processes used in the distributed training.

  • num_samples (int) – The number of samples in our dataset.

  • make_multiple_of (int, optional) – If passed, the class assumes the datasets passed to each process are made to be a multiple of this argument (by adding samples).

  • padding_index (int, optional, defaults to -100) – The padding index to use if the arrays don’t all have the same sequence length.


Add arrays to the internal storage. The storage is initialized to its full size when the first arrays are passed, so that if we’re bound to get an OOM, it happens at the beginning.


Return the properly gathered arrays and truncate to the number of samples (since the sampler added some extras to get each process a dataset of the same length).
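The add-then-finalize flow described above can be sketched with plain lists. ToyGatherer and its method names are chosen for this illustration and make no claim about the library's internals; the sketch assumes every gathered step holds one equally sized chunk per process, laid out in rank order.

```python
class ToyGatherer:
    """Puts per-step gathered chunks back into dataset order (illustration only)."""

    def __init__(self, world_size, num_samples):
        self.world_size = world_size
        self.num_samples = num_samples
        # Padded length: num_samples rounded up to a multiple of world_size.
        self.total = -(-num_samples // world_size) * world_size
        per_process = self.total // world_size
        # Next write position inside each process's slot of the storage.
        self.offsets = [rank * per_process for rank in range(world_size)]
        self.storage = None

    def add_arrays(self, gathered):
        if self.storage is None:
            # Allocate the full storage up front, on the first call.
            self.storage = [None] * self.total
        chunk = len(gathered) // self.world_size
        for rank in range(self.world_size):
            start = self.offsets[rank]
            self.storage[start:start + chunk] = gathered[rank * chunk:(rank + 1) * chunk]
            self.offsets[rank] += chunk

    def finalize(self):
        # Drop the extra samples the sampler added as padding.
        return self.storage[: self.num_samples]


gatherer = ToyGatherer(world_size=3, num_samples=16)
for step in ([0, 1, 6, 7, 12, 13], [2, 3, 8, 9, 14, 15], [4, 5, 10, 11, 0, 1]):
    gatherer.add_arrays(step)
result = gatherer.finalize()
```

Feeding the three gathered steps from the example yields the indices 0 through 15 in order, with the two repeated samples truncated.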