Trainer

The Trainer and TFTrainer classes provide an API for feature-complete training in most standard use cases. They are used in most of the example scripts.

Before instantiating your Trainer/TFTrainer, create a TrainingArguments/TFTrainingArguments to access all the points of customization during training.

The API supports distributed training on multiple GPUs/TPUs and mixed precision, through NVIDIA Apex for PyTorch and tf.keras.mixed_precision for TensorFlow.

Both Trainer and TFTrainer contain the basic training loop supporting the previous features. To inject custom behavior you can subclass them and override the following methods:

  • get_train_dataloader/get_train_tfdataset – Creates the training DataLoader (PyTorch) or TF Dataset.

  • get_eval_dataloader/get_eval_tfdataset – Creates the evaluation DataLoader (PyTorch) or TF Dataset.

  • get_test_dataloader/get_test_tfdataset – Creates the test DataLoader (PyTorch) or TF Dataset.

  • log – Logs information on the various objects watching training.

  • setup_wandb – Sets up wandb (see here for more information).

  • create_optimizer_and_scheduler – Sets up the optimizer and learning rate scheduler if they were not passed at init.

  • compute_loss – Computes the loss on a batch of training inputs.

  • training_step – Performs a training step.

  • prediction_step – Performs an evaluation/test step.

  • run_model (TensorFlow only) – Basic pass through the model.

  • evaluate – Runs an evaluation loop and returns metrics.

  • predict – Returns predictions (with metrics if labels are available) on a test set.

Here is an example of how to customize Trainer using a custom loss function:

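from transformers import Trainer

class MyTrainer(Trainer):
    def compute_loss(self, model, inputs):
        # Remove the labels so the model does not compute its own loss.
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Without labels, the first output is the logits for most models.
        logits = outputs[0]
        # my_custom_loss is a placeholder for your own loss function.
        return my_custom_loss(logits, labels)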

Trainer

class transformers.Trainer(model: transformers.modeling_utils.PreTrainedModel = None, args: transformers.training_args.TrainingArguments = None, data_collator: Optional[NewType.<locals>.new_type] = None, train_dataset: Optional[torch.utils.data.dataset.Dataset] = None, eval_dataset: Optional[torch.utils.data.dataset.Dataset] = None, tokenizer: Optional[PreTrainedTokenizerBase] = None, model_init: Callable[transformers.modeling_utils.PreTrainedModel] = None, compute_metrics: Optional[Callable[transformers.trainer_utils.EvalPrediction, Dict]] = None, tb_writer: Optional[SummaryWriter] = None, optimizers: Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None), **kwargs)[source]

Trainer is a simple but feature-complete training and eval loop for PyTorch, optimized for 🤗 Transformers.

Parameters
  • model (PreTrainedModel, optional) – The model to train, evaluate or use for predictions. If not provided, a model_init must be passed.

  • args (TrainingArguments, optional) – The arguments to tweak for training. Will default to a basic instance of TrainingArguments with the output_dir set to a directory named tmp_trainer in the current directory if not provided.

  • data_collator (DataCollator, optional) – The function to use to form a batch from a list of elements of train_dataset or eval_dataset. Will default to default_data_collator() if no tokenizer is provided, an instance of DataCollatorWithPadding() otherwise.

  • train_dataset (torch.utils.data.dataset.Dataset, optional) – The dataset to use for training. If it is a datasets.Dataset, columns not accepted by the model.forward() method are automatically removed.

  • eval_dataset (torch.utils.data.dataset.Dataset, optional) – The dataset to use for evaluation. If it is a datasets.Dataset, columns not accepted by the model.forward() method are automatically removed.

  • tokenizer (PreTrainedTokenizerBase, optional) – The tokenizer used to preprocess the data. If provided, it will be used to automatically pad the inputs to the maximum length when batching, and it will be saved along with the model to make it easier to rerun an interrupted training or reuse the fine-tuned model.

  • model_init (Callable[[], PreTrainedModel], optional) – A function that instantiates the model to be used. If provided, each call to train() will start from a new instance of the model as given by this function.

  • compute_metrics (Callable[[EvalPrediction], Dict], optional) – The function that will be used to compute metrics at evaluation. Must take an EvalPrediction and return a dictionary mapping strings to metric values.

  • tb_writer (SummaryWriter, optional) – Object to write to TensorBoard.

  • optimizers (Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR], optional) – A tuple containing the optimizer and the scheduler to use. Will default to an instance of AdamW on your model and a scheduler given by get_linear_schedule_with_warmup() controlled by args.

  • kwargs – Deprecated keyword arguments.
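As an illustration of the compute_metrics contract described above, here is a minimal sketch of an accuracy metric. It assumes a classification task where predictions holds one row of class scores per example. EvalPredictionLike is a hypothetical stand-in that only mimics the predictions and label_ids fields of EvalPrediction so the snippet runs without the library; in practice Trainer would pass a real EvalPrediction, typically holding NumPy arrays.

```python
from typing import Dict, NamedTuple, Sequence


class EvalPredictionLike(NamedTuple):
    """Hypothetical stand-in with the same two fields as EvalPrediction."""
    predictions: Sequence[Sequence[float]]  # one row of class scores per example
    label_ids: Sequence[int]                # gold class index per example


def compute_metrics(eval_pred) -> Dict[str, float]:
    """Return accuracy as a {metric name: value} dictionary."""
    # Pick the index of the highest score in each row (a plain-Python argmax).
    predicted = [
        max(range(len(scores)), key=lambda i: scores[i])
        for scores in eval_pred.predictions
    ]
    correct = sum(int(p == y) for p, y in zip(predicted, eval_pred.label_ids))
    return {"accuracy": correct / max(1, len(predicted))}


example = EvalPredictionLike(
    predictions=[[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]],
    label_ids=[1, 0, 0],
)
metrics = compute_metrics(example)
```

A function of this shape would be passed as the compute_metrics argument when creating the Trainer.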

compute_loss(model, inputs)[source]

How the loss is computed by Trainer. By default, all models return the loss in the first element.

Subclass and override for custom behavior.

create_optimizer_and_scheduler(num_training_steps: int)[source]

Set up the optimizer and the learning rate scheduler.

We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the Trainer’s init through optimizers, or subclass and override this method.
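To make the default schedule more concrete, the sketch below computes the multiplier that a linear schedule with warmup applies to the base learning rate at a given step. It is a plain-Python approximation of the behavior of get_linear_schedule_with_warmup() (linear ramp from 0 to 1 during warmup, then linear decay to 0), not the library code itself; the function name linear_warmup_factor is an assumption made for this example.

```python
def linear_warmup_factor(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Multiplier applied to the base learning rate at `step`."""
    if step < num_warmup_steps:
        # Warmup phase: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: go linearly from 1 down to 0 at the last training step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-5  # default learning_rate in TrainingArguments
lrs = [base_lr * linear_warmup_factor(s, 10, 100) for s in (0, 5, 10, 55, 100)]
```

With 10 warmup steps out of 100, the learning rate reaches its peak at step 10 and is back to zero at step 100.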

evaluate(eval_dataset: Optional[torch.utils.data.dataset.Dataset] = None) → Dict[str, float][source]

Runs evaluation and returns metrics.

The calling script will be responsible for providing a method to compute metrics, as they are task-dependent (pass it to the init compute_metrics argument).

You can also subclass and override this method to inject custom behavior.

Parameters

eval_dataset (Dataset, optional) – Pass a dataset if you wish to override self.eval_dataset. If it is a datasets.Dataset, columns not accepted by the model.forward() method are automatically removed.

Returns

A dictionary containing the evaluation loss and the potential metrics computed from the predictions.

floating_point_ops(inputs: Dict[str, Union[torch.Tensor, Any]])[source]

For models that inherit from PreTrainedModel, uses the model's floating_point_ops method to compute the number of floating point operations for every backward + forward pass. If using another model, either implement such a method in the model or subclass and override this method.

Parameters
  • model (nn.Module) – The model to evaluate.

  • inputs (Dict[str, Union[torch.Tensor, Any]]) – The inputs and targets of the model.

Returns

The number of floating-point operations.

Return type

int

get_eval_dataloader(eval_dataset: Optional[torch.utils.data.dataset.Dataset] = None) → torch.utils.data.dataloader.DataLoader[source]

Returns the evaluation DataLoader.

Will use no sampler if self.eval_dataset is a torch.utils.data.IterableDataset, a sequential sampler (adapted to distributed training if necessary) otherwise.

Subclass and override this method if you want to inject some custom behavior.

Parameters

eval_dataset (torch.utils.data.dataset.Dataset, optional) – If provided, will override self.eval_dataset. If it is a datasets.Dataset, columns not accepted by the model.forward() method are automatically removed.

get_test_dataloader(test_dataset: torch.utils.data.dataset.Dataset) → torch.utils.data.dataloader.DataLoader[source]

Returns the test DataLoader.

Will use no sampler if test_dataset is a torch.utils.data.IterableDataset, a sequential sampler (adapted to distributed training if necessary) otherwise.

Subclass and override this method if you want to inject some custom behavior.

Parameters

test_dataset (torch.utils.data.dataset.Dataset) – The test dataset to use. If it is a datasets.Dataset, columns not accepted by the model.forward() method are automatically removed.

get_train_dataloader() → torch.utils.data.dataloader.DataLoader[source]

Returns the training DataLoader.

Will use no sampler if self.train_dataset is a torch.utils.data.IterableDataset, a random sampler (adapted to distributed training if necessary) otherwise.

Subclass and override this method if you want to inject some custom behavior.

Launch a hyperparameter search using optuna or Ray Tune. The optimized quantity is determined by compute_objective, which defaults to a function returning the evaluation loss when no metric is provided, and the sum of all metrics otherwise.

Warning

To use this method, you need to have provided a model_init when initializing your Trainer: we need to reinitialize the model at each new run. This is incompatible with the optimizers argument, so you need to subclass Trainer and override the method create_optimizer_and_scheduler() for custom optimizer/scheduler.

Parameters
  • hp_space (Callable[["optuna.Trial"], Dict[str, float]], optional) – A function that defines the hyperparameter search space. Will default to default_hp_space_optuna() or default_hp_space_ray() depending on your backend.

  • compute_objective (Callable[[Dict[str, float]], float], optional) – A function computing the objective to minimize or maximize from the metrics returned by the evaluate method. Will default to default_compute_objective().

  • n_trials (int, optional, defaults to 100) – The number of trial runs to test.

  • direction (str, optional, defaults to "minimize") – Whether to minimize or maximize the objective. Can be "minimize" or "maximize". Pick "minimize" when optimizing the validation loss and "maximize" when optimizing one or several metrics.

  • backend (str or HPSearchBackend, optional) – The backend to use for hyperparameter search. Will default to optuna or Ray Tune, depending on which one is installed. If both are installed, will default to optuna.

  • kwargs

    Additional keyword arguments passed along to optuna.create_study or ray.tune.run. For more information, see the documentation of those two functions.

Returns

All the information about the best run.

Return type

transformers.trainer_utils.BestRun
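The two callables described above can be sketched as follows. my_hp_space assumes the optuna backend and uses only the suggest_float and suggest_int methods of optuna.Trial; the parameter names and ranges are illustrative choices, picked to match TrainingArguments fields. my_compute_objective mirrors the default behavior described in the text (evaluation loss when no other metric is present, sum of the metrics otherwise) and assumes the loss is reported under the key "eval_loss" and that an "epoch" entry, if present, should be ignored. Both function names are hypothetical.

```python
from typing import Any, Dict


def my_hp_space(trial) -> Dict[str, Any]:
    """Search space for the optuna backend; `trial` is expected to be an optuna.Trial."""
    return {
        # Sample the learning rate on a log scale.
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 5),
    }


def my_compute_objective(metrics: Dict[str, float]) -> float:
    """Evaluation loss if it is the only metric, otherwise the sum of the other metrics."""
    remaining = dict(metrics)  # copy so the caller's dictionary is left untouched
    loss = remaining.pop("eval_loss", None)
    remaining.pop("epoch", None)  # assumed bookkeeping entry, not a metric
    return loss if not remaining else sum(remaining.values())
```

These would be passed as the hp_space and compute_objective arguments of the search, together with a direction that matches the objective: "minimize" for a loss, "maximize" for a sum of metrics.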

is_local_master() → bool[source]

Whether or not this process is the local (e.g., on one machine if training in a distributed fashion on several machines) main process.

Warning

This method is deprecated, use is_local_process_zero() instead.

is_local_process_zero() → bool[source]

Whether or not this process is the local (e.g., on one machine if training in a distributed fashion on several machines) main process.

is_world_master() → bool[source]

Whether or not this process is the global main process (when training in a distributed fashion on several machines, this is only going to be True for one process).

Warning

This method is deprecated, use is_world_process_zero() instead.

is_world_process_zero() → bool[source]

Whether or not this process is the global main process (when training in a distributed fashion on several machines, this is only going to be True for one process).

log(logs: Dict[str, float], iterator: Optional[tqdm.asyncio.tqdm_asyncio] = None) → None[source]

Log logs on the various objects watching training.

Subclass and override this method to inject custom behavior.

Parameters
  • logs (Dict[str, float]) – The values to log.

  • iterator (tqdm, optional) – A potential tqdm progress bar to write the logs on.

num_examples(dataloader: torch.utils.data.dataloader.DataLoader) → int[source]

Helper to get number of samples in a DataLoader by accessing its dataset.

predict(test_dataset: torch.utils.data.dataset.Dataset) → transformers.trainer_utils.PredictionOutput[source]

Runs prediction and returns predictions and potential metrics.

Depending on the dataset and your use case, your test dataset may contain labels. In that case, this method will also return metrics, like in evaluate().

Parameters

test_dataset (Dataset) – Dataset to run the predictions on. If it is a datasets.Dataset, columns not accepted by the model.forward() method are automatically removed.

Returns

predictions (np.ndarray):

The predictions on test_dataset.

label_ids (np.ndarray, optional):

The labels (if the dataset contained some).

metrics (Dict[str, float], optional):

The potential dictionary of metrics (if the dataset contained labels).

Return type

NamedTuple

prediction_loop(dataloader: torch.utils.data.dataloader.DataLoader, description: str, prediction_loss_only: Optional[bool] = None) → transformers.trainer_utils.PredictionOutput[source]

Prediction/evaluation loop, shared by Trainer.evaluate() and Trainer.predict().

Works both with or without labels.

prediction_step(model: torch.nn.modules.module.Module, inputs: Dict[str, Union[torch.Tensor, Any]], prediction_loss_only: bool) → Tuple[Optional[float], Optional[torch.Tensor], Optional[torch.Tensor]][source]

Perform an evaluation step on model using inputs.

Subclass and override to inject custom behavior.

Parameters
  • model (nn.Module) – The model to evaluate.

  • inputs (Dict[str, Union[torch.Tensor, Any]]) –

    The inputs and targets of the model.

    The dictionary will be unpacked before being fed to the model. Most models expect the targets under the argument labels. Check your model’s documentation for all accepted arguments.

  • prediction_loss_only (bool) – Whether or not to return the loss only.

Returns

A tuple with the loss, logits and labels (each being optional).

Return type

Tuple[Optional[float], Optional[torch.Tensor], Optional[torch.Tensor]]

save_model(output_dir: Optional[str] = None)[source]

Will save the model, so you can reload it using from_pretrained().

Will only save from the world_master process (unless running on TPUs).

setup_comet()[source]

Set up the optional Comet.ml integration.

Environment:
COMET_MODE:

(Optional): str - “OFFLINE”, “ONLINE”, or “DISABLED”

COMET_PROJECT_NAME:

(Optional): str - Comet.ml project name for experiments

COMET_OFFLINE_DIRECTORY:

(Optional): str - folder to use for saving offline experiments when COMET_MODE is “OFFLINE”

For a number of configurable items in the environment, see here
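As a small sketch of how these variables might be set from Python rather than from the shell, the snippet below configures offline logging. The project name and directory are made-up values, and it assumes the variables are set before the Trainer is created so that the integration sees them.

```python
import os

# Log Comet.ml experiments locally instead of sending them online.
os.environ["COMET_MODE"] = "OFFLINE"
# Hypothetical project name and output folder, chosen for this example.
os.environ["COMET_PROJECT_NAME"] = "my-experiments"
os.environ["COMET_OFFLINE_DIRECTORY"] = "./comet_offline"
```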

setup_wandb()[source]

Set up the optional Weights & Biases (wandb) integration.

One can subclass and override this method to customize the setup if needed. Find more information here. You can also override the following environment variables:

Environment:
WANDB_WATCH:

(Optional, [“gradients”, “all”, “false”]) “gradients” by default, set to “false” to disable gradient logging or “all” to log gradients and parameters

WANDB_PROJECT:

(Optional): str - “huggingface” by default, set this to a custom string to store results in a different project

WANDB_DISABLED:

(Optional): boolean - defaults to false, set to “true” to disable wandb entirely
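For instance, the following sketch stores results in a custom project and turns off gradient logging. The project name is a made-up value, and the snippet assumes the variables are set before the Trainer is created.

```python
import os

# Store results under a custom project instead of the default "huggingface".
os.environ["WANDB_PROJECT"] = "my-project"  # hypothetical project name
# Disable gradient logging; "all" would log gradients and parameters instead.
os.environ["WANDB_WATCH"] = "false"
```

Setting WANDB_DISABLED to "true" in the same way would disable the integration entirely.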

train(model_path: Optional[str] = None, trial: Union[optuna.Trial, Dict[str, Any]] = None)[source]

Main training entry point.

Parameters
  • model_path (str, optional) – Local path to the model if the model to train has been instantiated from a local path. If present, training will resume from the optimizer/scheduler states loaded here.

  • trial (optuna.Trial or Dict[str, Any], optional) – The trial run or the hyperparameter dictionary for hyperparameter search.

training_step(model: torch.nn.modules.module.Module, inputs: Dict[str, Union[torch.Tensor, Any]]) → torch.Tensor[source]

Perform a training step on a batch of inputs.

Subclass and override to inject custom behavior.

Parameters
  • model (nn.Module) – The model to train.

  • inputs (Dict[str, Union[torch.Tensor, Any]]) –

    The inputs and targets of the model.

    The dictionary will be unpacked before being fed to the model. Most models expect the targets under the argument labels. Check your model’s documentation for all accepted arguments.

Returns

The tensor with training loss on this batch.

Return type

torch.Tensor

TFTrainer

class transformers.TFTrainer(model: transformers.modeling_tf_utils.TFPreTrainedModel, args: transformers.training_args_tf.TFTrainingArguments, train_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, eval_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, compute_metrics: Optional[Callable[transformers.trainer_utils.EvalPrediction, Dict]] = None, tb_writer: Optional[tensorflow.python.ops.summary_ops_v2.SummaryWriter] = None, optimizers: Tuple[tensorflow.python.keras.optimizer_v2.optimizer_v2.OptimizerV2, tensorflow.python.keras.optimizer_v2.learning_rate_schedule.LearningRateSchedule] = (None, None), **kwargs)[source]

TFTrainer is a simple but feature-complete training and eval loop for TensorFlow, optimized for 🤗 Transformers.

Parameters
  • model (TFPreTrainedModel) – The model to train, evaluate or use for predictions.

  • args (TFTrainingArguments) – The arguments to tweak training.

  • train_dataset (Dataset, optional) – The dataset to use for training. The dataset should yield tuples of (features, labels) where features is a dict of input features and labels is the labels. If labels is a tensor, the loss is calculated by the model by calling model(features, labels=labels). If labels is a dict, such as when using a QuestionAnswering head model with multiple targets, the loss is instead calculated by calling model(features, **labels).

  • eval_dataset (Dataset, optional) – The dataset to use for evaluation. The dataset should yield tuples of (features, labels) where features is a dict of input features and labels is the labels. If labels is a tensor, the loss is calculated by the model by calling model(features, labels=labels). If labels is a dict, such as when using a QuestionAnswering head model with multiple targets, the loss is instead calculated by calling model(features, **labels).

  • compute_metrics (Callable[[EvalPrediction], Dict], optional) – The function that will be used to compute metrics at evaluation. Must take an EvalPrediction and return a dictionary mapping strings to metric values.

  • tb_writer (tf.summary.SummaryWriter, optional) – Object to write to TensorBoard.

  • optimizers (Tuple[tf.keras.optimizers.Optimizer, tf.keras.optimizers.schedules.LearningRateSchedule], optional) – A tuple containing the optimizer and the scheduler to use. The optimizer defaults to an instance of tf.keras.optimizers.Adam if args.weight_decay_rate is 0, else to an instance of AdamWeightDecay. The scheduler defaults to an instance of tf.keras.optimizers.schedules.PolynomialDecay if args.num_warmup_steps is 0, else to an instance of WarmUp.

  • kwargs – Deprecated keyword arguments.
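The rule for how labels are handed to the model (a tensor goes in as labels=..., a dict is unpacked into keyword arguments) can be sketched in a few lines. The helper call_with_labels and the stand-in fake_qa_model are hypothetical names used only to make the dispatch visible; a real model would return a loss and logits rather than echoing its arguments.

```python
def call_with_labels(model, features, labels):
    """Dispatch on the type of `labels`, as described for train_dataset and eval_dataset."""
    if isinstance(labels, dict):
        # Several targets, e.g. start/end positions for question answering.
        return model(features, **labels)
    # A single target passed under the `labels` keyword.
    return model(features, labels=labels)


def fake_qa_model(features, labels=None, start_positions=None, end_positions=None):
    """Stand-in model that reports which keyword arguments it received."""
    return {
        "labels": labels,
        "start_positions": start_positions,
        "end_positions": end_positions,
    }


single = call_with_labels(fake_qa_model, {"input_ids": [1, 2, 3]}, [0])
multi = call_with_labels(
    fake_qa_model,
    {"input_ids": [1, 2, 3]},
    {"start_positions": [4], "end_positions": [7]},
)
```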

create_optimizer_and_scheduler(num_training_steps: int)[source]

Set up the optimizer and the learning rate scheduler.

We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the TFTrainer’s init through optimizers, or subclass and override this method.

evaluate(eval_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None) → Dict[str, float][source]

Runs evaluation and returns metrics.

The calling script will be responsible for providing a method to compute metrics, as they are task-dependent (pass it to the init compute_metrics argument).

Parameters

eval_dataset (Dataset, optional) – Pass a dataset if you wish to override self.eval_dataset. The dataset should yield tuples of (features, labels) where features is a dict of input features and labels is the labels. If labels is a tensor, the loss is calculated by the model by calling model(features, labels=labels). If labels is a dict, such as when using a QuestionAnswering head model with multiple targets, the loss is instead calculated by calling model(features, **labels).

Returns

A dictionary containing the evaluation loss and the potential metrics computed from the predictions.

get_eval_tfdataset(eval_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None) → tensorflow.python.data.ops.dataset_ops.DatasetV2[source]

Returns the evaluation Dataset.

Parameters

eval_dataset (Dataset, optional) – If provided, will override self.eval_dataset. The dataset should yield tuples of (features, labels) where features is a dict of input features and labels is the labels. If labels is a tensor, the loss is calculated by the model by calling model(features, labels=labels). If labels is a dict, such as when using a QuestionAnswering head model with multiple targets, the loss is instead calculated by calling model(features, **labels).

Subclass and override this method if you want to inject some custom behavior.

get_test_tfdataset(test_dataset: tensorflow.python.data.ops.dataset_ops.DatasetV2) → tensorflow.python.data.ops.dataset_ops.DatasetV2[source]

Returns a test Dataset.

Parameters

test_dataset (Dataset) – The dataset to use. The dataset should yield tuples of (features, labels) where features is a dict of input features and labels is the labels. If labels is a tensor, the loss is calculated by the model by calling model(features, labels=labels). If labels is a dict, such as when using a QuestionAnswering head model with multiple targets, the loss is instead calculated by calling model(features, **labels).

Subclass and override this method if you want to inject some custom behavior.

get_train_tfdataset() → tensorflow.python.data.ops.dataset_ops.DatasetV2[source]

Returns the training Dataset.

Subclass and override this method if you want to inject some custom behavior.

log(logs: Dict[str, float]) → None[source]

Log logs on the various objects watching training.

Subclass and override this method to inject custom behavior.

Parameters

logs (Dict[str, float]) – The values to log.

predict(test_dataset: tensorflow.python.data.ops.dataset_ops.DatasetV2) → transformers.trainer_utils.PredictionOutput[source]

Runs prediction and returns predictions and potential metrics.

Depending on the dataset and your use case, your test dataset may contain labels. In that case, this method will also return metrics, like in evaluate().

Parameters

test_dataset (Dataset) – Dataset to run the predictions on. The dataset should yield tuples of (features, labels) where features is a dict of input features and labels is the labels. If labels is a tensor, the loss is calculated by the model by calling model(features, labels=labels). If labels is a dict, such as when using a QuestionAnswering head model with multiple targets, the loss is instead calculated by calling model(features, **labels).

Returns

predictions (np.ndarray):

The predictions on test_dataset.

label_ids (np.ndarray, optional):

The labels (if the dataset contained some).

metrics (Dict[str, float], optional):

The potential dictionary of metrics (if the dataset contained labels).

Return type

NamedTuple

prediction_loop(dataset: tensorflow.python.data.ops.dataset_ops.DatasetV2, steps: int, num_examples: int, description: str, prediction_loss_only: Optional[bool] = None) → transformers.trainer_utils.PredictionOutput[source]

Prediction/evaluation loop, shared by evaluate() and predict().

Works both with or without labels.

prediction_step(features: tensorflow.python.framework.ops.Tensor, labels: tensorflow.python.framework.ops.Tensor, nb_instances_in_global_batch: tensorflow.python.framework.ops.Tensor) → tensorflow.python.framework.ops.Tensor[source]

Compute the prediction on features and update the loss with labels.

Subclass and override to inject some custom behavior.

run_model(features, labels, training)[source]

Computes the loss of the given features and labels pair.

Subclass and override this method if you want to inject some custom behavior.

Parameters
  • features (tf.Tensor) – A batch of input features.

  • labels (tf.Tensor) – A batch of labels.

  • training (bool) – Whether or not to run the model in training mode.

Returns

The loss and logits.

Return type

A tuple of two tf.Tensor

save_model(output_dir: Optional[str] = None)[source]

Will save the model, so you can reload it using from_pretrained().

setup_comet()[source]

Set up the optional Comet.ml integration.

Environment:
COMET_MODE:

(Optional): str - “OFFLINE”, “ONLINE”, or “DISABLED”

COMET_PROJECT_NAME:

(Optional): str - Comet.ml project name for experiments

COMET_OFFLINE_DIRECTORY:

(Optional): str - folder to use for saving offline experiments when COMET_MODE is “OFFLINE”

For a number of configurable items in the environment, see here

setup_wandb()[source]

Set up the optional Weights & Biases (wandb) integration.

One can subclass and override this method to customize the setup if needed. Find more information here. You can also override the following environment variables:

Environment:
WANDB_PROJECT:

(Optional): str - “huggingface” by default, set this to a custom string to store results in a different project

WANDB_DISABLED:

(Optional): boolean - defaults to false, set to “true” to disable wandb entirely

train() → None[source]

Train method to train the model.

training_step(features, labels, nb_instances_in_global_batch)[source]

Perform a training step on features and labels.

Subclass and override to inject some custom behavior.

TrainingArguments

class transformers.TrainingArguments(output_dir: str, overwrite_output_dir: bool = False, do_train: bool = False, do_eval: bool = False, do_predict: bool = False, evaluate_during_training: bool = None, evaluation_strategy: transformers.trainer_utils.EvaluationStrategy = 'no', prediction_loss_only: bool = False, per_device_train_batch_size: int = 8, per_device_eval_batch_size: int = 8, per_gpu_train_batch_size: Optional[int] = None, per_gpu_eval_batch_size: Optional[int] = None, gradient_accumulation_steps: int = 1, learning_rate: float = 5e-05, weight_decay: float = 0.0, adam_beta1: float = 0.9, adam_beta2: float = 0.999, adam_epsilon: float = 1e-08, max_grad_norm: float = 1.0, num_train_epochs: float = 3.0, max_steps: int = -1, warmup_steps: int = 0, logging_dir: Optional[str] = <factory>, logging_first_step: bool = False, logging_steps: int = 500, save_steps: int = 500, save_total_limit: Optional[int] = None, no_cuda: bool = False, seed: int = 42, fp16: bool = False, fp16_opt_level: str = 'O1', local_rank: int = -1, tpu_num_cores: Optional[int] = None, tpu_metrics_debug: bool = False, debug: bool = False, dataloader_drop_last: bool = False, eval_steps: int = None, dataloader_num_workers: int = 0, past_index: int = -1, run_name: Optional[str] = None, disable_tqdm: Optional[bool] = None, remove_unused_columns: Optional[bool] = True, label_names: Optional[List[str]] = None)[source]

TrainingArguments is the subset of the arguments we use in our example scripts which relate to the training loop itself.

Using HfArgumentParser we can turn this class into argparse arguments to be able to specify them on the command line.
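The idea of turning a dataclass into command-line arguments can be illustrated with the standard library alone. The sketch below is not HfArgumentParser and makes no claim about how it is implemented; MiniTrainingArgs and parse_into_dataclass are hypothetical names, and only str, int and float fields are handled (boolean flags would need extra care).

```python
import argparse
from dataclasses import MISSING, dataclass, fields


@dataclass
class MiniTrainingArgs:
    """Tiny stand-in with a few fields named like those of TrainingArguments."""
    output_dir: str
    learning_rate: float = 5e-5
    num_train_epochs: float = 3.0
    seed: int = 42


def parse_into_dataclass(cls, argv):
    """Build one --flag per dataclass field, then parse argv into an instance."""
    parser = argparse.ArgumentParser()
    for field in fields(cls):
        if field.default is MISSING:
            # Fields without a default become required flags.
            parser.add_argument(f"--{field.name}", type=field.type, required=True)
        else:
            parser.add_argument(f"--{field.name}", type=field.type, default=field.default)
    return cls(**vars(parser.parse_args(argv)))


args = parse_into_dataclass(
    MiniTrainingArgs, ["--output_dir", "out", "--learning_rate", "3e-5"]
)
```

Here args.learning_rate comes from the command line, while seed and num_train_epochs keep their defaults.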

Parameters
  • output_dir (str) – The output directory where the model predictions and checkpoints will be written.

  • overwrite_output_dir (bool, optional, defaults to False) – If True, overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory.

  • do_train (bool, optional, defaults to False) – Whether to run training or not.

  • do_eval (bool, optional, defaults to False) – Whether to run evaluation on the dev set or not.

  • do_predict (bool, optional, defaults to False) – Whether to run predictions on the test set or not.

  • evaluation_strategy (str or EvaluationStrategy, optional, defaults to "no") –

    The evaluation strategy to adopt during training. Possible values are:

    • "no": No evaluation is done during training.

    • "steps": Evaluation is done (and logged) every eval_steps.

    • "epoch": Evaluation is done at the end of each epoch.

  • prediction_loss_only (bool, optional, defaults to False) – When performing evaluation and predictions, only returns the loss.

  • per_device_train_batch_size (int, optional, defaults to 8) – The batch size per GPU/TPU core/CPU for training.

  • per_device_eval_batch_size (int, optional, defaults to 8) – The batch size per GPU/TPU core/CPU for evaluation.

  • gradient_accumulation_steps (int, optional, defaults to 1) – Number of update steps to accumulate the gradients for before performing a backward/update pass.

    Warning

    When using gradient accumulation, one step is counted as one step with backward pass. Therefore, logging, evaluation and saving will be conducted every gradient_accumulation_steps * xxx_step training examples.

  • learning_rate (float, optional, defaults to 5e-5) – The initial learning rate for Adam.

  • weight_decay (float, optional, defaults to 0) – The weight decay to apply (if not zero).

  • adam_epsilon (float, optional, defaults to 1e-8) – Epsilon for the Adam optimizer.

  • max_grad_norm (float, optional, defaults to 1.0) – Maximum gradient norm (for gradient clipping).

  • num_train_epochs (float, optional, defaults to 3.0) – Total number of training epochs to perform (if not an integer, will perform the decimal part percents of the last epoch before stopping training).

  • max_steps (int, optional, defaults to -1) – If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs.

  • warmup_steps (int, optional, defaults to 0) – Number of steps used for a linear warmup from 0 to learning_rate.

  • logging_dir (str, optional) – Tensorboard log directory. Will default to runs/CURRENT_DATETIME_HOSTNAME.

  • logging_first_step (bool, optional, defaults to False) – Whether to log and evaluate the first global_step or not.

  • logging_steps (int, optional, defaults to 500) – Number of update steps between two logs.

  • save_steps (int, optional, defaults to 500) – Number of update steps between two checkpoint saves.

  • save_total_limit (int, optional) – If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in output_dir.

  • no_cuda (bool, optional, defaults to False) – Whether to avoid using CUDA even when it is available.

  • seed (int, optional, defaults to 42) – Random seed for initialization.

  • fp16 (bool, optional, defaults to False) – Whether to use 16-bit (mixed) precision training (through NVIDIA apex) instead of 32-bit training.

  • fp16_opt_level (str, optional, defaults to ‘O1’) – For fp16 training, apex AMP optimization level selected in [‘O0’, ‘O1’, ‘O2’, and ‘O3’]. See details on the apex documentation.

  • local_rank (int, optional, defaults to -1) – During distributed training, the rank of the process.

  • tpu_num_cores (int, optional) – When training on TPU, the number of TPU cores (automatically passed by launcher script).

  • debug (bool, optional, defaults to False) – When training on TPU, whether to print debug metrics or not.

  • dataloader_drop_last (bool, optional, defaults to False) – Whether to drop the last incomplete batch (if the length of the dataset is not divisible by the batch size) or not.

  • eval_steps (int, optional) – Number of update steps between two evaluations if evaluation_strategy="steps". Will default to the same value as logging_steps if not set.

  • dataloader_num_workers (int, optional, defaults to 0) – Number of subprocesses to use for data loading (PyTorch only). 0 means that the data will be loaded in the main process.

  • past_index (int, optional, defaults to -1) – Some models like TransformerXL or XLNet can make use of the past hidden states for their predictions. If this argument is set to a positive int, the Trainer will use the corresponding output (usually index 2) as the past state and feed it to the model at the next training step under the keyword argument mems.

  • run_name (str, optional) – A descriptor for the run. Notably used for wandb logging.

  • disable_tqdm (bool, optional) – Whether or not to disable the tqdm progress bars. Will default to True if the logging level is set to warn or lower (default), False otherwise.

  • remove_unused_columns (bool, optional, defaults to True) –

    If using datasets.Dataset datasets, whether or not to automatically remove the columns unused by the model forward method.

    (Note: this behavior is not implemented for TFTrainer yet.)

  • label_names (List[str], optional) –

    The list of keys in your dictionary of inputs that correspond to the labels.

    Will eventually default to ["labels"] except if the model used is one of the XxxForQuestionAnswering in which case it will default to ["start_positions", "end_positions"].
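As an illustration of the dataloader_drop_last option described in the parameter list above, here is a minimal sketch of how the number of batches per pass changes when the dataset length is not divisible by the batch size. The helper num_batches is hypothetical and is not part of the library.

```python
import math


def num_batches(dataset_len: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches a data loader yields over one pass of the dataset."""
    if drop_last:
        # The trailing partial batch is discarded.
        return dataset_len // batch_size
    # The trailing partial batch is kept as a smaller batch.
    return math.ceil(dataset_len / batch_size)


# 1000 examples with a batch size of 32 leave a remainder of 8 examples.
print(num_batches(1000, 32, drop_last=False))
print(num_batches(1000, 32, drop_last=True))
```

With drop_last=True the 8 leftover examples are skipped in that pass, which keeps every batch the same shape.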

property device

The device used by this process.

property eval_batch_size

The actual batch size for evaluation (may differ from per_gpu_eval_batch_size in distributed training).

property n_gpu

The number of GPUs used by this process.

Note

This will only be greater than one when you have multiple GPUs available but are not using distributed training. For distributed training, it will always be 1.

to_dict()[source]

Serializes this instance while replacing Enum members with their values (for JSON serialization support).

to_json_string()[source]

Serializes this instance to a JSON string.
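The idea behind these two serialization methods can be sketched with the standard library alone. ToyArguments and Strategy below are hypothetical stand-ins, not the real TrainingArguments, and the library's own implementation may differ in detail.

```python
import dataclasses
import json
from enum import Enum


class Strategy(Enum):  # hypothetical stand-in for an Enum-typed field
    NO = "no"
    STEPS = "steps"


@dataclasses.dataclass
class ToyArguments:  # hypothetical, not the real TrainingArguments
    output_dir: str
    evaluation_strategy: Strategy = Strategy.NO
    logging_steps: int = 500

    def to_dict(self) -> dict:
        d = dataclasses.asdict(self)
        # Swap each Enum member for its underlying value so json.dumps accepts it.
        return {k: (v.value if isinstance(v, Enum) else v) for k, v in d.items()}

    def to_json_string(self) -> str:
        return json.dumps(self.to_dict(), indent=2)


args = ToyArguments(output_dir="out", evaluation_strategy=Strategy.STEPS)
print(args.to_json_string())
```

Without the Enum replacement, json.dumps would raise a TypeError on the evaluation_strategy field.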

to_sanitized_dict() → Dict[str, Any][source]

Sanitized serialization to use with TensorBoard’s hparams.
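A rough sketch of what "sanitized" can mean here, assuming that the hyperparameter logger only accepts simple scalar types and that every other value is converted to a string. The function sanitize and the VALID_TYPES tuple are illustrative assumptions, not the library's code.

```python
from typing import Any, Dict

# Assumption: the types a hyperparameter logger accepts as-is.
VALID_TYPES = (bool, int, float, str)


def sanitize(params: Dict[str, Any]) -> Dict[str, Any]:
    """Keep simple scalars untouched and stringify everything else."""
    return {k: (v if type(v) in VALID_TYPES else str(v)) for k, v in params.items()}


raw = {
    "learning_rate": 5e-5,
    "fp16": False,
    "label_names": ["labels"],  # a list is not a valid scalar type
    "tpu_num_cores": None,      # neither is None
}
print(sanitize(raw))
```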

property train_batch_size

The actual batch size for training (may differ from per_gpu_train_batch_size in distributed training).
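The relationship between the per-device batch size, n_gpu and the actual batch size can be sketched as follows. This assumes the per-device value is multiplied by the number of GPUs driven by the current process (with a floor of 1 for CPU-only runs); effective_batch_size is a hypothetical helper.

```python
def effective_batch_size(per_device_batch_size: int, n_gpu: int) -> int:
    """Batch size seen by one process, assuming the per-device size is
    multiplied by the number of GPUs that process drives (at least 1)."""
    return per_device_batch_size * max(1, n_gpu)


print(effective_batch_size(8, n_gpu=4))  # one process driving 4 GPUs
print(effective_batch_size(8, n_gpu=1))  # distributed training: one GPU per process
print(effective_batch_size(8, n_gpu=0))  # CPU only
```

This is why the value may differ from the per-device setting only in the single-process, multi-GPU case: in distributed training n_gpu is 1 for each process.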

TFTrainingArguments

class transformers.TFTrainingArguments(output_dir: str, overwrite_output_dir: bool = False, do_train: bool = False, do_eval: bool = False, do_predict: bool = False, evaluate_during_training: bool = None, evaluation_strategy: transformers.trainer_utils.EvaluationStrategy = 'no', prediction_loss_only: bool = False, per_device_train_batch_size: int = 8, per_device_eval_batch_size: int = 8, per_gpu_train_batch_size: Optional[int] = None, per_gpu_eval_batch_size: Optional[int] = None, gradient_accumulation_steps: int = 1, learning_rate: float = 5e-05, weight_decay: float = 0.0, adam_beta1: float = 0.9, adam_beta2: float = 0.999, adam_epsilon: float = 1e-08, max_grad_norm: float = 1.0, num_train_epochs: float = 3.0, max_steps: int = -1, warmup_steps: int = 0, logging_dir: Optional[str] = <factory>, logging_first_step: bool = False, logging_steps: int = 500, save_steps: int = 500, save_total_limit: Optional[int] = None, no_cuda: bool = False, seed: int = 42, fp16: bool = False, fp16_opt_level: str = 'O1', local_rank: int = -1, tpu_num_cores: Optional[int] = None, tpu_metrics_debug: bool = False, debug: bool = False, dataloader_drop_last: bool = False, eval_steps: int = None, dataloader_num_workers: int = 0, past_index: int = -1, run_name: Optional[str] = None, disable_tqdm: Optional[bool] = None, remove_unused_columns: Optional[bool] = True, label_names: Optional[List[str]] = None, tpu_name: str = None, xla: bool = False)[source]

TrainingArguments is the subset of the arguments we use in our example scripts which relate to the training loop itself.

Using HfArgumentParser we can turn this class into argparse arguments to be able to specify them on the command line.
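To make the idea of turning a dataclass into command-line arguments concrete, here is a simplified sketch built only on argparse and dataclasses. It is not HfArgumentParser itself: ToyTrainingArguments and build_parser are hypothetical, and only plain str, int and float fields are handled.

```python
import argparse
import dataclasses


@dataclasses.dataclass
class ToyTrainingArguments:  # hypothetical subset of the real fields
    output_dir: str
    learning_rate: float = 5e-5
    num_train_epochs: float = 3.0
    seed: int = 42


def build_parser(cls) -> argparse.ArgumentParser:
    """Create one --flag per dataclass field (simple scalar types only)."""
    parser = argparse.ArgumentParser()
    for field in dataclasses.fields(cls):
        # A field without a default becomes a required argument.
        required = field.default is dataclasses.MISSING
        kwargs = {"type": field.type, "required": required}
        if not required:
            kwargs["default"] = field.default
        parser.add_argument(f"--{field.name}", **kwargs)
    return parser


# An explicit list stands in for sys.argv[1:].
parsed = build_parser(ToyTrainingArguments).parse_args(
    ["--output_dir", "out", "--learning_rate", "3e-5"]
)
args = ToyTrainingArguments(**vars(parsed))
print(args)
```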

Parameters
  • output_dir (str) – The output directory where the model predictions and checkpoints will be written.

  • overwrite_output_dir (bool, optional, defaults to False) – If True, overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory.

  • do_train (bool, optional, defaults to False) – Whether to run training or not.

  • do_eval (bool, optional, defaults to False) – Whether to run evaluation on the dev set or not.

  • do_predict (bool, optional, defaults to False) – Whether to run predictions on the test set or not.

  • evaluate_during_training (bool, optional, defaults to False) – Whether to run evaluation during training at each logging step or not.

  • per_device_train_batch_size (int, optional, defaults to 8) – The batch size per GPU/TPU core/CPU for training.

  • per_device_eval_batch_size (int, optional, defaults to 8) – The batch size per GPU/TPU core/CPU for evaluation.

  • gradient_accumulation_steps (int, optional, defaults to 1) – Number of update steps to accumulate the gradients for, before performing a backward/update pass.

    Warning

    When using gradient accumulation, one step is counted as one step with backward pass. Therefore, logging, evaluation, and saving will be conducted every gradient_accumulation_steps * xxx_step training examples.

  • learning_rate (float, optional, defaults to 5e-5) – The initial learning rate for Adam.

  • weight_decay (float, optional, defaults to 0) – The weight decay to apply (if not zero).

  • adam_epsilon (float, optional, defaults to 1e-8) – Epsilon for the Adam optimizer.

  • max_grad_norm (float, optional, defaults to 1.0) – Maximum gradient norm (for gradient clipping).

  • num_train_epochs (float, optional, defaults to 3.0) – Total number of training epochs to perform.

  • max_steps (int, optional, defaults to -1) – If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs.

  • warmup_steps (int, optional, defaults to 0) – Number of steps used for a linear warmup from 0 to learning_rate.

  • logging_dir (str, optional) – Tensorboard log directory. Will default to runs/CURRENT_DATETIME_HOSTNAME.

  • logging_first_step (bool, optional, defaults to False) – Whether to log and evaluate the first global_step or not.

  • logging_steps (int, optional, defaults to 500) – Number of update steps between two logs.

  • save_steps (int, optional, defaults to 500) – Number of update steps between two checkpoint saves.

  • save_total_limit (int, optional) – If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in output_dir.

  • no_cuda (bool, optional, defaults to False) – Whether or not to avoid using CUDA even when it is available.

  • seed (int, optional, defaults to 42) – Random seed for initialization.

  • fp16 (bool, optional, defaults to False) – Whether to use 16-bit (mixed) precision training (through NVIDIA apex) instead of 32-bit training.

  • fp16_opt_level (str, optional, defaults to ‘O1’) – For fp16 training, apex AMP optimization level selected in [‘O0’, ‘O1’, ‘O2’, and ‘O3’]. See details on the apex documentation.

  • local_rank (int, optional, defaults to -1) – During distributed training, the rank of the process.

  • tpu_num_cores (int, optional) – When training on TPU, the number of TPU cores (automatically passed by launcher script).

  • debug (bool, optional, defaults to False) – Whether to activate the trace to record computation graphs and profiling information or not.

  • dataloader_drop_last (bool, optional, defaults to False) – Whether to drop the last incomplete batch (if the length of the dataset is not divisible by the batch size) or not.

  • eval_steps (int, optional, defaults to 1000) – Number of update steps between two evaluations.

  • past_index (int, optional, defaults to -1) – Some models like TransformerXL or XLNet can make use of the past hidden states for their predictions. If this argument is set to a positive int, the Trainer will use the corresponding output (usually index 2) as the past state and feed it to the model at the next training step under the keyword argument mems.

  • tpu_name (str, optional) – The name of the TPU the process is running on.

  • run_name (str, optional) – A descriptor for the run. Notably used for wandb logging.

  • xla (bool, optional) – Whether to activate the XLA compilation or not.
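The gradient_accumulation_steps parameter in the list above can be illustrated with a small simulation. This is a sketch under simplifying assumptions: both helpers are hypothetical, no real gradients are computed, and leftover batches at the end of a pass are simply ignored.

```python
from typing import List


def examples_per_update(per_device_batch_size: int, n_devices: int,
                        gradient_accumulation_steps: int) -> int:
    """Training examples that contribute to a single optimizer update."""
    return per_device_batch_size * n_devices * gradient_accumulation_steps


def update_points(num_batches: int, gradient_accumulation_steps: int) -> List[int]:
    """1-based batch indices after which the optimizer would step."""
    points = []
    for i in range(1, num_batches + 1):
        # A real loop would run forward/backward here, adding this batch's
        # gradients to the ones already accumulated.
        if i % gradient_accumulation_steps == 0:
            points.append(i)  # optimizer step, then reset the gradients
    return points


print(examples_per_update(8, 1, 4))
print(update_points(10, 4))
```

With 10 batches and 4 accumulation steps, updates happen after batches 4 and 8, so logging, evaluation and saving intervals expressed in steps count these updates rather than individual batches.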

property eval_batch_size

The actual batch size for evaluation (may differ from per_gpu_eval_batch_size in distributed training).

property n_gpu

The number of replicas (CPUs, GPUs or TPU cores) used in this training.

property n_replicas

The number of replicas (CPUs, GPUs or TPU cores) used in this training.

property strategy

The strategy used for distributed training.

property train_batch_size

The actual batch size for training (may differ from per_gpu_train_batch_size in distributed training).

Utilities

class transformers.EvalPrediction(predictions: Union[numpy.ndarray, Tuple[numpy.ndarray]], label_ids: numpy.ndarray)[source]

Evaluation output (always contains labels), to be used to compute metrics.

Parameters
  • predictions (np.ndarray) – Predictions of the model.

  • label_ids (np.ndarray) – Targets to be matched.
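As a sketch of how a compute_metrics function might consume these two fields, the example below computes accuracy from per-class scores. ToyEvalPrediction is a hypothetical stand-in that uses plain lists instead of np.ndarray so the snippet runs without extra dependencies.

```python
from typing import Dict, List, NamedTuple


class ToyEvalPrediction(NamedTuple):  # stand-in mirroring the two documented fields
    predictions: List[List[float]]    # one row of class scores per example
    label_ids: List[int]              # the target class for each example


def compute_accuracy(p: ToyEvalPrediction) -> Dict[str, float]:
    # Index of the highest score in each row (an argmax over classes).
    predicted = [max(range(len(row)), key=row.__getitem__) for row in p.predictions]
    correct = sum(int(a == b) for a, b in zip(predicted, p.label_ids))
    return {"accuracy": correct / len(p.label_ids)}


p = ToyEvalPrediction(
    predictions=[[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]],
    label_ids=[1, 0, 0],
)
print(compute_accuracy(p))
```

Here two of the three predictions match their labels. The function returns a dictionary, matching the Dict return type expected of compute_metrics.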

transformers.set_seed(seed: int)[source]

Helper function for reproducible behavior to set the seed in random, numpy, torch and/or tf (if installed).

Parameters

seed (int) – The seed to set.
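A minimal sketch of what such a helper can look like, assuming optional libraries are seeded only when they can be imported. set_seed_sketch is a hypothetical name and the library's own function may differ; TensorFlow is left out here, where tf.random.set_seed would be the analogous call.

```python
import random


def set_seed_sketch(seed: int) -> None:
    """Seed every random number generator that is available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # numpy is not installed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass  # torch is not installed


set_seed_sketch(42)
first = random.random()
set_seed_sketch(42)
second = random.random()
print(first == second)
```

Re-seeding with the same value reproduces the same random draws, which is the behavior the helper is meant to provide.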

transformers.torch_distributed_zero_first(local_rank: int)[source]

Context manager that makes all processes in distributed training wait for each local master (the process with local_rank 0) to do something first.

Parameters

local_rank (int) – The rank of the local process.
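The waiting pattern can be sketched as a context manager built around a barrier. In this illustration the barrier is passed in as a plain callable so the snippet runs without a distributed setup; in a real run it would be something like torch.distributed.barrier. The name zero_first and its signature are assumptions made for the sketch.

```python
from contextlib import contextmanager
from typing import Callable, List


@contextmanager
def zero_first(local_rank: int, barrier: Callable[[], None]):
    if local_rank not in (-1, 0):
        barrier()  # non-master ranks wait here until the master is done
    yield
    if local_rank == 0:
        barrier()  # the master reaches the barrier and releases the others


def run(local_rank: int) -> List[str]:
    """Record the order of events for one simulated process."""
    events: List[str] = []
    with zero_first(local_rank, barrier=lambda: events.append("barrier")):
        events.append("work")
    return events


print(run(0))   # local master: works first, then hits the barrier
print(run(1))   # other rank: hits the barrier first, then works
print(run(-1))  # not distributed: no barrier at all
```

A typical use is letting the local master download or preprocess a dataset once while the other processes wait and then read the cached result.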