Trainer¶

The Trainer and TFTrainer classes provide an API for feature-complete training in most standard use cases. This API is used in most of the example scripts.

Before instantiating your Trainer/TFTrainer, create a TrainingArguments/TFTrainingArguments to access all the points of customization during training.

The API supports distributed training on multiple GPUs/TPUs, and mixed precision through NVIDIA Apex for PyTorch and tf.keras.mixed_precision for TensorFlow.

Trainer¶

class transformers.Trainer(model: transformers.modeling_utils.PreTrainedModel, args: transformers.training_args.TrainingArguments, data_collator: Optional[DataCollator] = None, train_dataset: Optional[torch.utils.data.dataset.Dataset] = None, eval_dataset: Optional[torch.utils.data.dataset.Dataset] = None, compute_metrics: Optional[Callable[transformers.trainer_utils.EvalPrediction, Dict]] = None, prediction_loss_only=False, tb_writer: Optional[SummaryWriter] = None, optimizers: Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = None)[source]¶

Trainer is a simple but feature-complete training and eval loop for PyTorch, optimized for 🤗 Transformers.

Parameters
  • model (PreTrainedModel) – The model to train, evaluate or use for predictions.

  • args (TrainingArguments) – The arguments to tweak training.

  • data_collator (DataCollator, optional, defaults to default_data_collator()) – The function used to form a batch from a list of elements of train_dataset or eval_dataset.

  • train_dataset (Dataset, optional) – The dataset to use for training.

  • eval_dataset (Dataset, optional) – The dataset to use for evaluation.

  • compute_metrics (Callable[[EvalPrediction], Dict], optional) – The function that will be used to compute metrics at evaluation. Must take an EvalPrediction and return a dictionary mapping metric names (strings) to metric values.

  • prediction_loss_only (bool, optional, defaults to False) – If True, only the loss is returned when performing evaluation and prediction.

  • tb_writer (SummaryWriter, optional) – Object to write to TensorBoard.

  • optimizers (Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR], optional) – A tuple containing the optimizer and the scheduler to use. Will default to an instance of AdamW on your model and a scheduler given by get_linear_schedule_with_warmup() controlled by args.
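As an illustration of the compute_metrics contract described in the list above, here is a minimal sketch of an accuracy metric. It assumes the predictions are per-class scores of shape (num_examples, num_classes). EvalPredictionStandIn and compute_accuracy are hypothetical names; the stand-in only mirrors the two fields of EvalPrediction so the snippet runs without the library installed.

```python
from typing import Dict, NamedTuple

import numpy as np


class EvalPredictionStandIn(NamedTuple):
    """Hypothetical stand-in with the same two fields as EvalPrediction."""

    predictions: np.ndarray
    label_ids: np.ndarray


def compute_accuracy(p: EvalPredictionStandIn) -> Dict[str, float]:
    # Assumption: predictions hold one score per class for each example.
    predicted_classes = np.argmax(p.predictions, axis=-1)
    accuracy = float((predicted_classes == p.label_ids).mean())
    # The return value maps metric names to metric values.
    return {"accuracy": accuracy}


example = EvalPredictionStandIn(
    predictions=np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]),
    label_ids=np.array([1, 0, 0, 0]),
)
metrics = compute_accuracy(example)
print(metrics)
```

A function with this shape would be passed as the compute_metrics argument when constructing the trainer.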

evaluate(eval_dataset: Optional[torch.utils.data.dataset.Dataset] = None) → Dict[str, float][source]¶

Runs evaluation and returns metrics.

The calling script is responsible for providing a method to compute metrics, as they are task-dependent (pass it through the compute_metrics argument at init).

Parameters

eval_dataset (Dataset, optional) – Pass a dataset if you wish to override self.eval_dataset.

Returns

A dictionary containing the evaluation loss and the potential metrics computed from the predictions.

get_eval_dataloader(eval_dataset: Optional[torch.utils.data.dataset.Dataset] = None) → torch.utils.data.dataloader.DataLoader[source]¶

Returns the evaluation DataLoader.

Parameters

eval_dataset (Dataset, optional) – If provided, will override self.eval_dataset.

get_optimizers(num_training_steps: int) → Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR][source]¶

Set up the optimizer and the learning rate scheduler.

We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the Trainer’s init through optimizers, or override this method in a subclass.
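The default scheduler is a linear schedule with warmup. A minimal sketch of its shape in plain Python (no torch required) could look like the following; linear_warmup_decay_factor is a hypothetical name, and the function only computes the multiplier applied to the initial learning rate at a given step.

```python
def linear_warmup_decay_factor(
    step: int, num_warmup_steps: int, num_training_steps: int
) -> float:
    """Multiplier applied to the initial learning rate at a given step."""
    if step < num_warmup_steps:
        # Warmup phase: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: fall linearly from 1 down to 0 at the last step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


learning_rate = 5e-5
for step in (0, 50, 100, 550, 1000):
    factor = linear_warmup_decay_factor(
        step, num_warmup_steps=100, num_training_steps=1000
    )
    print(step, learning_rate * factor)
```

With 100 warmup steps out of 1000, the learning rate peaks at step 100 and is back to half its peak value at step 550.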

get_test_dataloader(test_dataset: torch.utils.data.dataset.Dataset) → torch.utils.data.dataloader.DataLoader[source]¶

Returns the test DataLoader.

Parameters

test_dataset (Dataset) – The test dataset to use.

get_train_dataloader() → torch.utils.data.dataloader.DataLoader[source]¶

Returns the training DataLoader.

is_world_master() → bool[source]¶

This will be True only in one process, even in distributed mode, even when training on multiple machines.

num_examples(dataloader: torch.utils.data.dataloader.DataLoader) → int[source]¶

Helper to get number of samples in a DataLoader by accessing its Dataset.

predict(test_dataset: torch.utils.data.dataset.Dataset) → transformers.trainer_utils.PredictionOutput[source]¶

Runs prediction and returns predictions and potential metrics.

Depending on the dataset and your use case, your test dataset may contain labels. In that case, this method will also return metrics, like in evaluate().

Parameters

test_dataset (Dataset) – Dataset to run the predictions on.

Returns

predictions (np.ndarray):

The predictions on test_dataset.

label_ids (np.ndarray, optional):

The labels (if the dataset contained some).

metrics (Dict[str, float], optional):

The potential dictionary of metrics (if the dataset contained labels).

Return type

NamedTuple
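Because the return value is a NamedTuple, its fields can be read by name or unpacked by position. The sketch below uses a hypothetical stand-in class with the three documented fields, and plain lists instead of np.ndarray to stay dependency-free; it shows the case where the test dataset has no labels.

```python
from typing import Dict, List, NamedTuple, Optional


class PredictionOutputStandIn(NamedTuple):
    """Hypothetical stand-in mirroring the three documented fields."""

    predictions: List[List[float]]  # np.ndarray in the real return value
    label_ids: Optional[List[int]]
    metrics: Optional[Dict[str, float]]


output = PredictionOutputStandIn(
    predictions=[[0.2, 0.8], [0.9, 0.1]],
    label_ids=None,
    metrics=None,
)

# Fields can be read by name...
print(output.predictions)

# ...or unpacked positionally, like any tuple.
predictions, label_ids, metrics = output
if metrics is None:
    print("The test dataset had no labels, so no metrics were computed.")
```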

save_model(output_dir: Optional[str] = None)[source]¶

Will save the model, so you can reload it using from_pretrained().

Will only save from the world_master process (unless running on TPUs).

train(model_path: Optional[str] = None)[source]¶

Main training entry point.

Parameters

model_path (str, optional) – Local path to the model if the model to train has been instantiated from a local path. If present, training will resume from the optimizer/scheduler states loaded here.

TFTrainer¶

class transformers.TFTrainer(model: transformers.modeling_tf_utils.TFPreTrainedModel, args: transformers.training_args_tf.TFTrainingArguments, train_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, eval_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, compute_metrics: Optional[Callable[transformers.trainer_utils.EvalPrediction, Dict]] = None, prediction_loss_only=False, tb_writer: Optional[tensorflow.python.ops.summary_ops_v2.SummaryWriter] = None, optimizers: Tuple[tensorflow.python.keras.optimizer_v2.optimizer_v2.OptimizerV2, tensorflow.python.keras.optimizer_v2.learning_rate_schedule.LearningRateSchedule] = None)[source]¶

TFTrainer is a simple but feature-complete training and eval loop for TensorFlow, optimized for 🤗 Transformers.

Parameters
  • model (TFPreTrainedModel) – The model to train, evaluate or use for predictions.

  • args (TFTrainingArguments) – The arguments to tweak training.

  • train_dataset (Dataset, optional) – The dataset to use for training.

  • eval_dataset (Dataset, optional) – The dataset to use for evaluation.

  • compute_metrics (Callable[[EvalPrediction], Dict], optional) – The function that will be used to compute metrics at evaluation. Must take an EvalPrediction and return a dictionary mapping metric names (strings) to metric values.

  • prediction_loss_only (bool, optional, defaults to False) – If True, only the loss is returned when performing evaluation and prediction.

  • tb_writer (tf.summary.SummaryWriter, optional) – Object to write to TensorBoard.

  • optimizers (Tuple[tf.keras.optimizers.Optimizer, tf.keras.optimizers.schedules.LearningRateSchedule], optional) – A tuple containing the optimizer and the scheduler to use. The optimizer defaults to an instance of tf.keras.optimizers.Adam if args.weight_decay_rate is 0, else to an instance of AdamWeightDecay. The scheduler defaults to an instance of tf.keras.optimizers.schedules.PolynomialDecay if args.num_warmup_steps is 0, else to an instance of WarmUp.

evaluate(eval_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None) → Dict[str, float][source]¶

Runs evaluation and returns metrics.

The calling script is responsible for providing a method to compute metrics, as they are task-dependent (pass it through the compute_metrics argument at init).

Parameters

eval_dataset (Dataset, optional) – Pass a dataset if you wish to override self.eval_dataset.

Returns

A dictionary containing the evaluation loss and the potential metrics computed from the predictions.

get_eval_tfdataset(eval_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None) → tensorflow.python.data.ops.dataset_ops.DatasetV2[source]¶

Returns the evaluation Dataset.

Parameters

eval_dataset (Dataset, optional) – If provided, will override self.eval_dataset.

get_optimizers(num_training_steps: int) → Tuple[tensorflow.python.keras.optimizer_v2.optimizer_v2.OptimizerV2, tensorflow.python.keras.optimizer_v2.learning_rate_schedule.LearningRateSchedule][source]¶

Set up the optimizer and the learning rate scheduler.

We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the TFTrainer’s init through optimizers, or override this method in a subclass.

get_test_tfdataset(test_dataset: tensorflow.python.data.ops.dataset_ops.DatasetV2) → tensorflow.python.data.ops.dataset_ops.DatasetV2[source]¶

Returns a test Dataset.

Parameters

test_dataset (Dataset) – The dataset to use.

get_train_tfdataset() → tensorflow.python.data.ops.dataset_ops.DatasetV2[source]¶

Returns the training Dataset.

predict(test_dataset: tensorflow.python.data.ops.dataset_ops.DatasetV2) → transformers.trainer_utils.PredictionOutput[source]¶

Runs prediction and returns predictions and potential metrics.

Depending on the dataset and your use case, your test dataset may contain labels. In that case, this method will also return metrics, like in evaluate().

Parameters

test_dataset (Dataset) – Dataset to run the predictions on.

Returns

predictions (np.ndarray):

The predictions on test_dataset.

label_ids (np.ndarray, optional):

The labels (if the dataset contained some).

metrics (Dict[str, float], optional):

The potential dictionary of metrics (if the dataset contained labels).

Return type

NamedTuple

save_model(output_dir: Optional[str] = None)[source]¶

Will save the model, so you can reload it using from_pretrained().

train() → None[source]¶

Trains the model.

TrainingArguments¶

class transformers.TrainingArguments(output_dir: str, overwrite_output_dir: bool = False, do_train: bool = False, do_eval: bool = False, do_predict: bool = False, evaluate_during_training: bool = False, per_device_train_batch_size: int = 8, per_device_eval_batch_size: int = 8, per_gpu_train_batch_size: Optional[int] = None, per_gpu_eval_batch_size: Optional[int] = None, gradient_accumulation_steps: int = 1, learning_rate: float = 5e-05, weight_decay: float = 0.0, adam_epsilon: float = 1e-08, max_grad_norm: float = 1.0, num_train_epochs: float = 3.0, max_steps: int = -1, warmup_steps: int = 0, logging_dir: Optional[str] = <factory>, logging_first_step: bool = False, logging_steps: int = 500, save_steps: int = 500, save_total_limit: Optional[int] = None, no_cuda: bool = False, seed: int = 42, fp16: bool = False, fp16_opt_level: str = 'O1', local_rank: int = -1, tpu_num_cores: Optional[int] = None, tpu_metrics_debug: bool = False, debug: bool = False, dataloader_drop_last: bool = False, eval_steps: int = 1000, past_index: int = -1)[source]¶

TrainingArguments is the subset of the arguments we use in our example scripts which relate to the training loop itself.

Using HfArgumentParser we can turn this class into argparse arguments to be able to specify them on the command line.
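The idea of turning a dataclass into command-line arguments can be sketched with the standard library alone. This is not HfArgumentParser itself, only an illustration of the principle; TinyArguments and build_parser are hypothetical names, and only str and float fields are handled.

```python
import argparse
from dataclasses import MISSING, dataclass, fields


@dataclass
class TinyArguments:
    """Hypothetical stand-in for a few TrainingArguments fields."""

    output_dir: str
    learning_rate: float = 5e-5
    num_train_epochs: float = 3.0


def build_parser(dataclass_type: type) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    for field in fields(dataclass_type):
        if field.default is MISSING:
            # No default value: the flag must be given on the command line.
            parser.add_argument(f"--{field.name}", type=field.type, required=True)
        else:
            parser.add_argument(
                f"--{field.name}", type=field.type, default=field.default
            )
    return parser


# An explicit argument list is used here instead of reading sys.argv.
namespace = build_parser(TinyArguments).parse_args(
    ["--output_dir", "./out", "--learning_rate", "3e-5"]
)
args = TinyArguments(**vars(namespace))
print(args)
```

Boolean fields need extra care in a real parser, since passing type=bool to argparse treats any non-empty string as True.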

Parameters
  • output_dir (str) – The output directory where the model predictions and checkpoints will be written.

  • overwrite_output_dir (bool, optional, defaults to False) – If True, overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory.

  • do_train (bool, optional, defaults to False) – Whether to run training or not.

  • do_eval (bool, optional, defaults to False) – Whether to run evaluation on the dev set or not.

  • do_predict (bool, optional, defaults to False) – Whether to run predictions on the test set or not.

  • evaluate_during_training (bool, optional, defaults to False) – Whether to run evaluation during training at each logging step or not.

  • per_device_train_batch_size (int, optional, defaults to 8) – The batch size per GPU/TPU core/CPU for training.

  • per_device_eval_batch_size (int, optional, defaults to 8) – The batch size per GPU/TPU core/CPU for evaluation.

  • gradient_accumulation_steps (int, optional, defaults to 1) – Number of update steps to accumulate the gradients for, before performing a backward/update pass.

  • learning_rate (float, optional, defaults to 5e-5) – The initial learning rate for Adam.

  • weight_decay (float, optional, defaults to 0) – The weight decay to apply (if not zero).

  • adam_epsilon (float, optional, defaults to 1e-8) – Epsilon for the Adam optimizer.

  • max_grad_norm (float, optional, defaults to 1.0) – Maximum gradient norm (for gradient clipping).

  • num_train_epochs (float, optional, defaults to 3.0) – Total number of training epochs to perform.

  • max_steps (int, optional, defaults to -1) – If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs.

  • warmup_steps (int, optional, defaults to 0) – Number of steps used for a linear warmup from 0 to learning_rate.

  • logging_dir (str, optional) – TensorBoard log directory. Will default to runs/CURRENT_DATETIME_HOSTNAME.

  • logging_first_step (bool, optional, defaults to False) – Whether to log and evaluate the first global_step or not.

  • logging_steps (int, optional, defaults to 500) – Number of update steps between two logs.

  • save_steps (int, optional, defaults to 500) – Number of update steps between two checkpoint saves.

  • save_total_limit (int, optional) – If a value is passed, will limit the total number of checkpoints. Deletes the older checkpoints in output_dir.

  • no_cuda (bool, optional, defaults to False) – Whether to avoid using CUDA even when it is available.

  • seed (int, optional, defaults to 42) – Random seed for initialization.

  • fp16 (bool, optional, defaults to False) – Whether to use 16-bit (mixed) precision training (through NVIDIA apex) instead of 32-bit training.

  • fp16_opt_level (str, optional, defaults to ‘O1’) – For fp16 training, apex AMP optimization level selected in [‘O0’, ‘O1’, ‘O2’, and ‘O3’]. See details on the apex documentation.

  • local_rank (int, optional, defaults to -1) – During distributed training, the rank of the process.

  • tpu_num_cores (int, optional) – When training on TPU, the number of TPU cores (automatically passed by launcher script).

  • debug (bool, optional, defaults to False) – When training on TPU, whether to print debug metrics or not.

  • dataloader_drop_last (bool, optional, defaults to False) – Whether to drop the last incomplete batch (if the length of the dataset is not divisible by the batch size) or not.

  • eval_steps (int, optional, defaults to 1000) – Number of update steps between two evaluations.

  • past_index (int, optional, defaults to -1) – Some models like TransformerXL or XLNet can make use of the past hidden states for their predictions. If this argument is set to a positive int, the Trainer will use the corresponding output (usually index 2) as the past state and feed it to the model at the next training step under the keyword argument mems.
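The interaction between num_train_epochs, max_steps and gradient_accumulation_steps described in the list above can be sketched as follows. This assumes one optimizer update per gradient_accumulation_steps batches, with leftover batches in an epoch dropped by floor division; total_optimization_steps is a hypothetical helper, not part of the library.

```python
def total_optimization_steps(
    num_batches_per_epoch: int,
    gradient_accumulation_steps: int = 1,
    num_train_epochs: float = 3.0,
    max_steps: int = -1,
) -> int:
    """Number of optimizer updates for a run (illustrative sketch)."""
    if max_steps > 0:
        # A positive max_steps overrides num_train_epochs.
        return max_steps
    # Assumption: leftover batches that do not fill an accumulation are dropped.
    updates_per_epoch = num_batches_per_epoch // gradient_accumulation_steps
    return int(updates_per_epoch * num_train_epochs)


# 1000 batches per epoch, accumulating over 4 batches, for 3 epochs.
print(total_optimization_steps(1000, gradient_accumulation_steps=4))

# The same run capped by max_steps.
print(total_optimization_steps(1000, gradient_accumulation_steps=4, max_steps=200))
```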

property device¶

The device used by this process.

property eval_batch_size¶

The actual batch size for evaluation (may differ from per_gpu_eval_batch_size in distributed training).

property n_gpu¶

The number of GPUs used by this process.

Note

This will only be greater than one when you have multiple GPUs available but are not using distributed training. For distributed training, it will always be 1.

to_json_string()[source]¶

Serializes this instance to a JSON string.

to_sanitized_dict() → Dict[str, Any][source]¶

Sanitized serialization to use with TensorBoard’s hparams.

property train_batch_size¶

The actual batch size for training (may differ from per_gpu_train_batch_size in distributed training).
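A minimal sketch of how the actual batch size can differ from the per-device value, assuming that a single process feeds every GPU it can see and that a CPU-only process counts as one device. Both helper names are hypothetical, and num_processes is an illustrative parameter for the number of distributed processes.

```python
def per_process_batch_size(per_device_batch_size: int, n_gpu: int) -> int:
    # Assumption: on CPU (n_gpu == 0) the multiplier is 1.
    return per_device_batch_size * max(1, n_gpu)


def examples_per_update(
    per_device_batch_size: int,
    n_gpu: int,
    gradient_accumulation_steps: int = 1,
    num_processes: int = 1,
) -> int:
    """Examples contributing to one optimizer update across all processes."""
    return (
        per_process_batch_size(per_device_batch_size, n_gpu)
        * gradient_accumulation_steps
        * num_processes
    )


# One process driving 4 GPUs, without distributed training.
print(per_process_batch_size(8, n_gpu=4))

# Distributed training: 4 processes with 1 GPU each, accumulating over 2 batches.
print(examples_per_update(8, n_gpu=1, gradient_accumulation_steps=2, num_processes=4))
```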

TFTrainingArguments¶

class transformers.TFTrainingArguments(output_dir: str, overwrite_output_dir: bool = False, do_train: bool = False, do_eval: bool = False, do_predict: bool = False, evaluate_during_training: bool = False, per_device_train_batch_size: int = 8, per_device_eval_batch_size: int = 8, per_gpu_train_batch_size: Optional[int] = None, per_gpu_eval_batch_size: Optional[int] = None, gradient_accumulation_steps: int = 1, learning_rate: float = 5e-05, weight_decay: float = 0.0, adam_epsilon: float = 1e-08, max_grad_norm: float = 1.0, num_train_epochs: float = 3.0, max_steps: int = -1, warmup_steps: int = 0, logging_dir: Optional[str] = <factory>, logging_first_step: bool = False, logging_steps: int = 500, save_steps: int = 500, save_total_limit: Optional[int] = None, no_cuda: bool = False, seed: int = 42, fp16: bool = False, fp16_opt_level: str = 'O1', local_rank: int = -1, tpu_num_cores: Optional[int] = None, tpu_metrics_debug: bool = False, debug: bool = False, dataloader_drop_last: bool = False, eval_steps: int = 1000, past_index: int = -1, tpu_name: str = None)[source]¶

TFTrainingArguments is the subset of the arguments we use in our example scripts which relate to the training loop itself.

Using HfArgumentParser we can turn this class into argparse arguments to be able to specify them on the command line.

Parameters
  • output_dir (str) – The output directory where the model predictions and checkpoints will be written.

  • overwrite_output_dir (bool, optional, defaults to False) – If True, overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory.

  • do_train (bool, optional, defaults to False) – Whether to run training or not.

  • do_eval (bool, optional, defaults to False) – Whether to run evaluation on the dev set or not.

  • do_predict (bool, optional, defaults to False) – Whether to run predictions on the test set or not.

  • evaluate_during_training (bool, optional, defaults to False) – Whether to run evaluation during training at each logging step or not.

  • per_device_train_batch_size (int, optional, defaults to 8) – The batch size per GPU/TPU core/CPU for training.

  • per_device_eval_batch_size (int, optional, defaults to 8) – The batch size per GPU/TPU core/CPU for evaluation.

  • gradient_accumulation_steps (int, optional, defaults to 1) – Number of update steps to accumulate the gradients for, before performing a backward/update pass.

  • learning_rate (float, optional, defaults to 5e-5) – The initial learning rate for Adam.

  • weight_decay (float, optional, defaults to 0) – The weight decay to apply (if not zero).

  • adam_epsilon (float, optional, defaults to 1e-8) – Epsilon for the Adam optimizer.

  • max_grad_norm (float, optional, defaults to 1.0) – Maximum gradient norm (for gradient clipping).

  • num_train_epochs (float, optional, defaults to 3.0) – Total number of training epochs to perform.

  • max_steps (int, optional, defaults to -1) – If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs.

  • warmup_steps (int, optional, defaults to 0) – Number of steps used for a linear warmup from 0 to learning_rate.

  • logging_dir (str, optional) – TensorBoard log directory. Will default to runs/CURRENT_DATETIME_HOSTNAME.

  • logging_first_step (bool, optional, defaults to False) – Whether to log and evaluate the first global_step or not.

  • logging_steps (int, optional, defaults to 500) – Number of update steps between two logs.

  • save_steps (int, optional, defaults to 500) – Number of update steps between two checkpoint saves.

  • save_total_limit (int, optional) – If a value is passed, will limit the total number of checkpoints. Deletes the older checkpoints in output_dir.

  • no_cuda (bool, optional, defaults to False) – Whether to avoid using CUDA even when it is available.

  • seed (int, optional, defaults to 42) – Random seed for initialization.

  • fp16 (bool, optional, defaults to False) – Whether to use 16-bit (mixed) precision training (through NVIDIA apex) instead of 32-bit training.

  • fp16_opt_level (str, optional, defaults to ‘O1’) – For fp16 training, apex AMP optimization level selected in [‘O0’, ‘O1’, ‘O2’, and ‘O3’]. See details on the apex documentation.

  • local_rank (int, optional, defaults to -1) – During distributed training, the rank of the process.

  • tpu_num_cores (int, optional) – When training on TPU, the number of TPU cores (automatically passed by launcher script).

  • debug (bool, optional, defaults to False) – Whether to activate the trace to record computation graphs and profiling information or not.

  • dataloader_drop_last (bool, optional, defaults to False) – Whether to drop the last incomplete batch (if the length of the dataset is not divisible by the batch size) or not.

  • eval_steps (int, optional, defaults to 1000) – Number of update steps between two evaluations.

  • past_index (int, optional, defaults to -1) – Some models like TransformerXL or XLNet can make use of the past hidden states for their predictions. If this argument is set to a positive int, the Trainer will use the corresponding output (usually index 2) as the past state and feed it to the model at the next training step under the keyword argument mems.

  • tpu_name (str, optional) – The name of the TPU the process is running on.
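The default logging_dir of the form runs/CURRENT_DATETIME_HOSTNAME mentioned in the list above can be sketched with the standard library. The exact timestamp format below is an assumption, and default_logdir_sketch is a hypothetical name.

```python
import os
import socket
from datetime import datetime


def default_logdir_sketch() -> str:
    # Assumed timestamp format, e.g. "Jan05_14-30-00".
    current_time = datetime.now().strftime("%b%d_%H-%M-%S")
    # Combine the date/time and the host name under the "runs" directory.
    return os.path.join("runs", current_time + "_" + socket.gethostname())


logdir = default_logdir_sketch()
print(logdir)
```

Including the host name keeps logs from different machines apart when they share a file system.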

property eval_batch_size¶

The actual batch size for evaluation (may differ from per_gpu_eval_batch_size in distributed training).

property n_gpu¶

The number of replicas (CPUs, GPUs or TPU cores) used in this training.

property n_replicas¶

The number of replicas (CPUs, GPUs or TPU cores) used in this training.

property strategy¶

The strategy used for distributed training.

property train_batch_size¶

The actual batch size for training (may differ from per_gpu_train_batch_size in distributed training).

Utilities¶

class transformers.EvalPrediction(predictions: numpy.ndarray, label_ids: numpy.ndarray)[source]¶

Evaluation output (always contains labels), to be used to compute metrics.

Parameters
  • predictions (np.ndarray) – Predictions of the model.

  • label_ids (np.ndarray) – Targets to be matched.

transformers.set_seed(seed: int)[source]¶

Helper function for reproducible behavior to set the seed in random, numpy, torch and/or tf (if installed).

Parameters

seed (int) – The seed to set.
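A rough sketch of what such a helper does, assuming it seeds each library only when that library is installed. set_seed_sketch is a hypothetical name; TensorFlow seeding is left out here to keep the example light.

```python
import random


def set_seed_sketch(seed: int) -> None:
    """Seed the random number generators that are available (illustrative)."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch

        torch.manual_seed(seed)
        # Safe to call even when CUDA is not available.
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


set_seed_sketch(42)
first = [random.random() for _ in range(3)]
set_seed_sketch(42)
second = [random.random() for _ in range(3)]
print(first == second)
```

Re-seeding with the same value reproduces the same sequence of random draws, which is what makes runs comparable.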

transformers.torch_distributed_zero_first(local_rank: int)[source]¶

Context manager to make all processes in distributed training wait for each local_master to do something.

Parameters

local_rank (int) – The rank of the local process.
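The "local master goes first" pattern can be sketched without PyTorch by injecting the synchronization primitive. In the sketch below, zero_first and fake_barrier are hypothetical names; in a real distributed run the barrier argument would typically be torch.distributed.barrier.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def zero_first(local_rank: int, barrier: Callable[[], None]) -> Iterator[None]:
    # Every process except the local master (rank 0), or a non-distributed
    # run (rank -1), waits at the barrier before entering the block.
    if local_rank not in (-1, 0):
        barrier()
    yield
    # The local master reaches the barrier after finishing its work,
    # which releases the waiting processes.
    if local_rank == 0:
        barrier()


events: List[str] = []


def fake_barrier() -> None:
    # Records the call instead of synchronizing real processes.
    events.append("barrier")


with zero_first(local_rank=0, barrier=fake_barrier):
    events.append("work")  # e.g. download and cache a dataset once

print(events)
```

On the local master the work runs before the barrier; on any other rank the order is reversed, so those processes can reuse whatever the local master prepared.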