Trainer
The Trainer and TFTrainer classes provide an API for feature-complete training in most standard use cases. They are used in most of the example scripts.
Before instantiating your Trainer/TFTrainer, create a TrainingArguments/TFTrainingArguments to access all the points of customization during training.
The API supports distributed training on multiple GPUs/TPUs and mixed precision, through NVIDIA Apex for PyTorch and tf.keras.mixed_precision for TensorFlow.
Both Trainer and TFTrainer contain the basic training loop supporting the previous features. To inject custom behavior you can subclass them and override the following methods:
get_train_dataloader/get_train_tfdataset – Creates the training DataLoader (PyTorch) or TF Dataset.
get_eval_dataloader/get_eval_tfdataset – Creates the evaluation DataLoader (PyTorch) or TF Dataset.
get_test_dataloader/get_test_tfdataset – Creates the test DataLoader (PyTorch) or TF Dataset.
log – Logs information on the various objects watching training.
create_optimizer_and_scheduler – Sets up the optimizer and learning rate scheduler if they were not passed at init.
compute_loss – Computes the loss on a batch of training inputs.
training_step – Performs a training step.
prediction_step – Performs an evaluation/test step.
run_model (TensorFlow only) – Basic pass through the model.
evaluate – Runs an evaluation loop and returns metrics.
predict – Returns predictions (with metrics if labels are available) on a test set.
Here is an example of how to customize Trainer using a custom loss function:
    from transformers import Trainer

    class MyTrainer(Trainer):
        def compute_loss(self, model, inputs):
            # Take the labels out so they are not passed to the model.
            labels = inputs.pop("labels")
            outputs = model(**inputs)
            logits = outputs[0]
            # my_custom_loss is a user-defined function, not part of the library.
            return my_custom_loss(logits, labels)
Another way to customize the training loop behavior for the PyTorch Trainer is to use callbacks that can inspect the training loop state (for progress reporting, logging on TensorBoard or other ML platforms…) and take decisions (like early stopping).
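As a rough illustration of a callback that takes a decision, the sketch below stops training once the evaluation loss has failed to improve for a few evaluations in a row. It assumes the callback hook is named on_evaluate, receives the metrics dictionary as a keyword argument, and that the control object exposes a should_training_stop flag. The class name PatienceStopper and the patience parameter are hypothetical. For real use the class would subclass transformers.TrainerCallback; the base class is left out here so the sketch runs on its own.

```python
from types import SimpleNamespace


class PatienceStopper:
    """Request a stop when eval_loss has not improved for `patience` evaluations.

    Hypothetical example class. In real use it would inherit from
    transformers.TrainerCallback (assumption: same hook name and arguments).
    """

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        loss = (metrics or {}).get("eval_loss")
        if loss is None:
            return control
        if loss < self.best:
            # New best loss: remember it and reset the counter.
            self.best = loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals >= self.patience:
                control.should_training_stop = True
        return control


# Standalone demonstration with a stand-in for the control object.
stopper = PatienceStopper(patience=2)
control = SimpleNamespace(should_training_stop=False)
for eval_loss in [1.0, 0.8, 0.9, 0.85]:
    stopper.on_evaluate(None, None, control, metrics={"eval_loss": eval_loss})
print(control.should_training_stop)
```

Once it subclasses TrainerCallback, an instance could be passed through the callbacks argument of Trainer or registered with add_callback().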
Trainer
-
class transformers.Trainer(model: torch.nn.modules.module.Module = None, args: transformers.training_args.TrainingArguments = None, data_collator: Optional[DataCollator] = None, train_dataset: Optional[torch.utils.data.dataset.Dataset] = None, eval_dataset: Optional[torch.utils.data.dataset.Dataset] = None, tokenizer: Optional[PreTrainedTokenizerBase] = None, model_init: Callable[[], transformers.modeling_utils.PreTrainedModel] = None, compute_metrics: Optional[Callable[[transformers.trainer_utils.EvalPrediction], Dict]] = None, callbacks: Optional[List[transformers.trainer_callback.TrainerCallback]] = None, optimizers: Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None), **kwargs)
Trainer is a simple but feature-complete training and eval loop for PyTorch, optimized for 🤗 Transformers.
- Parameters
model (PreTrainedModel or torch.nn.Module, optional) – The model to train, evaluate or use for predictions. If not provided, a model_init must be passed.
Note: Trainer is optimized to work with the PreTrainedModel provided by the library. You can still use your own models defined as torch.nn.Module as long as they work the same way as the 🤗 Transformers models.
args (TrainingArguments, optional) – The arguments to tweak for training. If not provided, will default to a basic instance of TrainingArguments with the output_dir set to a directory named tmp_trainer in the current directory.
data_collator (DataCollator, optional) – The function to use to form a batch from a list of elements of train_dataset or eval_dataset. Will default to default_data_collator() if no tokenizer is provided, an instance of DataCollatorWithPadding otherwise.
train_dataset (torch.utils.data.dataset.Dataset, optional) – The dataset to use for training. If it is a datasets.Dataset, columns not accepted by the model.forward() method are automatically removed.
eval_dataset (torch.utils.data.dataset.Dataset, optional) – The dataset to use for evaluation. If it is a datasets.Dataset, columns not accepted by the model.forward() method are automatically removed.
tokenizer (PreTrainedTokenizerBase, optional) – The tokenizer used to preprocess the data. If provided, it will be used to automatically pad the inputs to the maximum length when batching inputs, and it will be saved along with the model to make it easier to rerun an interrupted training or reuse the fine-tuned model.
model_init (Callable[[], PreTrainedModel], optional) – A function that instantiates the model to be used. If provided, each call to train() will start from a new instance of the model as given by this function.
The function may take zero arguments, or a single one containing the optuna/Ray Tune trial object, to be able to choose different architectures according to hyperparameters (such as layer count, sizes of inner layers, dropout probabilities etc).
compute_metrics (Callable[[EvalPrediction], Dict], optional) – The function that will be used to compute metrics at evaluation. Must take an EvalPrediction and return a dictionary mapping strings to metric values.
callbacks (List of TrainerCallback, optional) – A list of callbacks to customize the training loop. Will add those to the list of default callbacks (see the callbacks documentation).
If you want to remove one of the default callbacks used, use the Trainer.remove_callback() method.
optimizers (Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR], optional) – A tuple containing the optimizer and the scheduler to use. Will default to an instance of AdamW on your model and a scheduler given by get_linear_schedule_with_warmup() controlled by args.
kwargs – Deprecated keyword arguments.
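To make the compute_metrics contract concrete, here is a minimal sketch of an accuracy function. It only assumes that the object it receives exposes predictions (per-class scores) and label_ids, as EvalPrediction does. FakeEvalPrediction is a hypothetical stand-in so the example runs without the library, and plain Python is used instead of NumPy for the same reason.

```python
from collections import namedtuple

# Hypothetical stand-in exposing the two fields the sketch relies on.
FakeEvalPrediction = namedtuple("FakeEvalPrediction", ["predictions", "label_ids"])


def compute_metrics(eval_pred):
    """Return accuracy computed from per-class scores and integer labels."""
    # Index of the highest score in each row (an argmax over the last axis).
    predicted = [
        max(range(len(row)), key=row.__getitem__) for row in eval_pred.predictions
    ]
    correct = sum(int(p == y) for p, y in zip(predicted, eval_pred.label_ids))
    return {"accuracy": correct / len(predicted)}


example = FakeEvalPrediction(
    predictions=[[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]],
    label_ids=[1, 0, 0],
)
metrics = compute_metrics(example)
print(metrics)
```

With real NumPy predictions, an argmax over the last axis would be the more idiomatic way to get the predicted classes.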
-
add_callback(callback)
Add a callback to the current list of TrainerCallback.
- Parameters
callback (type or TrainerCallback) – A TrainerCallback class or an instance of a TrainerCallback. In the first case, will instantiate a member of that class.
-
compute_loss(model, inputs)
How the loss is computed by Trainer. By default, all models return the loss in the first element.
Subclass and override for custom behavior.
-
create_optimizer_and_scheduler(num_training_steps: int)
Set up the optimizer and the learning rate scheduler.
We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the Trainer’s init through optimizers, or subclass and override this method.
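The default scheduler named in the optimizers parameter, get_linear_schedule_with_warmup(), ramps the learning rate up linearly and then decays it linearly to zero. The sketch below reproduces that shape in plain Python as a multiplier on the base learning rate. It is an illustration of the schedule as we understand it, not the library's code, and the function name linear_warmup_multiplier is hypothetical.

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given update step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-5  # default learning_rate in TrainingArguments
schedule = [base_lr * linear_warmup_multiplier(s, 10, 100) for s in range(101)]
print(max(schedule))
```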
-
evaluate(eval_dataset: Optional[torch.utils.data.dataset.Dataset] = None) → Dict[str, float]
Runs evaluation and returns metrics.
The calling script is responsible for providing a method to compute metrics, as they are task-dependent (pass it to the init compute_metrics argument).
You can also subclass and override this method to inject custom behavior.
- Parameters
eval_dataset (Dataset, optional) – Pass a dataset if you wish to override self.eval_dataset. If it is a datasets.Dataset, columns not accepted by the model.forward() method are automatically removed. It must implement the __len__ method.
- Returns
A dictionary containing the evaluation loss and the potential metrics computed from the predictions. The dictionary also contains the epoch number which comes from the training state.
-
floating_point_ops(inputs: Dict[str, Union[torch.Tensor, Any]])
For models that inherit from PreTrainedModel, uses that method to compute the number of floating point operations for every backward + forward pass. If using another model, either implement such a method in the model or subclass and override this method.
- Parameters
inputs (Dict[str, Union[torch.Tensor, Any]]) – The inputs and targets of the model.
- Returns
The number of floating-point operations.
- Return type
int
-
get_eval_dataloader(eval_dataset: Optional[torch.utils.data.dataset.Dataset] = None) → torch.utils.data.dataloader.DataLoader
Returns the evaluation DataLoader.
Subclass and override this method if you want to inject some custom behavior.
- Parameters
eval_dataset (torch.utils.data.dataset.Dataset, optional) – If provided, will override self.eval_dataset. If it is a datasets.Dataset, columns not accepted by the model.forward() method are automatically removed. It must implement __len__.
-
get_test_dataloader(test_dataset: torch.utils.data.dataset.Dataset) → torch.utils.data.dataloader.DataLoader
Returns the test DataLoader.
Subclass and override this method if you want to inject some custom behavior.
- Parameters
test_dataset (torch.utils.data.dataset.Dataset) – The test dataset to use. If it is a datasets.Dataset, columns not accepted by the model.forward() method are automatically removed. It must implement __len__.
-
get_train_dataloader() → torch.utils.data.dataloader.DataLoader
Returns the training DataLoader.
Will use no sampler if self.train_dataset does not implement __len__, a random sampler (adapted to distributed training if necessary) otherwise.
Subclass and override this method if you want to inject some custom behavior.
-
hyperparameter_search(hp_space: Optional[Callable[[optuna.Trial], Dict[str, float]]] = None, compute_objective: Optional[Callable[[Dict[str, float]], float]] = None, n_trials: int = 20, direction: str = 'minimize', backend: Optional[Union[str, transformers.trainer_utils.HPSearchBackend]] = None, hp_name: Optional[Callable[[optuna.Trial], str]] = None, **kwargs) → transformers.trainer_utils.BestRun
Launch a hyperparameter search using optuna or Ray Tune. The optimized quantity is determined by compute_objective, which defaults to a function returning the evaluation loss when no metric is provided, the sum of all metrics otherwise.
Warning: To use this method, you need to have provided a model_init when initializing your Trainer: we need to reinitialize the model at each new run. This is incompatible with the optimizers argument, so you need to subclass Trainer and override the method create_optimizer_and_scheduler() for custom optimizer/scheduler.
- Parameters
hp_space (Callable[["optuna.Trial"], Dict[str, float]], optional) – A function that defines the hyperparameter search space. Will default to default_hp_space_optuna() or default_hp_space_ray() depending on your backend.
compute_objective (Callable[[Dict[str, float]], float], optional) – A function computing the objective to minimize or maximize from the metrics returned by the evaluate method. Will default to default_compute_objective().
n_trials (int, optional, defaults to 20) – The number of trial runs to test.
direction (str, optional, defaults to "minimize") – Whether to minimize or maximize the objective. Can be "minimize" or "maximize"; you should pick "minimize" when optimizing the validation loss, "maximize" when optimizing one or several metrics.
backend (str or HPSearchBackend, optional) – The backend to use for hyperparameter search. Will default to optuna or Ray Tune, depending on which one is installed. If both are installed, will default to optuna.
kwargs – Additional keyword arguments passed along to optuna.create_study or ray.tune.run. For more information see the documentation of optuna.create_study and the documentation of tune.run.
- Returns
All the information about the best run.
- Return type
transformers.trainer_utils.BestRun
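As an illustration of the hp_space and compute_objective arguments, the sketch below defines a small optuna-style search space and an objective that returns the evaluation loss. The function names my_hp_space and my_objective are hypothetical, the metric key eval_loss is an assumption about what evaluate() reports, and StubTrial is only a stand-in so the example runs without optuna installed. The suggest_float, suggest_int and suggest_categorical calls follow the optuna Trial API.

```python
def my_hp_space(trial):
    """Search space keyed by TrainingArguments field names (assumed mapping)."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 5),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
    }


def my_objective(metrics):
    """Objective to minimize: the evaluation loss reported in the metrics."""
    return metrics["eval_loss"]


class StubTrial:
    """Stand-in for optuna.Trial that always picks the lowest/first value."""

    def suggest_float(self, name, low, high, log=False):
        return low

    def suggest_int(self, name, low, high):
        return low

    def suggest_categorical(self, name, choices):
        return choices[0]


space = my_hp_space(StubTrial())
print(space)
```

With a Trainer built from a model_init, these would be passed as trainer.hyperparameter_search(hp_space=my_hp_space, compute_objective=my_objective, direction="minimize").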
-
is_local_master() → bool
Whether or not this process is the local (e.g., on one machine if training in a distributed fashion on several machines) main process.
Warning: This method is deprecated, use is_local_process_zero() instead.
-
is_local_process_zero() → bool
Whether or not this process is the local (e.g., on one machine if training in a distributed fashion on several machines) main process.
-
is_world_master() → bool
Whether or not this process is the global main process (when training in a distributed fashion on several machines, this is only going to be True for one process).
Warning: This method is deprecated, use is_world_process_zero() instead.
-
is_world_process_zero() → bool
Whether or not this process is the global main process (when training in a distributed fashion on several machines, this is only going to be True for one process).
-
log(logs: Dict[str, float]) → None
Log logs on the various objects watching training.
Subclass and override this method to inject custom behavior.
- Parameters
logs (Dict[str, float]) – The values to log.
-
num_examples(dataloader: torch.utils.data.dataloader.DataLoader) → int
Helper to get the number of samples in a DataLoader by accessing its dataset.
Will raise an exception if the underlying dataset does not implement the method __len__.
-
pop_callback(callback)
Remove a callback from the current list of TrainerCallback and return it.
If the callback is not found, returns None (and no error is raised).
- Parameters
callback (type or TrainerCallback) – A TrainerCallback class or an instance of a TrainerCallback. In the first case, will pop the first member of that class found in the list of callbacks.
- Returns
The callback removed, if found.
- Return type
TrainerCallback
-
predict(test_dataset: torch.utils.data.dataset.Dataset) → transformers.trainer_utils.PredictionOutput
Runs prediction and returns predictions and potential metrics.
Depending on the dataset and your use case, your test dataset may contain labels. In that case, this method will also return metrics, like in evaluate().
- Parameters
test_dataset (Dataset) – Dataset to run the predictions on. If it is a datasets.Dataset, columns not accepted by the model.forward() method are automatically removed. It must implement the method __len__.
Note: If your predictions or labels have different sequence lengths (for instance because you’re doing dynamic padding in a token classification task) the predictions will be padded (on the right) to allow for concatenation into one array. The padding index is -100.
Returns: a NamedTuple with the following keys:
predictions (np.ndarray): The predictions on test_dataset.
label_ids (np.ndarray, optional): The labels (if the dataset contained some).
metrics (Dict[str, float], optional): The potential dictionary of metrics (if the dataset contained labels).
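Because padded positions carry the index -100, they usually need to be skipped when scoring token-level predictions. The following sketch shows one way to do that in plain Python. It assumes the predictions have already been reduced to class ids and that the labels hold -100 at the positions to ignore. The function name token_accuracy is hypothetical.

```python
PAD_INDEX = -100  # padding index stated in the note above


def token_accuracy(predicted_ids, label_ids, pad_index=PAD_INDEX):
    """Accuracy over tokens whose label is not the padding index."""
    correct = 0
    total = 0
    for pred_row, label_row in zip(predicted_ids, label_ids):
        for pred, label in zip(pred_row, label_row):
            if label == pad_index:
                continue  # padded position, not a real token
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0


score = token_accuracy(
    predicted_ids=[[1, 2, 0], [3, 3, 3]],
    label_ids=[[1, 2, -100], [3, 0, 3]],
)
print(score)
```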
-
prediction_loop(dataloader: torch.utils.data.dataloader.DataLoader, description: str, prediction_loss_only: Optional[bool] = None) → transformers.trainer_utils.PredictionOutput
Prediction/evaluation loop, shared by Trainer.evaluate() and Trainer.predict().
Works both with or without labels.
-
prediction_step(model: torch.nn.modules.module.Module, inputs: Dict[str, Union[torch.Tensor, Any]], prediction_loss_only: bool) → Tuple[Optional[float], Optional[torch.Tensor], Optional[torch.Tensor]]
Perform an evaluation step on model using inputs.
Subclass and override to inject custom behavior.
- Parameters
model (nn.Module) – The model to evaluate.
inputs (Dict[str, Union[torch.Tensor, Any]]) – The inputs and targets of the model.
The dictionary will be unpacked before being fed to the model. Most models expect the targets under the argument labels. Check your model’s documentation for all accepted arguments.
prediction_loss_only (bool) – Whether or not to return the loss only.
- Returns
A tuple with the loss, logits and labels (each being optional).
- Return type
Tuple[Optional[float], Optional[torch.Tensor], Optional[torch.Tensor]]
-
remove_callback(callback)
Remove a callback from the current list of TrainerCallback.
- Parameters
callback (type or TrainerCallback) – A TrainerCallback class or an instance of a TrainerCallback. In the first case, will remove the first member of that class found in the list of callbacks.
-
save_model(output_dir: Optional[str] = None)
Will save the model, so you can reload it using from_pretrained().
Will only save from the global main process (unless on TPUs).
-
train(model_path: Optional[str] = None, trial: Union[optuna.Trial, Dict[str, Any]] = None)
Main training entry point.
- Parameters
model_path (str, optional) – Local path to the model if the model to train has been instantiated from a local path. If present, training will resume from the optimizer/scheduler states loaded here.
trial (optuna.Trial or Dict[str, Any], optional) – The trial run or the hyperparameter dictionary for hyperparameter search.
-
training_step(model: torch.nn.modules.module.Module, inputs: Dict[str, Union[torch.Tensor, Any]]) → torch.Tensor
Perform a training step on a batch of inputs.
Subclass and override to inject custom behavior.
- Parameters
model (nn.Module) – The model to train.
inputs (Dict[str, Union[torch.Tensor, Any]]) – The inputs and targets of the model.
The dictionary will be unpacked before being fed to the model. Most models expect the targets under the argument labels. Check your model’s documentation for all accepted arguments.
- Returns
The tensor with training loss on this batch.
- Return type
torch.Tensor
TFTrainer
-
class transformers.TFTrainer(model: transformers.modeling_tf_utils.TFPreTrainedModel, args: transformers.training_args_tf.TFTrainingArguments, train_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, eval_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None, compute_metrics: Optional[Callable[[transformers.trainer_utils.EvalPrediction], Dict]] = None, tb_writer: Optional[tensorflow.python.ops.summary_ops_v2.SummaryWriter] = None, optimizers: Tuple[tensorflow.python.keras.optimizer_v2.optimizer_v2.OptimizerV2, tensorflow.python.keras.optimizer_v2.learning_rate_schedule.LearningRateSchedule] = (None, None), **kwargs)
TFTrainer is a simple but feature-complete training and eval loop for TensorFlow, optimized for 🤗 Transformers.
- Parameters
model (TFPreTrainedModel) – The model to train, evaluate or use for predictions.
args (TFTrainingArguments) – The arguments to tweak training.
train_dataset (Dataset, optional) – The dataset to use for training. The dataset should yield tuples of (features, labels) where features is a dict of input features and labels is the labels. If labels is a tensor, the loss is calculated by the model by calling model(features, labels=labels). If labels is a dict, such as when using a QuestionAnswering head model with multiple targets, the loss is instead calculated by calling model(features, **labels).
eval_dataset (Dataset, optional) – The dataset to use for evaluation. The dataset should yield tuples of (features, labels) where features is a dict of input features and labels is the labels. If labels is a tensor, the loss is calculated by the model by calling model(features, labels=labels). If labels is a dict, such as when using a QuestionAnswering head model with multiple targets, the loss is instead calculated by calling model(features, **labels).
compute_metrics (Callable[[EvalPrediction], Dict], optional) – The function that will be used to compute metrics at evaluation. Must take an EvalPrediction and return a dictionary mapping strings to metric values.
tb_writer (tf.summary.SummaryWriter, optional) – Object to write to TensorBoard.
optimizers (Tuple[tf.keras.optimizers.Optimizer, tf.keras.optimizers.schedules.LearningRateSchedule], optional) – A tuple containing the optimizer and the scheduler to use. The optimizer defaults to an instance of tf.keras.optimizers.Adam if args.weight_decay_rate is 0, else an instance of AdamWeightDecay. The scheduler defaults to an instance of tf.keras.optimizers.schedules.PolynomialDecay if args.num_warmup_steps is 0, else an instance of WarmUp.
kwargs – Deprecated keyword arguments.
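The two calling conventions described for labels can be pictured with a small dispatch helper. This is only a sketch of the documented behavior, not TFTrainer's actual code: call_with_labels and fake_model are hypothetical names, and fake_model simply reports which keyword arguments it received.

```python
def call_with_labels(model, features, labels):
    """Call the model the way the dataset description above lays out."""
    if isinstance(labels, dict):
        # Several targets, e.g. start and end positions for question answering.
        return model(features, **labels)
    # A single tensor-like target.
    return model(features, labels=labels)


def fake_model(features, **kwargs):
    """Hypothetical stand-in for a model: returns the keyword names it got."""
    return sorted(kwargs)


single = call_with_labels(fake_model, {"input_ids": [1, 2]}, [0])
multi = call_with_labels(
    fake_model,
    {"input_ids": [1, 2]},
    {"start_positions": [0], "end_positions": [1]},
)
print(single, multi)
```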
-
create_optimizer_and_scheduler(num_training_steps: int)
Set up the optimizer and the learning rate scheduler.
We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the TFTrainer’s init through optimizers, or subclass and override this method.
-
evaluate(eval_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None) → Dict[str, float]
Runs evaluation and returns metrics.
The calling script is responsible for providing a method to compute metrics, as they are task-dependent (pass it to the init compute_metrics argument).
- Parameters
eval_dataset (Dataset, optional) – Pass a dataset if you wish to override self.eval_dataset. The dataset should yield tuples of (features, labels) where features is a dict of input features and labels is the labels. If labels is a tensor, the loss is calculated by the model by calling model(features, labels=labels). If labels is a dict, such as when using a QuestionAnswering head model with multiple targets, the loss is instead calculated by calling model(features, **labels).
- Returns
A dictionary containing the evaluation loss and the potential metrics computed from the predictions.
-
get_eval_tfdataset(eval_dataset: Optional[tensorflow.python.data.ops.dataset_ops.DatasetV2] = None) → tensorflow.python.data.ops.dataset_ops.DatasetV2
Returns the evaluation Dataset.
- Parameters
eval_dataset (Dataset, optional) – If provided, will override self.eval_dataset. The dataset should yield tuples of (features, labels) where features is a dict of input features and labels is the labels. If labels is a tensor, the loss is calculated by the model by calling model(features, labels=labels). If labels is a dict, such as when using a QuestionAnswering head model with multiple targets, the loss is instead calculated by calling model(features, **labels).
Subclass and override this method if you want to inject some custom behavior.
-
get_test_tfdataset(test_dataset: tensorflow.python.data.ops.dataset_ops.DatasetV2) → tensorflow.python.data.ops.dataset_ops.DatasetV2
Returns a test Dataset.
- Parameters
test_dataset (Dataset) – The dataset to use. The dataset should yield tuples of (features, labels) where features is a dict of input features and labels is the labels. If labels is a tensor, the loss is calculated by the model by calling model(features, labels=labels). If labels is a dict, such as when using a QuestionAnswering head model with multiple targets, the loss is instead calculated by calling model(features, **labels).
Subclass and override this method if you want to inject some custom behavior.
-
get_train_tfdataset() → tensorflow.python.data.ops.dataset_ops.DatasetV2
Returns the training Dataset.
Subclass and override this method if you want to inject some custom behavior.
-
log(logs: Dict[str, float]) → None
Log logs on the various objects watching training.
Subclass and override this method to inject custom behavior.
- Parameters
logs (Dict[str, float]) – The values to log.
-
predict(test_dataset: tensorflow.python.data.ops.dataset_ops.DatasetV2) → transformers.trainer_utils.PredictionOutput
Runs prediction and returns predictions and potential metrics.
Depending on the dataset and your use case, your test dataset may contain labels. In that case, this method will also return metrics, like in evaluate().
- Parameters
test_dataset (Dataset) – Dataset to run the predictions on. The dataset should yield tuples of (features, labels) where features is a dict of input features and labels is the labels. If labels is a tensor, the loss is calculated by the model by calling model(features, labels=labels). If labels is a dict, such as when using a QuestionAnswering head model with multiple targets, the loss is instead calculated by calling model(features, **labels).
Returns: a NamedTuple with the following keys:
predictions (np.ndarray): The predictions on test_dataset.
label_ids (np.ndarray, optional): The labels (if the dataset contained some).
metrics (Dict[str, float], optional): The potential dictionary of metrics (if the dataset contained labels).
-
prediction_loop(dataset: tensorflow.python.data.ops.dataset_ops.DatasetV2, steps: int, num_examples: int, description: str, prediction_loss_only: Optional[bool] = None) → transformers.trainer_utils.PredictionOutput
Prediction/evaluation loop, shared by evaluate() and predict().
Works both with or without labels.
-
prediction_step(features: tensorflow.python.framework.ops.Tensor, labels: tensorflow.python.framework.ops.Tensor, nb_instances_in_global_batch: tensorflow.python.framework.ops.Tensor) → tensorflow.python.framework.ops.Tensor
Compute the prediction on features and update the loss with labels.
Subclass and override to inject some custom behavior.
-
run_model(features, labels, training)
Computes the loss of the given features and labels pair.
Subclass and override this method if you want to inject some custom behavior.
- Parameters
features (tf.Tensor) – A batch of input features.
labels (tf.Tensor) – A batch of labels.
training (bool) – Whether or not to run the model in training mode.
- Returns
The loss and logits.
- Return type
A tuple of two tf.Tensor
-
save_model(output_dir: Optional[str] = None)
Will save the model, so you can reload it using from_pretrained().
-
setup_comet()
Set up the optional Comet.ml integration.
- Environment:
- COMET_MODE: (Optional) str - “OFFLINE”, “ONLINE”, or “DISABLED”
- COMET_PROJECT_NAME: (Optional) str - Comet.ml project name for experiments
- COMET_OFFLINE_DIRECTORY: (Optional) str - folder to use for saving offline experiments when COMET_MODE is “OFFLINE”
For the other configurable items in the environment, see the Comet.ml documentation.
-
setup_wandb()
Set up the optional Weights & Biases (wandb) integration.
One can subclass and override this method to customize the setup if needed. Find more information in the Weights & Biases documentation. You can also override the following environment variables:
- Environment:
- WANDB_PROJECT: (Optional) str - “huggingface” by default, set this to a custom string to store results in a different project.
- WANDB_DISABLED: (Optional) boolean - defaults to false, set to “true” to disable wandb entirely.
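The environment variables listed for setup_comet and setup_wandb can be set from Python before the trainer is created. The snippet below is a minimal sketch: the project name and the offline directory are hypothetical values, and setting the variables before the trainer is constructed is an assumption about when the integrations read them.

```python
import os

# Weights & Biases: store runs under a custom project (hypothetical name).
os.environ["WANDB_PROJECT"] = "my-experiments"

# Comet.ml: keep experiments offline in a local folder (hypothetical path).
os.environ["COMET_MODE"] = "OFFLINE"
os.environ["COMET_OFFLINE_DIRECTORY"] = "./comet_logs"

# To switch wandb off entirely instead, one would set WANDB_DISABLED to "true".
configured = {k: v for k, v in os.environ.items() if k.startswith(("WANDB_", "COMET_"))}
print(sorted(configured))
```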
TrainingArguments
-
class transformers.TrainingArguments(output_dir: str, overwrite_output_dir: bool = False, do_train: bool = False, do_eval: bool = None, do_predict: bool = False, evaluate_during_training: bool = False, evaluation_strategy: transformers.trainer_utils.EvaluationStrategy = 'no', prediction_loss_only: bool = False, per_device_train_batch_size: int = 8, per_device_eval_batch_size: int = 8, per_gpu_train_batch_size: Optional[int] = None, per_gpu_eval_batch_size: Optional[int] = None, gradient_accumulation_steps: int = 1, eval_accumulation_steps: Optional[int] = None, learning_rate: float = 5e-05, weight_decay: float = 0.0, adam_beta1: float = 0.9, adam_beta2: float = 0.999, adam_epsilon: float = 1e-08, max_grad_norm: float = 1.0, num_train_epochs: float = 3.0, max_steps: int = -1, warmup_steps: int = 0, logging_dir: Optional[str] = <factory>, logging_first_step: bool = False, logging_steps: int = 500, save_steps: int = 500, save_total_limit: Optional[int] = None, no_cuda: bool = False, seed: int = 42, fp16: bool = False, fp16_opt_level: str = 'O1', local_rank: int = -1, tpu_num_cores: Optional[int] = None, tpu_metrics_debug: bool = False, debug: bool = False, dataloader_drop_last: bool = False, eval_steps: int = None, dataloader_num_workers: int = 0, past_index: int = -1, run_name: Optional[str] = None, disable_tqdm: Optional[bool] = None, remove_unused_columns: Optional[bool] = True, label_names: Optional[List[str]] = None, load_best_model_at_end: Optional[bool] = False, metric_for_best_model: Optional[str] = None, greater_is_better: Optional[bool] = None)
TrainingArguments is the subset of the arguments we use in our example scripts which relate to the training loop itself.
Using HfArgumentParser we can turn this class into argparse arguments to be able to specify them on the command line.
- Parameters
output_dir (
str
) – The output directory where the model predictions and checkpoints will be written.overwrite_output_dir (
bool
, optional, defaults toFalse
) – IfTrue
, overwrite the content of the output directory. Use this to continue training ifoutput_dir
points to a checkpoint directory.do_train (
bool
, optional, defaults toFalse
) – Whether to run training or not. This argument is not directly used byTrainer
, it’s intended to be used by your training/evaluation scripts instead. See the example scripts for more details.do_eval (
bool
, optional) – Whether to run evaluation on the dev set or not. Will be set toTrue
ifevaluation_strategy
is different from"no"
. This argument is not directly used byTrainer
, it’s intended to be used by your training/evaluation scripts instead. See the example scripts for more details.do_predict (
bool
, optional, defaults toFalse
) – Whether to run predictions on the test set or not. This argument is not directly used byTrainer
, it’s intended to be used by your training/evaluation scripts instead. See the example scripts for more details.evaluation_strategy (
str
orEvaluationStrategy
, optional, defaults to"no"
) –The evaluation strategy to adopt during training. Possible values are:
"no"
: No evaluation is done during training."steps"
: Evaluation is done (and logged) everyeval_steps
."epoch"
: Evaluation is done at the end of each epoch.
prediction_loss_only (
bool
, optional, defaults to False) – When performing evaluation and predictions, only returns the loss.per_device_train_batch_size (
int
, optional, defaults to 8) – The batch size per GPU/TPU core/CPU for training.per_device_eval_batch_size (
int
, optional, defaults to 8) – The batch size per GPU/TPU core/CPU for evaluation.gradient_accumulation_steps (
int
, optional, defaults to 1) –Number of updates steps to accumulate the gradients for, before performing a backward/update pass.
Warning
When using gradient accumulation, one step is counted as one step with backward pass. Therefore, logging, evaluation, save will be conducted every
gradient_accumulation_steps * xxx_step
training examples.eval_accumulation_steps (
int
, optional) – Number of prediction steps to accumulate the output tensors for, before moving the results to the CPU. If left unset, all predictions are accumulated on GPU/TPU before being moved to the CPU (faster but requires more memory).
learning_rate (float, optional, defaults to 5e-5) – The initial learning rate for Adam.
weight_decay (float, optional, defaults to 0) – The weight decay to apply (if not zero).
adam_epsilon (float, optional, defaults to 1e-8) – Epsilon for the Adam optimizer.
max_grad_norm (float, optional, defaults to 1.0) – Maximum gradient norm (for gradient clipping).
num_train_epochs (float, optional, defaults to 3.0) – Total number of training epochs to perform (if not an integer, only the corresponding fraction of the last epoch is performed before training stops).
max_steps (int, optional, defaults to -1) – If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs.
warmup_steps (int, optional, defaults to 0) – Number of steps used for a linear warmup from 0 to learning_rate.
logging_dir (str, optional) – TensorBoard log directory. Will default to runs/**CURRENT_DATETIME_HOSTNAME**.
logging_first_step (bool, optional, defaults to False) – Whether to log and evaluate the first global_step or not.
logging_steps (int, optional, defaults to 500) – Number of update steps between two logs.
save_steps (int, optional, defaults to 500) – Number of update steps between two checkpoint saves.
save_total_limit (int, optional) – If a value is passed, will limit the total number of checkpoints. Deletes the older checkpoints in output_dir.
no_cuda (bool, optional, defaults to False) – Whether to avoid using CUDA even when it is available.
seed (int, optional, defaults to 42) – Random seed for initialization.
fp16 (bool, optional, defaults to False) – Whether to use 16-bit (mixed) precision training (through NVIDIA apex) instead of 32-bit training.
fp16_opt_level (str, optional, defaults to ‘O1’) – For fp16 training, apex AMP optimization level selected in [‘O0’, ‘O1’, ‘O2’, and ‘O3’]. See details in the apex documentation.
local_rank (int, optional, defaults to -1) – During distributed training, the rank of the process.
tpu_num_cores (int, optional) – When training on TPU, the number of TPU cores (automatically passed by the launcher script).
debug (bool, optional, defaults to False) – When training on TPU, whether to print debug metrics or not.
dataloader_drop_last (bool, optional, defaults to False) – Whether to drop the last incomplete batch (if the length of the dataset is not divisible by the batch size) or not.
eval_steps (int, optional) – Number of update steps between two evaluations if evaluation_strategy="steps". Will default to the same value as logging_steps if not set.
dataloader_num_workers (int, optional, defaults to 0) – Number of subprocesses to use for data loading (PyTorch only). 0 means that the data will be loaded in the main process.
past_index (int, optional, defaults to -1) – Some models like TransformerXL or XLNet can make use of the past hidden states for their predictions. If this argument is set to a positive int, the Trainer will use the corresponding output (usually index 2) as the past state and feed it to the model at the next training step under the keyword argument mems.
run_name (str, optional) – A descriptor for the run. Notably used for wandb logging.
disable_tqdm (bool, optional) – Whether or not to disable the tqdm progress bars. Will default to True if the logging level is set to warn or lower (default), False otherwise.
remove_unused_columns (bool, optional, defaults to True) – If using nlp.Dataset datasets, whether or not to automatically remove the columns unused by the model forward method.
(Note that this behavior is not implemented for TFTrainer yet.)
label_names (List[str], optional) – The list of keys in your dictionary of inputs that correspond to the labels. Will eventually default to ["labels"] except if the model used is one of the XxxForQuestionAnswering in which case it will default to ["start_positions", "end_positions"].
load_best_model_at_end (bool, optional, defaults to False) – Whether or not to load the best model found during training at the end of training.
Note
When set to True, the parameter save_steps will be ignored and the model will be saved after each evaluation.
metric_for_best_model (str, optional) – Use in conjunction with load_best_model_at_end to specify the metric to use to compare two different models. Must be the name of a metric returned by the evaluation with or without the prefix "eval_". Will default to "loss" if unspecified and load_best_model_at_end=True (to use the evaluation loss).
If you set this value, greater_is_better will default to True. Don’t forget to set it to False if your metric is better when lower.
greater_is_better (bool, optional) – Use in conjunction with load_best_model_at_end and metric_for_best_model to specify if better models should have a greater metric or not. Will default to:
True if metric_for_best_model is set to a value that isn’t "loss" or "eval_loss".
False if metric_for_best_model is not set, or set to "loss" or "eval_loss".
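The defaulting rules for metric_for_best_model and greater_is_better described above can be sketched in plain Python. This is a minimal illustration of the documented behavior, not the library's own code; the helper name resolve_best_model_defaults is hypothetical.

```python
from typing import Optional, Tuple


def resolve_best_model_defaults(
    load_best_model_at_end: bool = False,
    metric_for_best_model: Optional[str] = None,
    greater_is_better: Optional[bool] = None,
) -> Tuple[Optional[str], bool]:
    """Hypothetical helper mirroring the defaulting rules described above."""
    # "loss" is the fallback metric only when the best model is reloaded at the end.
    if load_best_model_at_end and metric_for_best_model is None:
        metric_for_best_model = "loss"
    # An explicit greater_is_better always wins; otherwise it is True for any
    # metric other than the loss, and False for the loss or for no metric at all.
    if greater_is_better is None:
        greater_is_better = metric_for_best_model not in (None, "loss", "eval_loss")
    return metric_for_best_model, greater_is_better


print(resolve_best_model_defaults(load_best_model_at_end=True))
print(resolve_best_model_defaults(True, metric_for_best_model="accuracy"))
print(resolve_best_model_defaults(True, "eval_perplexity", greater_is_better=False))
```

The last call shows the case the documentation warns about: a metric such as perplexity is better when lower, so greater_is_better has to be set to False explicitly.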
property device¶
The device used by this process.
property eval_batch_size¶
The actual batch size for evaluation (may differ from per_gpu_eval_batch_size in distributed training).
property n_gpu¶
The number of GPUs used by this process.
Note
This will only be greater than one when you have multiple GPUs available but are not using distributed training. For distributed training, it will always be 1.
to_dict()[source]¶
Serializes this instance while replacing Enum members by their values (for JSON serialization support).
to_sanitized_dict() → Dict[str, Any][source]¶
Sanitized serialization to use with TensorBoard’s hparams.
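To make the two serialization methods concrete, here is a minimal sketch on a toy dataclass. ToyArguments and Strategy are hypothetical stand-ins, and the sanitizing rule (keep bool, int, float and str, stringify everything else) is an assumption about what a hyperparameter logger can accept, not a statement of the library's exact implementation.

```python
import dataclasses
import json
from enum import Enum
from typing import Any, Dict, List, Optional


class Strategy(Enum):  # hypothetical stand-in for an Enum-valued argument
    NO = "no"
    STEPS = "steps"


@dataclasses.dataclass
class ToyArguments:  # hypothetical, not the library class
    output_dir: str
    evaluation_strategy: Strategy = Strategy.NO
    learning_rate: float = 5e-5
    label_names: Optional[List[str]] = None

    def to_dict(self) -> Dict[str, Any]:
        # Replace Enum members by their values so json.dumps accepts the result.
        d = dataclasses.asdict(self)
        return {k: v.value if isinstance(v, Enum) else v for k, v in d.items()}

    def to_sanitized_dict(self) -> Dict[str, Any]:
        # Assumption: only scalar types are kept as-is; anything else becomes a string.
        valid_types = (bool, int, float, str)
        return {k: v if isinstance(v, valid_types) else str(v) for k, v in self.to_dict().items()}


args = ToyArguments(output_dir="out", evaluation_strategy=Strategy.STEPS, label_names=["labels"])
print(json.dumps(args.to_dict()))
print(args.to_sanitized_dict())
```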
property train_batch_size¶
The actual batch size for training (may differ from per_gpu_train_batch_size in distributed training).
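The difference between the per-device batch size and the actual batch size can be illustrated with a little arithmetic. The sketch below assumes that the per-process batch size is the per-device size multiplied by max(1, n_gpu), and that the number of examples per optimizer update also scales with gradient_accumulation_steps and with the number of distributed processes; both function names and the num_processes parameter are hypothetical.

```python
def process_batch_size(per_device_batch_size: int, n_gpu: int) -> int:
    # Assumption: with several GPUs in a single (non-distributed) process, each
    # batch is split across them; on CPU (n_gpu == 0) the size is unchanged.
    return per_device_batch_size * max(1, n_gpu)


def examples_per_update(
    per_device_batch_size: int,
    n_gpu: int = 1,
    gradient_accumulation_steps: int = 1,
    num_processes: int = 1,  # hypothetical: number of distributed processes
) -> int:
    # Each process accumulates gradients over several batches before one update.
    per_process = process_batch_size(per_device_batch_size, n_gpu)
    return per_process * gradient_accumulation_steps * num_processes


print(process_batch_size(8, n_gpu=4))
print(examples_per_update(8, n_gpu=1, gradient_accumulation_steps=4, num_processes=2))
```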
TFTrainingArguments¶
class transformers.TFTrainingArguments(output_dir: str, overwrite_output_dir: bool = False, do_train: bool = False, do_eval: bool = None, do_predict: bool = False, evaluate_during_training: bool = False, evaluation_strategy: transformers.trainer_utils.EvaluationStrategy = 'no', prediction_loss_only: bool = False, per_device_train_batch_size: int = 8, per_device_eval_batch_size: int = 8, per_gpu_train_batch_size: Optional[int] = None, per_gpu_eval_batch_size: Optional[int] = None, gradient_accumulation_steps: int = 1, eval_accumulation_steps: Optional[int] = None, learning_rate: float = 5e-05, weight_decay: float = 0.0, adam_beta1: float = 0.9, adam_beta2: float = 0.999, adam_epsilon: float = 1e-08, max_grad_norm: float = 1.0, num_train_epochs: float = 3.0, max_steps: int = -1, warmup_steps: int = 0, logging_dir: Optional[str] = <factory>, logging_first_step: bool = False, logging_steps: int = 500, save_steps: int = 500, save_total_limit: Optional[int] = None, no_cuda: bool = False, seed: int = 42, fp16: bool = False, fp16_opt_level: str = 'O1', local_rank: int = -1, tpu_num_cores: Optional[int] = None, tpu_metrics_debug: bool = False, debug: bool = False, dataloader_drop_last: bool = False, eval_steps: int = None, dataloader_num_workers: int = 0, past_index: int = -1, run_name: Optional[str] = None, disable_tqdm: Optional[bool] = None, remove_unused_columns: Optional[bool] = True, label_names: Optional[List[str]] = None, load_best_model_at_end: Optional[bool] = False, metric_for_best_model: Optional[str] = None, greater_is_better: Optional[bool] = None, tpu_name: str = None, poly_power: float = 1.0, xla: bool = False)[source]¶
TrainingArguments is the subset of the arguments we use in our example scripts which relate to the training loop itself.
Using HfArgumentParser we can turn this class into argparse arguments to be able to specify them on the command line.
Parameters
output_dir (str) – The output directory where the model predictions and checkpoints will be written.
overwrite_output_dir (bool, optional, defaults to False) – If True, overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory.
do_train (bool, optional, defaults to False) – Whether to run training or not.
do_eval (bool, optional, defaults to False) – Whether to run evaluation on the dev set or not.
do_predict (bool, optional, defaults to False) – Whether to run predictions on the test set or not.
evaluate_during_training (bool, optional, defaults to False) – Whether to run evaluation during training at each logging step or not.
per_device_train_batch_size (int, optional, defaults to 8) – The batch size per GPU/TPU core/CPU for training.
per_device_eval_batch_size (int, optional, defaults to 8) – The batch size per GPU/TPU core/CPU for evaluation.
gradient_accumulation_steps (int, optional, defaults to 1) – Number of update steps to accumulate the gradients for, before performing a backward/update pass.
Warning
When using gradient accumulation, one step is counted as one step with backward pass. Therefore, logging, evaluation, and saving will be conducted every gradient_accumulation_steps * xxx_step training examples.
learning_rate (float, optional, defaults to 5e-5) – The initial learning rate for Adam.
weight_decay (float, optional, defaults to 0) – The weight decay to apply (if not zero).
adam_epsilon (float, optional, defaults to 1e-8) – Epsilon for the Adam optimizer.
max_grad_norm (float, optional, defaults to 1.0) – Maximum gradient norm (for gradient clipping).
num_train_epochs (float, optional, defaults to 3.0) – Total number of training epochs to perform.
max_steps (int, optional, defaults to -1) – If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs.
warmup_steps (int, optional, defaults to 0) – Number of steps used for a linear warmup from 0 to learning_rate.
logging_dir (str, optional) – TensorBoard log directory. Will default to runs/**CURRENT_DATETIME_HOSTNAME**.
logging_first_step (bool, optional, defaults to False) – Whether to log and evaluate the first global_step or not.
logging_steps (int, optional, defaults to 500) – Number of update steps between two logs.
save_steps (int, optional, defaults to 500) – Number of update steps between two checkpoint saves.
save_total_limit (int, optional) – If a value is passed, will limit the total number of checkpoints. Deletes the older checkpoints in output_dir.
no_cuda (bool, optional, defaults to False) – Whether to avoid using CUDA even when it is available.
seed (int, optional, defaults to 42) – Random seed for initialization.
fp16 (bool, optional, defaults to False) – Whether to use 16-bit (mixed) precision training (through NVIDIA apex) instead of 32-bit training.
fp16_opt_level (str, optional, defaults to ‘O1’) – For fp16 training, apex AMP optimization level selected in [‘O0’, ‘O1’, ‘O2’, and ‘O3’]. See details in the apex documentation.
local_rank (int, optional, defaults to -1) – During distributed training, the rank of the process.
tpu_num_cores (int, optional) – When training on TPU, the number of TPU cores (automatically passed by the launcher script).
debug (bool, optional, defaults to False) – Whether to activate the trace to record computation graphs and profiling information or not.
dataloader_drop_last (bool, optional, defaults to False) – Whether to drop the last incomplete batch (if the length of the dataset is not divisible by the batch size) or not.
eval_steps (int, optional, defaults to 1000) – Number of update steps between two evaluations.
past_index (int, optional, defaults to -1) – Some models like TransformerXL or XLNet can make use of the past hidden states for their predictions. If this argument is set to a positive int, the Trainer will use the corresponding output (usually index 2) as the past state and feed it to the model at the next training step under the keyword argument mems.
tpu_name (str, optional) – The name of the TPU the process is running on.
run_name (str, optional) – A descriptor for the run. Notably used for wandb logging.
xla (bool, optional) – Whether to activate the XLA compilation or not.
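As an illustration of the warmup_steps parameter listed above, the following sketch computes a learning rate that rises linearly from 0 to learning_rate over the warmup period. The function name is hypothetical, and holding the rate constant after warmup is a simplifying assumption: the decay applied afterwards is not described in this parameter list and is left out.

```python
def warmup_learning_rate(step: int, learning_rate: float = 5e-5, warmup_steps: int = 0) -> float:
    """Hypothetical helper: linear warmup from 0 to learning_rate."""
    if warmup_steps > 0 and step < warmup_steps:
        # Fraction of the warmup period completed so far.
        return learning_rate * step / warmup_steps
    # Assumption: constant afterwards (any decay schedule is not modeled here).
    return learning_rate


for step in (0, 50, 100, 200):
    print(step, warmup_learning_rate(step, learning_rate=1e-3, warmup_steps=100))
```

With the default warmup_steps of 0 the function returns learning_rate from the first step, which matches the idea that no warmup is applied unless requested.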
property eval_batch_size¶
The actual batch size for evaluation (may differ from per_gpu_eval_batch_size in distributed training).
property n_gpu¶
The number of replicas (CPUs, GPUs or TPU cores) used in this training.
property n_replicas¶
The number of replicas (CPUs, GPUs or TPU cores) used in this training.
property strategy¶
The strategy used for distributed training.
property train_batch_size¶
The actual batch size for training (may differ from per_gpu_train_batch_size in distributed training).
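The class description mentions that HfArgumentParser can turn an arguments class into argparse arguments. The idea behind that conversion can be sketched with the standard library alone. ToyTrainingArguments and build_parser below are hypothetical and heavily simplified (no booleans, Optional types or help strings); they stand in for the concept and are not the library's implementation.

```python
import argparse
import dataclasses


@dataclasses.dataclass
class ToyTrainingArguments:  # hypothetical subset of the real arguments
    output_dir: str
    learning_rate: float = 5e-5
    num_train_epochs: float = 3.0
    seed: int = 42


def build_parser(dataclass_type) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    for field in dataclasses.fields(dataclass_type):
        # A field without a default becomes a required command-line option.
        required = field.default is dataclasses.MISSING
        kwargs = {"type": field.type, "required": required}
        if not required:
            kwargs["default"] = field.default
        parser.add_argument(f"--{field.name}", **kwargs)
    return parser


namespace = build_parser(ToyTrainingArguments).parse_args(
    ["--output_dir", "out", "--learning_rate", "3e-5"]
)
toy_args = ToyTrainingArguments(**vars(namespace))
print(toy_args)
```

Options that are not given on the command line keep the defaults declared on the dataclass, which is what makes this pattern convenient for example scripts.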