Optimum documentation

GaudiTrainer

You are viewing v1.8.6 version. A newer version v1.19.0 is available.
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

GaudiTrainer

The GaudiTrainer class provides an extended API for the feature-complete Transformers Trainer. It is used in all the example scripts.

Before instantiating your GaudiTrainer, create a GaudiTrainingArguments object to access all the points of customization during training.

The GaudiTrainer class is optimized for 🤗 Transformers models running on Habana Gaudi.

Here is an example of how to customize GaudiTrainer to use a weighted loss (useful when you have an unbalanced training set):

from torch import nn
from optimum.habana import GaudiTrainer


class CustomGaudiTrainer(GaudiTrainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute custom loss (suppose one has 3 labels with different weights)
        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0]))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

Another way to customize the training loop behavior for the PyTorch GaudiTrainer is to use callbacks that can inspect the training loop state (for progress reporting, logging on TensorBoard or other ML platforms…) and take decisions (like early stopping).

GaudiTrainer

class optimum.habana.GaudiTrainer

< >

( model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module] = None gaudi_config: GaudiConfig = None args: TrainingArguments = None data_collator: typing.Optional[DataCollator] = None train_dataset: typing.Optional[torch.utils.data.dataset.Dataset] = None eval_dataset: typing.Union[torch.utils.data.dataset.Dataset, typing.Dict[str, torch.utils.data.dataset.Dataset], NoneType] = None tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None model_init: typing.Union[typing.Callable[[], transformers.modeling_utils.PreTrainedModel], NoneType] = None compute_metrics: typing.Union[typing.Callable[[transformers.trainer_utils.EvalPrediction], typing.Dict], NoneType] = None callbacks: typing.Optional[typing.List[transformers.trainer_callback.TrainerCallback]] = None optimizers: typing.Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None) preprocess_logits_for_metrics: typing.Union[typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor], NoneType] = None )

GaudiTrainer is built on top of the tranformers’ Trainer to enable deployment on Habana’s Gaudi.

create_optimizer

< >

( )

Setup the optimizer. We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the Trainer’s init through optimizers, or subclass and override this method in a subclass.

evaluation_loop

< >

( dataloader: DataLoader description: str prediction_loss_only: typing.Optional[bool] = None ignore_keys: typing.Optional[typing.List[str]] = None metric_key_prefix: str = 'eval' )

Prediction/evaluation loop, shared by Trainer.evaluate() and Trainer.predict(). Works both with or without labels.

log

< >

( logs: typing.Dict[str, float] )

Parameters

  • logs (Dict[str, float]) — The values to log.

Log logs on the various objects watching training. Subclass and override this method to inject custom behavior.

prediction_loop

< >

( dataloader: DataLoader description: str prediction_loss_only: typing.Optional[bool] = None ignore_keys: typing.Optional[typing.List[str]] = None metric_key_prefix: str = 'eval' )

Prediction/evaluation loop, shared by Trainer.evaluate() and Trainer.predict(). Works both with or without labels.

prediction_step

< >

( model: Module inputs: typing.Dict[str, typing.Union[torch.Tensor, typing.Any]] prediction_loss_only: bool ignore_keys: typing.Optional[typing.List[str]] = None ) Tuple[Optional[torch.Tensor], Optional[torch.Tensor], Optional[torch.Tensor]]

Parameters

  • model (torch.nn.Module) — The model to evaluate.
  • inputs (Dict[str, Union[torch.Tensor, Any]]) — The inputs and targets of the model. The dictionary will be unpacked before being fed to the model. Most models expect the targets under the argument labels. Check your model’s documentation for all accepted arguments.
  • prediction_loss_only (bool) — Whether or not to return the loss only.
  • ignore_keys (List[str], optional) — A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions.

Returns

Tuple[Optional[torch.Tensor], Optional[torch.Tensor], Optional[torch.Tensor]]

A tuple with the loss, logits and labels (each being optional).

Perform an evaluation step on model using inputs. Subclass and override to inject custom behavior.

save_model

< >

( output_dir: typing.Optional[str] = None _internal_call: bool = False )

Will save the model, so you can reload it using from_pretrained(). Will only save from the main process.

train

< >

( resume_from_checkpoint: typing.Union[bool, str, NoneType] = None trial: typing.Union[ForwardRef('optuna.Trial'), typing.Dict[str, typing.Any]] = None ignore_keys_for_eval: typing.Optional[typing.List[str]] = None **kwargs )

Parameters

  • resume_from_checkpoint (str or bool, optional) — If a str, local path to a saved checkpoint as saved by a previous instance of Trainer. If a bool and equals True, load the last checkpoint in args.output_dir as saved by a previous instance of Trainer. If present, training will resume from the model/optimizer/scheduler states loaded here.
  • trial (optuna.Trial or Dict[str, Any], optional) — The trial run or the hyperparameter dictionary for hyperparameter search.
  • ignore_keys_for_eval (List[str], optional) — A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions for evaluation during the training. kwargs — Additional keyword arguments used to hide deprecated arguments

Main training entry point.

training_step

< >

( model: Module inputs: typing.Dict[str, typing.Union[torch.Tensor, typing.Any]] ) torch.Tensor

Parameters

  • model (torch.nn.Module) — The model to train.
  • inputs (Dict[str, Union[torch.Tensor, Any]]) — The inputs and targets of the model.

    The dictionary will be unpacked before being fed to the model. Most models expect the targets under the argument labels. Check your model’s documentation for all accepted arguments.

Returns

torch.Tensor

The tensor with training loss on this batch.

Perform a training step on a batch of inputs.

Subclass and override to inject custom behavior.

GaudiSeq2SeqTrainer

class optimum.habana.GaudiSeq2SeqTrainer

< >

( model: typing.Union[ForwardRef('PreTrainedModel'), torch.nn.modules.module.Module] = None gaudi_config: GaudiConfig = None args: GaudiTrainingArguments = None data_collator: typing.Optional[ForwardRef('DataCollator')] = None train_dataset: typing.Optional[torch.utils.data.dataset.Dataset] = None eval_dataset: typing.Union[torch.utils.data.dataset.Dataset, typing.Dict[str, torch.utils.data.dataset.Dataset], NoneType] = None tokenizer: typing.Optional[ForwardRef('PreTrainedTokenizerBase')] = None model_init: typing.Union[typing.Callable[[], ForwardRef('PreTrainedModel')], NoneType] = None compute_metrics: typing.Union[typing.Callable[[ForwardRef('EvalPrediction')], typing.Dict], NoneType] = None callbacks: typing.Optional[typing.List[ForwardRef('TrainerCallback')]] = None optimizers: typing.Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None) preprocess_logits_for_metrics: typing.Union[typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor], NoneType] = None )

evaluate

< >

( eval_dataset: typing.Optional[torch.utils.data.dataset.Dataset] = None ignore_keys: typing.Optional[typing.List[str]] = None metric_key_prefix: str = 'eval' **gen_kwargs )

Parameters

  • eval_dataset (Dataset, optional) — Pass a dataset if you wish to override self.eval_dataset. If it is an Dataset, columns not accepted by the model.forward() method are automatically removed. It must implement the __len__ method.
  • ignore_keys (List[str], optional) — A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions.
  • metric_key_prefix (str, optional, defaults to "eval") — An optional prefix to be used as the metrics key prefix. For example the metrics “bleu” will be named “eval_bleu” if the prefix is "eval" (default)
  • max_length (int, optional) — The maximum target length to use when predicting with the generate method.
  • num_beams (int, optional) — Number of beams for beam search that will be used when predicting with the generate method. 1 means no beam search. gen_kwargs — Additional generate specific kwargs.

Run evaluation and returns metrics. The calling script will be responsible for providing a method to compute metrics, as they are task-dependent (pass it to the init compute_metrics argument). You can also subclass and override this method to inject custom behavior.

predict

< >

( test_dataset: Dataset ignore_keys: typing.Optional[typing.List[str]] = None metric_key_prefix: str = 'test' **gen_kwargs )

Parameters

  • test_dataset (Dataset) — Dataset to run the predictions on. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed. Has to implement the method __len__
  • ignore_keys (List[str], optional) — A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions.
  • metric_key_prefix (str, optional, defaults to "eval") — An optional prefix to be used as the metrics key prefix. For example the metrics “bleu” will be named “eval_bleu” if the prefix is "eval" (default)
  • max_length (int, optional) — The maximum target length to use when predicting with the generate method.
  • num_beams (int, optional) — Number of beams for beam search that will be used when predicting with the generate method. 1 means no beam search. gen_kwargs — Additional generate specific kwargs.

Run prediction and returns predictions and potential metrics. Depending on the dataset and your use case, your test dataset may contain labels. In that case, this method will also return metrics, like in evaluate().

If your predictions or labels have different sequence lengths (for instance because you're doing dynamic padding in a token classification task) the predictions will be padded (on the right) to allow for concatenation into one array. The padding index is -100.
Returns: *NamedTuple* A namedtuple with the following keys: - predictions (`np.ndarray`): The predictions on `test_dataset`. - label_ids (`np.ndarray`, *optional*): The labels (if the dataset contained some). - metrics (`Dict[str, float]`, *optional*): The potential dictionary of metrics (if the dataset contained labels).

GaudiTrainingArguments

class optimum.habana.GaudiTrainingArguments

< >

( output_dir: str overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False do_predict: bool = False evaluation_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'no' prediction_loss_only: bool = False per_device_train_batch_size: int = 8 per_device_eval_batch_size: int = 8 per_gpu_train_batch_size: typing.Optional[int] = None per_gpu_eval_batch_size: typing.Optional[int] = None gradient_accumulation_steps: int = 1 eval_accumulation_steps: typing.Optional[int] = None eval_delay: typing.Optional[float] = 0 learning_rate: float = 5e-05 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: typing.Optional[float] = 1e-06 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: typing.Union[transformers.trainer_utils.SchedulerType, str] = 'linear' warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: typing.Optional[str] = 'passive' log_level_replica: typing.Optional[str] = 'warning' log_on_each_node: bool = True logging_dir: typing.Optional[str] = None logging_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps' logging_first_step: bool = False logging_steps: int = 500 logging_nan_inf_filter: typing.Optional[bool] = False save_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps' save_steps: int = 500 save_total_limit: typing.Optional[int] = None save_safetensors: typing.Optional[bool] = False save_on_each_node: bool = False no_cuda: bool = False use_mps_device: bool = False seed: int = 42 data_seed: typing.Optional[int] = None jit_mode_eval: bool = False use_ipex: bool = False bf16: bool = False fp16: bool = False fp16_opt_level: str = 'O1' half_precision_backend: str = 'auto' bf16_full_eval: bool = False fp16_full_eval: bool = False tf32: typing.Optional[bool] = None local_rank: int = -1 xpu_backend: typing.Optional[str] = None tpu_num_cores: typing.Optional[int] = None tpu_metrics_debug: bool = False debug: str = '' dataloader_drop_last: bool = False eval_steps: typing.Optional[int] = None dataloader_num_workers: int = 0 past_index: int = -1 run_name: typing.Optional[str] = None disable_tqdm: typing.Optional[bool] = None remove_unused_columns: typing.Optional[bool] = True label_names: typing.Optional[typing.List[str]] = None load_best_model_at_end: typing.Optional[bool] = False metric_for_best_model: typing.Optional[str] = None greater_is_better: typing.Optional[bool] = None ignore_data_skip: bool = False sharded_ddp: str = '' fsdp: str = '' fsdp_min_num_params: int = 0 fsdp_config: typing.Optional[str] = None fsdp_transformer_layer_cls_to_wrap: typing.Optional[str] = None deepspeed: typing.Optional[str] = None label_smoothing_factor: float = 0.0 optim: typing.Union[transformers.training_args.OptimizerNames, str, NoneType] = 'adamw_torch' optim_args: typing.Optional[str] = None adafactor: bool = False group_by_length: bool = False length_column_name: typing.Optional[str] = 'length' report_to: typing.Optional[typing.List[str]] = None ddp_find_unused_parameters: typing.Optional[bool] = False ddp_bucket_cap_mb: typing.Optional[int] = 230 dataloader_pin_memory: bool = True skip_memory_metrics: bool = True use_legacy_prediction_loop: bool = False push_to_hub: bool = False resume_from_checkpoint: typing.Optional[str] = None hub_model_id: typing.Optional[str] = None hub_strategy: typing.Union[transformers.trainer_utils.HubStrategy, str] = 'every_save' hub_token: typing.Optional[str] = None hub_private_repo: bool = False gradient_checkpointing: bool = False include_inputs_for_metrics: bool = False fp16_backend: str = 'auto' push_to_hub_model_id: typing.Optional[str] = None push_to_hub_organization: typing.Optional[str] = None push_to_hub_token: typing.Optional[str] = None mp_parameters: str = '' auto_find_batch_size: bool = False full_determinism: bool = False torchdynamo: typing.Optional[str] = None ray_scope: typing.Optional[str] = 'last' ddp_timeout: typing.Optional[int] = 1800 torch_compile: bool = False torch_compile_backend: typing.Optional[str] = None torch_compile_mode: typing.Optional[str] = None use_habana: typing.Optional[bool] = False gaudi_config_name: typing.Optional[str] = None use_lazy_mode: typing.Optional[bool] = False use_hpu_graphs: typing.Optional[bool] = False throughput_warmup_steps: typing.Optional[int] = 0 pipelining_fwd_bwd: typing.Optional[bool] = False non_blocking_data_copy: typing.Optional[bool] = False )

Parameters

  • use_habana (bool, optional, defaults to False) — Whether to use Habana’s HPU for running the model.
  • gaudi_config_name (str, optional) — Pretrained Gaudi config name or path.
  • use_lazy_mode (bool, optional, defaults to False) — Whether to use lazy mode for running the model.
  • use_hpu_graphs (bool, optional, defaults to False) — Whether to use HPU graphs for performing inference.
  • throughput_warmup_steps (int, optional, defaults to 0) — Number of steps to ignore for throughput calculation. For example, with throughput_warmup_steps=N, the first N steps will not be considered in the calculation of the throughput. This is especially useful in lazy mode where the first two or three iterations typically take longer.
  • pipelining_fwd_bwd (bool, optional, defaults to False) — Whether to add an additional mark_step between forward and backward for pipelining host backward building and HPU forward computing.
  • non_blocking_data_copy (bool, optional, defaults to False) — Whether to enable async data copy when preparing inputs.

GaudiTrainingArguments is built on top of the Tranformers’ TrainingArguments to enable deployment on Habana’s Gaudi.

GaudiSeq2SeqTrainingArguments

class optimum.habana.GaudiSeq2SeqTrainingArguments

< >

( output_dir: str overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False do_predict: bool = False evaluation_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'no' prediction_loss_only: bool = False per_device_train_batch_size: int = 8 per_device_eval_batch_size: int = 8 per_gpu_train_batch_size: typing.Optional[int] = None per_gpu_eval_batch_size: typing.Optional[int] = None gradient_accumulation_steps: int = 1 eval_accumulation_steps: typing.Optional[int] = None eval_delay: typing.Optional[float] = 0 learning_rate: float = 5e-05 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: typing.Optional[float] = 1e-06 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: typing.Union[transformers.trainer_utils.SchedulerType, str] = 'linear' warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: typing.Optional[str] = 'passive' log_level_replica: typing.Optional[str] = 'warning' log_on_each_node: bool = True logging_dir: typing.Optional[str] = None logging_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps' logging_first_step: bool = False logging_steps: int = 500 logging_nan_inf_filter: typing.Optional[bool] = False save_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps' save_steps: int = 500 save_total_limit: typing.Optional[int] = None save_safetensors: typing.Optional[bool] = False save_on_each_node: bool = False no_cuda: bool = False use_mps_device: bool = False seed: int = 42 data_seed: typing.Optional[int] = None jit_mode_eval: bool = False use_ipex: bool = False bf16: bool = False fp16: bool = False fp16_opt_level: str = 'O1' half_precision_backend: str = 'auto' bf16_full_eval: bool = False fp16_full_eval: bool = False tf32: typing.Optional[bool] = None local_rank: int = -1 xpu_backend: typing.Optional[str] = None tpu_num_cores: typing.Optional[int] = None tpu_metrics_debug: bool = False debug: str = '' dataloader_drop_last: bool = False eval_steps: typing.Optional[int] = None dataloader_num_workers: int = 0 past_index: int = -1 run_name: typing.Optional[str] = None disable_tqdm: typing.Optional[bool] = None remove_unused_columns: typing.Optional[bool] = True label_names: typing.Optional[typing.List[str]] = None load_best_model_at_end: typing.Optional[bool] = False metric_for_best_model: typing.Optional[str] = None greater_is_better: typing.Optional[bool] = None ignore_data_skip: bool = False sharded_ddp: str = '' fsdp: str = '' fsdp_min_num_params: int = 0 fsdp_config: typing.Optional[str] = None fsdp_transformer_layer_cls_to_wrap: typing.Optional[str] = None deepspeed: typing.Optional[str] = None label_smoothing_factor: float = 0.0 optim: typing.Union[transformers.training_args.OptimizerNames, str, NoneType] = 'adamw_torch' optim_args: typing.Optional[str] = None adafactor: bool = False group_by_length: bool = False length_column_name: typing.Optional[str] = 'length' report_to: typing.Optional[typing.List[str]] = None ddp_find_unused_parameters: typing.Optional[bool] = False ddp_bucket_cap_mb: typing.Optional[int] = 230 dataloader_pin_memory: bool = True skip_memory_metrics: bool = True use_legacy_prediction_loop: bool = False push_to_hub: bool = False resume_from_checkpoint: typing.Optional[str] = None hub_model_id: typing.Optional[str] = None hub_strategy: typing.Union[transformers.trainer_utils.HubStrategy, str] = 'every_save' hub_token: typing.Optional[str] = None hub_private_repo: bool = False gradient_checkpointing: bool = False include_inputs_for_metrics: bool = False fp16_backend: str = 'auto' push_to_hub_model_id: typing.Optional[str] = None push_to_hub_organization: typing.Optional[str] = None push_to_hub_token: typing.Optional[str] = None mp_parameters: str = '' auto_find_batch_size: bool = False full_determinism: bool = False torchdynamo: typing.Optional[str] = None ray_scope: typing.Optional[str] = 'last' ddp_timeout: typing.Optional[int] = 1800 torch_compile: bool = False torch_compile_backend: typing.Optional[str] = None torch_compile_mode: typing.Optional[str] = None use_habana: typing.Optional[bool] = False gaudi_config_name: typing.Optional[str] = None use_lazy_mode: typing.Optional[bool] = False use_hpu_graphs: typing.Optional[bool] = False throughput_warmup_steps: typing.Optional[int] = 0 pipelining_fwd_bwd: typing.Optional[bool] = False non_blocking_data_copy: typing.Optional[bool] = False sortish_sampler: bool = False predict_with_generate: bool = False generation_max_length: typing.Optional[int] = None generation_num_beams: typing.Optional[int] = None generation_config: typing.Union[str, pathlib.Path, transformers.generation.configuration_utils.GenerationConfig, NoneType] = None )

Parameters

  • sortish_sampler (bool, optional, defaults to False) — Whether to use a sortish sampler or not. Only possible if the underlying datasets are Seq2SeqDataset for now but will become generally available in the near future. It sorts the inputs according to lengths in order to minimize the padding size, with a bit of randomness for the training set.
  • predict_with_generate (bool, optional, defaults to False) — Whether to use generate to calculate generative metrics (ROUGE, BLEU).
  • generation_max_length (int, optional) — The max_length to use on each evaluation loop when predict_with_generate=True. Will default to the max_length value of the model configuration.
  • generation_num_beams (int, optional) — The num_beams to use on each evaluation loop when predict_with_generate=True. Will default to the num_beams value of the model configuration.
  • generation_config (str or Path or transformers.generation.GenerationConfig, optional) — Allows to load a transformers.generation.GenerationConfig from the from_pretrained method. This can be either:

    • a string, the model id of a pretrained model configuration hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.
    • a path to a directory containing a configuration file saved using the transformers.GenerationConfig.save_pretrained method, e.g., ./my_model_directory/.
    • a transformers.generation.GenerationConfig object.

GaudiSeq2SeqTrainingArguments is built on top of the Tranformers’ Seq2SeqTrainingArguments to enable deployment on Habana’s Gaudi.