TRL documentation

Trainer

You are viewing the main version, which requires installation from source. If you'd like a regular pip install, check out the latest stable version (v0.13.0).

At TRL we support PPO (Proximal Policy Optimisation) with an implementation that largely follows the structure introduced in the paper “Fine-Tuning Language Models from Human Preferences” by D. Ziegler et al. [paper, code]. The Trainer and model classes are largely inspired by the transformers.Trainer and transformers.AutoModel classes and adapted for RL. We also support a RewardTrainer that can be used to train a reward model.

CPOConfig

class trl.CPOConfig

( output_dir: str overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False do_predict: bool = False eval_strategy: Union = 'no' prediction_loss_only: bool = False per_device_train_batch_size: int = 8 per_device_eval_batch_size: int = 8 per_gpu_train_batch_size: Optional = None per_gpu_eval_batch_size: Optional = None gradient_accumulation_steps: int = 1 eval_accumulation_steps: Optional = None eval_delay: Optional = 0 torch_empty_cache_steps: Optional = None learning_rate: float = 5e-05 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: float = 1e-08 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: Union = 'linear' lr_scheduler_kwargs: Union = <factory> warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: Optional = 'passive' log_level_replica: Optional = 'warning' log_on_each_node: bool = True logging_dir: Optional = None logging_strategy: Union = 'steps' logging_first_step: bool = False logging_steps: float = 500 logging_nan_inf_filter: bool = True save_strategy: Union = 'steps' save_steps: float = 500 save_total_limit: Optional = None save_safetensors: Optional = True save_on_each_node: bool = False save_only_model: bool = False restore_callback_states_from_checkpoint: bool = False no_cuda: bool = False use_cpu: bool = False use_mps_device: bool = False seed: int = 42 data_seed: Optional = None jit_mode_eval: bool = False use_ipex: bool = False bf16: bool = False fp16: bool = False fp16_opt_level: str = 'O1' half_precision_backend: str = 'auto' bf16_full_eval: bool = False fp16_full_eval: bool = False tf32: Optional = None local_rank: int = -1 ddp_backend: Optional = None tpu_num_cores: Optional = None tpu_metrics_debug: bool = False debug: Union = '' dataloader_drop_last: bool = False eval_steps: Optional = None dataloader_num_workers: int = 0 dataloader_prefetch_factor: Optional = None past_index: int = -1 run_name: Optional = None disable_tqdm: 
Optional = None remove_unused_columns: Optional = True label_names: Optional = None load_best_model_at_end: Optional = False metric_for_best_model: Optional = None greater_is_better: Optional = None ignore_data_skip: bool = False fsdp: Union = '' fsdp_min_num_params: int = 0 fsdp_config: Union = None fsdp_transformer_layer_cls_to_wrap: Optional = None accelerator_config: Union = None deepspeed: Union = None label_smoothing_factor: float = 0.0 optim: Union = 'adamw_torch' optim_args: Optional = None adafactor: bool = False group_by_length: bool = False length_column_name: Optional = 'length' report_to: Union = None ddp_find_unused_parameters: Optional = None ddp_bucket_cap_mb: Optional = None ddp_broadcast_buffers: Optional = None dataloader_pin_memory: bool = True dataloader_persistent_workers: bool = False skip_memory_metrics: bool = True use_legacy_prediction_loop: bool = False push_to_hub: bool = False resume_from_checkpoint: Optional = None hub_model_id: Optional = None hub_strategy: Union = 'every_save' hub_token: Optional = None hub_private_repo: bool = False hub_always_push: bool = False gradient_checkpointing: bool = False gradient_checkpointing_kwargs: Union = None include_inputs_for_metrics: bool = False eval_do_concat_batches: bool = True fp16_backend: str = 'auto' evaluation_strategy: Union = None push_to_hub_model_id: Optional = None push_to_hub_organization: Optional = None push_to_hub_token: Optional = None mp_parameters: str = '' auto_find_batch_size: bool = False full_determinism: bool = False torchdynamo: Optional = None ray_scope: Optional = 'last' ddp_timeout: Optional = 1800 torch_compile: bool = False torch_compile_backend: Optional = None torch_compile_mode: Optional = None dispatch_batches: Optional = None split_batches: Optional = None include_tokens_per_second: Optional = False include_num_input_tokens_seen: Optional = False neftune_noise_alpha: Optional = None optim_target_modules: Union = None batch_eval_metrics: bool = False 
eval_on_start: bool = False use_liger_kernel: Optional = False eval_use_gather_object: Optional = False max_length: Optional = None max_prompt_length: Optional = None max_completion_length: Optional = None beta: float = 0.1 label_smoothing: float = 0.0 loss_type: Literal = 'sigmoid' disable_dropout: bool = True cpo_alpha: float = 1.0 simpo_gamma: float = 0.5 label_pad_token_id: int = -100 padding_value: Optional = None truncation_mode: str = 'keep_end' generate_during_eval: bool = False is_encoder_decoder: Optional = None model_init_kwargs: Optional = None dataset_num_proc: Optional = None )

Parameters

  • max_length (Optional[int], optional, defaults to None) — Maximum length of the sequences (prompt + completion) in the batch. This argument is required if you want to use the default data collator.
  • max_prompt_length (Optional[int], optional, defaults to None) — Maximum length of the prompt. This argument is required if you want to use the default data collator.
  • max_completion_length (Optional[int], optional, defaults to None) — Maximum length of the completion. This argument is required if you want to use the default data collator and your model is an encoder-decoder.
  • beta (float, optional, defaults to 0.1) — Parameter controlling the deviation from the reference model. Higher β means less deviation from the reference model. For the IPO loss (loss_type="ipo"), β is the regularization parameter denoted by τ in the paper.
  • label_smoothing (float, optional, defaults to 0.0) — Label smoothing factor. This argument is required if you want to use the default data collator.
  • loss_type (str, optional, defaults to "sigmoid") — Type of loss to use. Possible values are:

    • "sigmoid": sigmoid loss from the original DPO paper.
    • "hinge": hinge loss on the normalized likelihood from the SLiC paper.
    • "ipo": IPO loss from the IPO paper.
    • "simpo": SimPO loss from the SimPO paper.
  • disable_dropout (bool, optional, defaults to True) — Whether to disable dropout in the model.
  • cpo_alpha (float, optional, defaults to 1.0) — Weight of the BC regularizer in CPO training.
  • simpo_gamma (float, optional, defaults to 0.5) — Target reward margin for the SimPO loss, used only when the loss_type="simpo".
  • label_pad_token_id (int, optional, defaults to -100) — Label pad token id. This argument is required if you want to use the default data collator.
  • padding_value (Optional[int], optional, defaults to None) — Padding value to use. If None, the padding value of the tokenizer is used.
  • truncation_mode (str, optional, defaults to "keep_end") — Truncation mode to use when the prompt is too long. Possible values are "keep_end" or "keep_start". This argument is required if you want to use the default data collator.
  • generate_during_eval (bool, optional, defaults to False) — If True, generates and logs completions from the model to W&B during evaluation.
  • is_encoder_decoder (Optional[bool], optional, defaults to None) — When using the model_init argument (callable) to instantiate the model instead of the model argument, you need to specify if the model returned by the callable is an encoder-decoder model.
  • model_init_kwargs (Optional[Dict[str, Any]], optional, defaults to None) — Keyword arguments to pass to AutoModelForCausalLM.from_pretrained when instantiating the model from a string.
  • dataset_num_proc (Optional[int], optional, defaults to None) — Number of processes to use for processing the dataset.

Configuration class for the CPOTrainer.

Using HfArgumentParser we can turn this class into argparse arguments that can be specified on the command line.

CPOTrainer

class trl.CPOTrainer

( model: Union = None args: Optional = None data_collator: Optional = None train_dataset: Optional = None eval_dataset: Union = None tokenizer: Optional = None model_init: Optional = None callbacks: Optional = None optimizers: Tuple = (None, None) preprocess_logits_for_metrics: Optional = None peft_config: Optional = None compute_metrics: Optional = None )

Parameters

  • model (transformers.PreTrainedModel) — The model to train, preferably an AutoModelForCausalLM.
  • args (CPOConfig) — The CPO config arguments to use for training.
  • data_collator (transformers.DataCollator) — The data collator to use for training. If None is specified, the default data collator (DPODataCollatorWithPadding) will be used which will pad the sequences to the maximum length of the sequences in the batch, given a dataset of paired sequences.
  • train_dataset (datasets.Dataset) — The dataset to use for training.
  • eval_dataset (datasets.Dataset) — The dataset to use for evaluation.
  • tokenizer (transformers.PreTrainedTokenizerBase) — The tokenizer to use for training. This argument is required if you want to use the default data collator.
  • model_init (Callable[[], transformers.PreTrainedModel]) — The model initializer to use for training. If None is specified, the default model initializer will be used.
  • callbacks (List[transformers.TrainerCallback]) — The callbacks to use for training.
  • optimizers (Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]) — The optimizer and scheduler to use for training.
  • preprocess_logits_for_metrics (Callable[[torch.Tensor, torch.Tensor], torch.Tensor]) — The function to use to preprocess the logits before computing the metrics.
  • peft_config (Dict, defaults to None) — The PEFT configuration to use for training. If you pass a PEFT configuration, the model will be wrapped in a PEFT model.
  • compute_metrics (Callable[[EvalPrediction], Dict], optional) — The function to use to compute the metrics. Must take an EvalPrediction and return a dictionary mapping strings to metric values.

Initialize CPOTrainer.

build_tokenized_answer

( prompt answer )

The Llama tokenizer does not satisfy enc(a + b) = enc(a) + enc(b). It does ensure enc(a + b) = enc(a) + enc(a + b)[len(enc(a)):]. Reference: https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257
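The slicing identity enc(a + b) = enc(a) + enc(a + b)[len(enc(a)):] can be illustrated with a toy encoder. The sketch below is not the Llama tokenizer: `enc` is a hypothetical character-level encoder that adds a start-of-text marker, chosen only because it makes naive concatenation differ from encoding the joined string.

```python
from typing import List


def enc(text: str) -> List[str]:
    """Toy encoder (an assumption for illustration): one token per
    character, preceded by a start-of-text marker."""
    return ["<s>"] + list(text)


prompt, answer = "Q: 2+2?", " A: 4"

# Encoding the two pieces separately inserts a second marker.
naive = enc(prompt) + enc(answer)

# Encoding the joined string once, then slicing off the prompt tokens,
# recovers answer tokens that are consistent with the full encoding.
full = enc(prompt + answer)
prompt_ids = enc(prompt)
answer_ids = full[len(prompt_ids):]

print(naive == full)                     # naive concatenation differs
print(prompt_ids + answer_ids == full)   # the slicing identity holds
```

For this toy encoder, the answer tokens are obtained by encoding prompt + answer together and slicing, rather than by encoding the answer on its own.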

concatenated_forward

( model: Module batch: Dict )

Run the given model on the given batch of inputs, concatenating the chosen and rejected inputs together.

We do this to avoid doing two forward passes, because it’s faster for FSDP.

concatenated_inputs

( batch: Dict is_encoder_decoder: bool = False label_pad_token_id: int = -100 padding_value: int = 0 device: Optional = None )

Concatenate the chosen and rejected inputs into a single tensor.
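As a rough illustration of the concatenation, the sketch below pads chosen and rejected sequences to a common length and stacks them along the batch dimension. It uses plain Python lists instead of tensors, and the key names (`chosen_input_ids`, `concatenated_input_ids`, and so on) are assumptions made for this example rather than a statement of the library's exact batch layout.

```python
from typing import Dict, List

Rows = List[List[int]]


def pad_to_length(rows: Rows, length: int, pad_value: int) -> Rows:
    """Right-pad every row with pad_value up to the given length."""
    return [row + [pad_value] * (length - len(row)) for row in rows]


def concatenated_inputs_sketch(
    batch: Dict[str, Rows],
    label_pad_token_id: int = -100,
    padding_value: int = 0,
) -> Dict[str, Rows]:
    out = {}
    for name in ("input_ids", "labels", "attention_mask"):
        chosen = batch[f"chosen_{name}"]
        rejected = batch[f"rejected_{name}"]
        max_len = max(len(row) for row in chosen + rejected)
        # Labels are padded with the ignore index, masks with 0,
        # and token ids with the tokenizer padding value.
        if name == "labels":
            pad = label_pad_token_id
        elif name == "attention_mask":
            pad = 0
        else:
            pad = padding_value
        # Chosen rows come first, rejected rows second.
        out[f"concatenated_{name}"] = (
            pad_to_length(chosen, max_len, pad)
            + pad_to_length(rejected, max_len, pad)
        )
    return out


batch = {
    "chosen_input_ids": [[1, 2, 3]],
    "rejected_input_ids": [[1, 2, 3, 4, 5]],
    "chosen_labels": [[-100, 2, 3]],
    "rejected_labels": [[-100, 2, 3, 4, 5]],
    "chosen_attention_mask": [[1, 1, 1]],
    "rejected_attention_mask": [[1, 1, 1, 1, 1]],
}
out = concatenated_inputs_sketch(batch)
```

With this layout a single forward pass covers both halves, and the outputs can be split back at the original batch size.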

cpo_loss

( policy_chosen_logps: FloatTensor policy_rejected_logps: FloatTensor ) → A tuple of three tensors

Returns

A tuple of three tensors

(losses, chosen_rewards, rejected_rewards). The losses tensor contains the CPO loss for each example in the batch. The chosen_rewards and rejected_rewards tensors contain the rewards for the chosen and rejected responses, respectively.

Compute the CPO loss for a batch of policy log probabilities.
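A minimal sketch of the "sigmoid" variant of this loss, written with plain floats rather than tensors. It assumes the loss is a log-sigmoid of the β-scaled margin between chosen and rejected log probabilities, with optional label smoothing, and that rewards are the β-scaled log probabilities; the function name and the exact formula are assumptions for illustration, and the BC regularizer weighted by cpo_alpha is not included.

```python
import math
from typing import List, Tuple


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def cpo_sigmoid_loss(
    policy_chosen_logps: List[float],
    policy_rejected_logps: List[float],
    beta: float = 0.1,
    label_smoothing: float = 0.0,
) -> Tuple[List[float], List[float], List[float]]:
    losses, chosen_rewards, rejected_rewards = [], [], []
    for chosen, rejected in zip(policy_chosen_logps, policy_rejected_logps):
        margin = chosen - rejected
        # Preference term: push the chosen log probability above the rejected one.
        loss = (
            -(1.0 - label_smoothing) * log_sigmoid(beta * margin)
            - label_smoothing * log_sigmoid(-beta * margin)
        )
        losses.append(loss)
        chosen_rewards.append(beta * chosen)
        rejected_rewards.append(beta * rejected)
    return losses, chosen_rewards, rejected_rewards


losses, chosen_rewards, rejected_rewards = cpo_sigmoid_loss(
    [-1.0, -2.0], [-3.0, -2.0], beta=0.1
)
```

In the example, the second pair has equal log probabilities, so its loss is log 2; the first pair already prefers the chosen response, so its loss is lower.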

evaluation_loop

( dataloader: DataLoader description: str prediction_loss_only: Optional = None ignore_keys: Optional = None metric_key_prefix: str = 'eval' )

Overriding built-in evaluation loop to store metrics for each batch. Prediction/evaluation loop, shared by Trainer.evaluate() and Trainer.predict().

Works both with or without labels.

get_batch_logps

( logits: FloatTensor labels: LongTensor average_log_prob: bool = False label_pad_token_id: int = -100 is_encoder_decoder: bool = False )

Compute the log probabilities of the given labels under the given logits.
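The computation can be sketched without tensors: take a log-softmax over the vocabulary at each position, pick the entry for the label, and skip positions whose label equals label_pad_token_id. This sketch assumes logits and labels are already aligned position by position (any label shifting for decoder-only models is left out), and the function name is hypothetical.

```python
import math
from typing import List


def log_softmax(row: List[float]) -> List[float]:
    """Log-softmax over one vocabulary-sized row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def batch_logps_sketch(
    logits: List[List[List[float]]],  # [batch][seq][vocab]
    labels: List[List[int]],          # [batch][seq]
    average_log_prob: bool = False,
    label_pad_token_id: int = -100,
) -> List[float]:
    results = []
    for seq_logits, seq_labels in zip(logits, labels):
        total, count = 0.0, 0
        for token_logits, label in zip(seq_logits, seq_labels):
            if label == label_pad_token_id:
                continue  # masked position (for example, a prompt token)
            total += log_softmax(token_logits)[label]
            count += 1
        if average_log_prob and count:
            results.append(total / count)
        else:
            results.append(total)
    return results


# Uniform logits over a vocabulary of 4: every token has probability 1/4.
logits = [[[0.0, 0.0, 0.0, 0.0]] * 3]
labels = [[-100, 1, 2]]
summed = batch_logps_sketch(logits, labels)
averaged = batch_logps_sketch(logits, labels, average_log_prob=True)
```

Here `summed` covers the two unmasked tokens, while `averaged` divides that sum by the number of unmasked tokens.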

get_batch_loss_metrics

( model batch: Dict train_eval: Literal = 'train' )

Compute the CPO loss and other metrics for the given batch of inputs for train or test.

get_batch_samples

( model batch: Dict )

Generate samples from the model and reference model for the given batch of inputs.

log

( logs: Dict )

Parameters

  • logs (Dict[str, float]) — The values to log.

Log logs on the various objects watching training, including stored metrics.

tokenize_row

( feature model: Union = None )

Tokenize a single row from a CPO specific dataset.

At this stage, we don’t convert to PyTorch tensors yet; we just handle the truncation in case the prompt + chosen or prompt + rejected responses is/are too long. First we truncate the prompt; if we’re still too long, we truncate the chosen/rejected.

We also create the labels for the chosen/rejected responses, which are of length equal to the sum of the length of the prompt and the chosen/rejected response, with label_pad_token_id for the prompt tokens.
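The truncation order and label construction described above can be sketched for a single prompt/response pair. The helper below is hypothetical and works on lists of token ids; it assumes the prompt is cut to max_prompt_length first (keeping the start or the end, depending on truncation_mode) and the response is cut only if the pair is still too long.

```python
from typing import List, Tuple


def truncate_and_label(
    prompt_ids: List[int],
    answer_ids: List[int],
    max_length: int,
    max_prompt_length: int,
    truncation_mode: str = "keep_end",
    label_pad_token_id: int = -100,
) -> Tuple[List[int], List[int]]:
    prompt_ids, answer_ids = list(prompt_ids), list(answer_ids)

    # Step 1: if the pair is too long, truncate the prompt.
    if len(prompt_ids) + len(answer_ids) > max_length:
        if truncation_mode == "keep_start":
            prompt_ids = prompt_ids[:max_prompt_length]
        elif truncation_mode == "keep_end":
            prompt_ids = prompt_ids[-max_prompt_length:]
        else:
            raise ValueError(f"Unknown truncation mode: {truncation_mode}")

    # Step 2: if it is still too long, truncate the response.
    if len(prompt_ids) + len(answer_ids) > max_length:
        answer_ids = answer_ids[: max_length - max_prompt_length]

    # Labels have the same length as the inputs; prompt positions are masked.
    input_ids = prompt_ids + answer_ids
    labels = [label_pad_token_id] * len(prompt_ids) + answer_ids
    return input_ids, labels


input_ids, labels = truncate_and_label(
    prompt_ids=[1, 2, 3, 4, 5, 6],
    answer_ids=[7, 8, 9, 10, 11, 12],
    max_length=8,
    max_prompt_length=4,
)
```

In the example, the prompt is first reduced to its last four tokens, and the response is then cut to four tokens so that the total fits max_length.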

DDPOConfig

class trl.DDPOConfig

( exp_name: str = 'doc-buil' run_name: str = '' seed: int = 0 log_with: Optional = None tracker_kwargs: dict = <factory> accelerator_kwargs: dict = <factory> project_kwargs: dict = <factory> tracker_project_name: str = 'trl' logdir: str = 'logs' num_epochs: int = 100 save_freq: int = 1 num_checkpoint_limit: int = 5 mixed_precision: str = 'fp16' allow_tf32: bool = True resume_from: str = '' sample_num_steps: int = 50 sample_eta: float = 1.0 sample_guidance_scale: float = 5.0 sample_batch_size: int = 1 sample_num_batches_per_epoch: int = 2 train_batch_size: int = 1 train_use_8bit_adam: bool = False train_learning_rate: float = 0.0003 train_adam_beta1: float = 0.9 train_adam_beta2: float = 0.999 train_adam_weight_decay: float = 0.0001 train_adam_epsilon: float = 1e-08 train_gradient_accumulation_steps: int = 1 train_max_grad_norm: float = 1.0 train_num_inner_epochs: int = 1 train_cfg: bool = True train_adv_clip_max: float = 5.0 train_clip_range: float = 0.0001 train_timestep_fraction: float = 1.0 per_prompt_stat_tracking: bool = False per_prompt_stat_tracking_buffer_size: int = 16 per_prompt_stat_tracking_min_count: int = 16 async_reward_computation: bool = False max_workers: int = 2 negative_prompts: str = '' )

Parameters

  • exp_name (str, optional, defaults to os.path.basename(sys.argv[0])[:-len(".py")]) — Name of this experiment (by default, the file name without the extension).
  • run_name (str, optional, defaults to "") — Name of this run.
  • seed (int, optional, defaults to 0) — Random seed.
  • log_with (Optional[Literal["wandb", "tensorboard"]], optional, defaults to None) — Log with either ‘wandb’ or ‘tensorboard’, check https://huggingface.co/docs/accelerate/usage_guides/tracking for more details.
  • tracker_kwargs (Dict, optional, defaults to {}) — Keyword arguments for the tracker (e.g. wandb_project).
  • accelerator_kwargs (Dict, optional, defaults to {}) — Keyword arguments for the accelerator.
  • project_kwargs (Dict, optional, defaults to {}) — Keyword arguments for the accelerator project config (e.g. logging_dir).
  • tracker_project_name (str, optional, defaults to "trl") — Name of project to use for tracking.
  • logdir (str, optional, defaults to "logs") — Top-level logging directory for checkpoint saving.
  • num_epochs (int, optional, defaults to 100) — Number of epochs to train.
  • save_freq (int, optional, defaults to 1) — Number of epochs between saving model checkpoints.
  • num_checkpoint_limit (int, optional, defaults to 5) — Number of checkpoints to keep before overwriting old ones.
  • mixed_precision (str, optional, defaults to "fp16") — Mixed precision training.
  • allow_tf32 (bool, optional, defaults to True) — Allow tf32 on Ampere GPUs.
  • resume_from (str, optional, defaults to "") — Resume training from a checkpoint.
  • sample_num_steps (int, optional, defaults to 50) — Number of sampler inference steps.
  • sample_eta (float, optional, defaults to 1.0) — Eta parameter for the DDIM sampler.
  • sample_guidance_scale (float, optional, defaults to 5.0) — Classifier-free guidance weight.
  • sample_batch_size (int, optional, defaults to 1) — Batch size (per GPU) to use for sampling.
  • sample_num_batches_per_epoch (int, optional, defaults to 2) — Number of batches to sample per epoch.
  • train_batch_size (int, optional, defaults to 1) — Batch size (per GPU) to use for training.
  • train_use_8bit_adam (bool, optional, defaults to False) — Use 8bit Adam optimizer from bitsandbytes.
  • train_learning_rate (float, optional, defaults to 3e-4) — Learning rate.
  • train_adam_beta1 (float, optional, defaults to 0.9) — Adam beta1.
  • train_adam_beta2 (float, optional, defaults to 0.999) — Adam beta2.
  • train_adam_weight_decay (float, optional, defaults to 1e-4) — Adam weight decay.
  • train_adam_epsilon (float, optional, defaults to 1e-8) — Adam epsilon.
  • train_gradient_accumulation_steps (int, optional, defaults to 1) — Number of gradient accumulation steps.
  • train_max_grad_norm (float, optional, defaults to 1.0) — Maximum gradient norm for gradient clipping.
  • train_num_inner_epochs (int, optional, defaults to 1) — Number of inner epochs per outer epoch.
  • train_cfg (bool, optional, defaults to True) — Whether or not to use classifier-free guidance during training.
  • train_adv_clip_max (float, optional, defaults to 5.0) — Clip advantages to the range [-train_adv_clip_max, train_adv_clip_max].
  • train_clip_range (float, optional, defaults to 1e-4) — PPO clip range.
  • train_timestep_fraction (float, optional, defaults to 1.0) — Fraction of timesteps to train on.
  • per_prompt_stat_tracking (bool, optional, defaults to False) — Whether to track statistics for each prompt separately.
  • per_prompt_stat_tracking_buffer_size (int, optional, defaults to 16) — Number of reward values to store in the buffer for each prompt.
  • per_prompt_stat_tracking_min_count (int, optional, defaults to 16) — Minimum number of reward values to store in the buffer.
  • async_reward_computation (bool, optional, defaults to False) — Whether to compute rewards asynchronously.
  • max_workers (int, optional, defaults to 2) — Maximum number of workers to use for async reward computation.
  • negative_prompts (Optional[str], optional, defaults to "") — Comma-separated list of prompts to use as negative examples.

Configuration class for the DDPOTrainer.

Using HfArgumentParser we can turn this class into argparse arguments that can be specified on the command line.

DDPOTrainer

class trl.DDPOTrainer

( config: DDPOConfig reward_function: Callable prompt_function: Callable sd_pipeline: DDPOStableDiffusionPipeline image_samples_hook: Optional = None )

Parameters

  • config (DDPOConfig) — Configuration object for DDPOTrainer. Check the documentation of DDPOConfig for more details.
  • reward_function (Callable[[torch.Tensor, Tuple[str], Tuple[Any]], torch.Tensor]) — Reward function to be used.
  • prompt_function (Callable[[], Tuple[str, Any]]) — Function to generate prompts to guide the model.
  • sd_pipeline (DDPOStableDiffusionPipeline) — Stable Diffusion pipeline to be used for training.
  • image_samples_hook (Optional[Callable[[Any, Any, Any], Any]]) — Hook to be called to log images.

The DDPOTrainer uses Denoising Diffusion Policy Optimization (DDPO) to optimise diffusion models. This trainer is heavily inspired by the work at https://github.com/kvablack/ddpo-pytorch. As of now, only Stable Diffusion based pipelines are supported.

calculate_loss

( latents timesteps next_latents log_probs advantages embeds )

Parameters

  • latents (torch.Tensor) — The latents sampled from the diffusion model, shape: [batch_size, num_channels_latents, height, width]
  • timesteps (torch.Tensor) — The timesteps sampled from the diffusion model, shape: [batch_size]
  • next_latents (torch.Tensor) — The next latents sampled from the diffusion model, shape: [batch_size, num_channels_latents, height, width]
  • log_probs (torch.Tensor) — The log probabilities of the latents, shape: [batch_size]
  • advantages (torch.Tensor) — The advantages of the latents, shape: [batch_size]
  • embeds (torch.Tensor) — The embeddings of the prompts, shape: [2*batch_size or batch_size, …]. The leading dimension is 2*batch_size when train_cfg is True, because the negative prompt embeddings are then expected to be concatenated to the embeds.

Calculate the loss for a batch of an unpacked sample
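The sketch below illustrates a PPO-style clipped objective of the kind that the train_clip_range and train_adv_clip_max settings suggest. It is an assumption-laden illustration with plain floats, not the trainer's implementation: `new_log_probs` stands in for log probabilities recomputed under the current model, and the function name is hypothetical.

```python
import math
from typing import List


def clipped_policy_loss(
    new_log_probs: List[float],
    old_log_probs: List[float],
    advantages: List[float],
    clip_range: float = 1e-4,
    adv_clip_max: float = 5.0,
) -> float:
    per_sample = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Clip advantages to [-adv_clip_max, adv_clip_max].
        adv = max(-adv_clip_max, min(adv_clip_max, adv))
        # Importance ratio between the current and the sampling policy.
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
        # Pessimistic choice: the larger of the two losses.
        per_sample.append(max(-adv * ratio, -adv * clipped_ratio))
    return sum(per_sample) / len(per_sample)


# Unchanged policy: the ratio is 1, so the loss is minus the mean advantage.
same_policy = clipped_policy_loss([-1.0, -2.0], [-1.0, -2.0], [1.0, -1.0])

# A large ratio with a positive advantage is capped at 1 + clip_range.
capped = clipped_policy_loss([0.0], [-1.0], [1.0])
```

The clipping limits how far a single update can move the policy away from the one that generated the samples.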

create_model_card

( path: str model_name: Optional = 'TRL DDPO Model' )

Parameters

  • path (str) — The path to save the model card to.
  • model_name (str, optional) — The name of the model, defaults to TRL DDPO Model.

Creates and saves a model card for a TRL model.

step

( epoch: int global_step: int ) → global_step (int)

Parameters

  • epoch (int) — The current epoch.
  • global_step (int) — The current global step.

Returns

global_step (int)

The updated global step.

Perform a single step of training.

Side Effects:

  • Model weights are updated
  • Logs the statistics to the accelerator trackers.
  • If self.image_samples_callback is not None, it will be called with the prompt_image_pairs, global_step, and the accelerator tracker.

train

( epochs: Optional = None )

Train the model for a given number of epochs

DPOTrainer

class trl.DPOTrainer

( model: Union = None ref_model: Union = None beta: float = 0.1 label_smoothing: float = 0 loss_type: Optional = None args: Optional = None data_collator: Optional = None label_pad_token_id: int = -100 padding_value: Optional = None truncation_mode: str = 'keep_end' train_dataset: Optional = None eval_dataset: Union = None tokenizer: Optional = None model_init: Optional = None callbacks: Optional = None optimizers: Tuple = (None, None) preprocess_logits_for_metrics: Optional = None max_length: Optional = None max_prompt_length: Optional = None max_target_length: Optional = None peft_config: Optional = None is_encoder_decoder: Optional = None disable_dropout: bool = True generate_during_eval: bool = False compute_metrics: Optional = None precompute_ref_log_probs: bool = False dataset_num_proc: Optional = None model_init_kwargs: Optional = None ref_model_init_kwargs: Optional = None model_adapter_name: Optional = None ref_adapter_name: Optional = None reference_free: bool = False force_use_ref_model: bool = False )

Parameters

  • model (transformers.PreTrainedModel) — The model to train, preferably an AutoModelForCausalLM.
  • ref_model (PreTrainedModelWrapper) — Hugging Face transformer model with a causal language modelling head. Used for implicit reward computation and loss. If no reference model is provided, the trainer will create a reference model with the same architecture as the model to be optimized.
  • args (DPOConfig) — The DPO config arguments to use for training.
  • data_collator (transformers.DataCollator) — The data collator to use for training. If None is specified, the default data collator (DPODataCollatorWithPadding) will be used which will pad the sequences to the maximum length of the sequences in the batch, given a dataset of paired sequences.
  • train_dataset (datasets.Dataset) — The dataset to use for training.
  • eval_dataset (datasets.Dataset) — The dataset to use for evaluation.
  • tokenizer (transformers.PreTrainedTokenizerBase) — The tokenizer to use for training. This argument is required if you want to use the default data collator.
  • model_init (Callable[[], transformers.PreTrainedModel]) — The model initializer to use for training. If None is specified, the default model initializer will be used.
  • callbacks (List[transformers.TrainerCallback]) — The callbacks to use for training.
  • optimizers (Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]) — The optimizer and scheduler to use for training.
  • preprocess_logits_for_metrics (Callable[[torch.Tensor, torch.Tensor], torch.Tensor]) — The function to use to preprocess the logits before computing the metrics.
  • peft_config (Dict, defaults to None) — The PEFT configuration to use for training. If you pass a PEFT configuration, the model will be wrapped in a PEFT model.
  • compute_metrics (Callable[[EvalPrediction], Dict], optional) — The function to use to compute the metrics. Must take an EvalPrediction and return a dictionary mapping strings to metric values.

Initialize DPOTrainer.

compute_reference_log_probs

( padded_batch: Dict )

Computes log probabilities of the reference model for a single padded batch of a DPO specific dataset.

concatenated_forward

( model: Module batch: Dict )

Run the given model on the given batch of inputs, concatenating the chosen and rejected inputs together.

We do this to avoid doing two forward passes, because it’s faster for FSDP.

concatenated_inputs

( batch: Dict is_encoder_decoder: bool = False is_vision_model: bool = False label_pad_token_id: int = -100 padding_value: int = 0 device: Optional = None )

Concatenate the chosen and rejected inputs into a single tensor.

dpo_loss

( policy_chosen_logps: FloatTensor policy_rejected_logps: FloatTensor reference_chosen_logps: FloatTensor reference_rejected_logps: FloatTensor ) → A tuple of three tensors

Returns

A tuple of three tensors

(losses, chosen_rewards, rejected_rewards). The losses tensor contains the DPO loss for each example in the batch. The chosen_rewards and rejected_rewards tensors contain the rewards for the chosen and rejected responses, respectively.

Compute the DPO loss for a batch of policy and reference model log probabilities.
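A minimal sketch of the sigmoid DPO loss with plain floats. It assumes the standard formulation from the DPO paper, where the logit is the policy log-ratio minus the reference log-ratio, and rewards are the β-scaled differences between policy and reference log probabilities; the function name is hypothetical and other loss types are not covered.

```python
import math
from typing import List, Tuple


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_sigmoid_loss(
    policy_chosen_logps: List[float],
    policy_rejected_logps: List[float],
    reference_chosen_logps: List[float],
    reference_rejected_logps: List[float],
    beta: float = 0.1,
    label_smoothing: float = 0.0,
) -> Tuple[List[float], List[float], List[float]]:
    losses, chosen_rewards, rejected_rewards = [], [], []
    for pc, pr, rc, rr in zip(
        policy_chosen_logps,
        policy_rejected_logps,
        reference_chosen_logps,
        reference_rejected_logps,
    ):
        # How much more the policy prefers the chosen response than the reference does.
        logit = (pc - pr) - (rc - rr)
        loss = (
            -(1.0 - label_smoothing) * log_sigmoid(beta * logit)
            - label_smoothing * log_sigmoid(-beta * logit)
        )
        losses.append(loss)
        # Implicit rewards relative to the reference model.
        chosen_rewards.append(beta * (pc - rc))
        rejected_rewards.append(beta * (pr - rr))
    return losses, chosen_rewards, rejected_rewards


# First example: policy equals reference. Second: policy favours the chosen response more.
losses, chosen_rewards, rejected_rewards = dpo_sigmoid_loss(
    [-2.0, -1.0], [-3.0, -4.0], [-2.0, -2.0], [-3.0, -3.0], beta=0.1
)
```

When the policy matches the reference, the logit is zero, the loss is log 2, and both implicit rewards are zero.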

evaluation_loop

( dataloader: DataLoader description: str prediction_loss_only: Optional = None ignore_keys: Optional = None metric_key_prefix: str = 'eval' )

Overriding built-in evaluation loop to store metrics for each batch. Prediction/evaluation loop, shared by Trainer.evaluate() and Trainer.predict().

Works both with or without labels.

get_batch_logps

( logits: FloatTensor labels: LongTensor label_pad_token_id: int = -100 is_encoder_decoder: bool = False )

Compute the log probabilities of the given labels under the given logits.

get_batch_loss_metrics

( model batch: Dict train_eval: Literal = 'train' )

Compute the DPO loss and other metrics for the given batch of inputs for train or test.

get_batch_samples

( model batch: Dict )

Generate samples from the model and reference model for the given batch of inputs.

get_eval_dataloader

( eval_dataset: Optional = None )

Parameters

  • eval_dataset (torch.utils.data.Dataset, optional) — If provided, will override self.eval_dataset. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed. It must implement __len__.

Returns the evaluation torch.utils.data.DataLoader.

Overrides transformers.Trainer.get_eval_dataloader to precompute ref_log_probs.

get_train_dataloader

( )

Returns the training torch.utils.data.DataLoader.

Overrides transformers.Trainer.get_train_dataloader to precompute ref_log_probs.

log

( logs: Dict )

Parameters

  • logs (Dict[str, float]) — The values to log.

Log logs on the various objects watching training, including stored metrics.

null_ref_context

( )

Context manager for handling null reference model (that is, peft adapter manipulation).

IterativeSFTTrainer

class trl.IterativeSFTTrainer

( model: Optional = None args: Optional = None tokenizer: Optional = None optimizers: Tuple = (None, None) data_collator: Optional = None eval_dataset: Union = None max_length: Optional = None truncation_mode: Optional = 'keep_end' preprocess_logits_for_metrics: Optional = None compute_metrics: Optional = None optimize_device_cache: Optional = False )

Parameters

  • model (PreTrainedModel) — Model to be optimized, either an ‘AutoModelForCausalLM’ or an ‘AutoModelForSeq2SeqLM’. Check the documentation of PreTrainedModel for more details.
  • args (transformers.TrainingArguments) — The arguments to use for training.
  • tokenizer (PreTrainedTokenizerBase) — Tokenizer to be used for encoding the data. Check the documentation of transformers.PreTrainedTokenizer and transformers.PreTrainedTokenizerFast for more details.
  • optimizers (Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]) — The optimizer and scheduler to use for training.
  • data_collator (Union[DataCollatorForLanguageModeling, DataCollatorForSeq2Seq], optional) — Data collator to be used for training and passed along the dataloader.
  • eval_dataset (datasets.Dataset) — The dataset to use for evaluation.
  • max_length (int, defaults to None) — The maximum length of the input.
  • truncation_mode (str, defaults to keep_end) — The truncation mode to use, either keep_end or keep_start.
  • preprocess_logits_for_metrics (Callable[[torch.Tensor, torch.Tensor], torch.Tensor]) — The function to use to preprocess the logits before computing the metrics.
  • compute_metrics (Callable[[EvalPrediction], Dict], optional) — The function to use to compute the metrics. Must take an EvalPrediction and return a dictionary mapping strings to metric values.
  • optimize_device_cache (bool, optional, defaults to False) — Optimize CUDA cache for slightly more memory-efficient training.

The IterativeSFTTrainer can be used to finetune models with methods that require some steps between optimization steps.

step

( input_ids: Optional = None attention_mask: Optional = None labels: Optional = None texts: Optional = None texts_labels: Optional = None ) → dict[str, Any]

Parameters

  • input_ids (List[torch.LongTensor]) — List of tensors containing the input_ids (if not provided, text will be used)
  • attention_mask (List[torch.LongTensor], optional) — List of tensors containing the attention_mask
  • labels (List[torch.FloatTensor], optional) — List of tensors containing the labels (if set to None, will default to input_ids)
  • texts (List[str], optional) — List of strings containing the text input (if not provided, input_ids will directly be used)
  • texts_labels (List[str], optional) — List of strings containing the text labels (if set to None, will default to text)

Returns

dict[str, Any]

A summary of the training statistics

Run an optimisation step given a list of input_ids, attention_mask, and labels or a list of text and text_labels.

KTOConfig

class trl.KTOConfig

( output_dir: str overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False do_predict: bool = False eval_strategy: Union = 'no' prediction_loss_only: bool = False per_device_train_batch_size: int = 8 per_device_eval_batch_size: int = 8 per_gpu_train_batch_size: Optional = None per_gpu_eval_batch_size: Optional = None gradient_accumulation_steps: int = 1 eval_accumulation_steps: Optional = None eval_delay: Optional = 0 torch_empty_cache_steps: Optional = None learning_rate: float = 5e-07 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: float = 1e-08 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: Union = 'linear' lr_scheduler_kwargs: Union = <factory> warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: Optional = 'passive' log_level_replica: Optional = 'warning' log_on_each_node: bool = True logging_dir: Optional = None logging_strategy: Union = 'steps' logging_first_step: bool = False logging_steps: float = 500 logging_nan_inf_filter: bool = True save_strategy: Union = 'steps' save_steps: float = 500 save_total_limit: Optional = None save_safetensors: Optional = True save_on_each_node: bool = False save_only_model: bool = False restore_callback_states_from_checkpoint: bool = False no_cuda: bool = False use_cpu: bool = False use_mps_device: bool = False seed: int = 42 data_seed: Optional = None jit_mode_eval: bool = False use_ipex: bool = False bf16: bool = False fp16: bool = False fp16_opt_level: str = 'O1' half_precision_backend: str = 'auto' bf16_full_eval: bool = False fp16_full_eval: bool = False tf32: Optional = None local_rank: int = -1 ddp_backend: Optional = None tpu_num_cores: Optional = None tpu_metrics_debug: bool = False debug: Union = '' dataloader_drop_last: bool = False eval_steps: Optional = None dataloader_num_workers: int = 0 dataloader_prefetch_factor: Optional = None past_index: int = -1 run_name: Optional = None disable_tqdm: 
Optional = None remove_unused_columns: Optional = True label_names: Optional = None load_best_model_at_end: Optional = False metric_for_best_model: Optional = None greater_is_better: Optional = None ignore_data_skip: bool = False fsdp: Union = '' fsdp_min_num_params: int = 0 fsdp_config: Union = None fsdp_transformer_layer_cls_to_wrap: Optional = None accelerator_config: Union = None deepspeed: Union = None label_smoothing_factor: float = 0.0 optim: Union = 'adamw_torch' optim_args: Optional = None adafactor: bool = False group_by_length: bool = False length_column_name: Optional = 'length' report_to: Union = None ddp_find_unused_parameters: Optional = None ddp_bucket_cap_mb: Optional = None ddp_broadcast_buffers: Optional = None dataloader_pin_memory: bool = True dataloader_persistent_workers: bool = False skip_memory_metrics: bool = True use_legacy_prediction_loop: bool = False push_to_hub: bool = False resume_from_checkpoint: Optional = None hub_model_id: Optional = None hub_strategy: Union = 'every_save' hub_token: Optional = None hub_private_repo: bool = False hub_always_push: bool = False gradient_checkpointing: bool = False gradient_checkpointing_kwargs: Union = None include_inputs_for_metrics: bool = False eval_do_concat_batches: bool = True fp16_backend: str = 'auto' evaluation_strategy: Union = None push_to_hub_model_id: Optional = None push_to_hub_organization: Optional = None push_to_hub_token: Optional = None mp_parameters: str = '' auto_find_batch_size: bool = False full_determinism: bool = False torchdynamo: Optional = None ray_scope: Optional = 'last' ddp_timeout: Optional = 1800 torch_compile: bool = False torch_compile_backend: Optional = None torch_compile_mode: Optional = None dispatch_batches: Optional = None split_batches: Optional = None include_tokens_per_second: Optional = False include_num_input_tokens_seen: Optional = False neftune_noise_alpha: Optional = None optim_target_modules: Union = None batch_eval_metrics: bool = False 
eval_on_start: bool = False use_liger_kernel: Optional = False eval_use_gather_object: Optional = False max_length: Optional = None max_prompt_length: Optional = None max_completion_length: Optional = None beta: float = 0.1 loss_type: Literal = 'kto' desirable_weight: float = 1.0 undesirable_weight: float = 1.0 label_pad_token_id: int = -100 padding_value: Optional = None truncation_mode: str = 'keep_end' generate_during_eval: bool = False is_encoder_decoder: Optional = None precompute_ref_log_probs: bool = False model_init_kwargs: Optional = None ref_model_init_kwargs: Optional = None dataset_num_proc: Optional = None )

Parameters

  • learning_rate (float, optional, defaults to 5e-7) — Initial learning rate for AdamW optimizer. The default value replaces that of TrainingArguments.
  • max_length (Optional[int], optional, defaults to None) — Maximum length of the sequences (prompt + completion) in the batch. This argument is required if you want to use the default data collator.
  • max_prompt_length (Optional[int], optional, defaults to None) — Maximum length of the prompt. This argument is required if you want to use the default data collator.
  • max_completion_length (Optional[int], optional, defaults to None) — Maximum length of the completion. This argument is required if you want to use the default data collator and your model is an encoder-decoder.
  • beta (float, optional, defaults to 0.1) — Parameter controlling the deviation from the reference model. Higher β means less deviation from the reference model.
  • loss_type (str, optional, defaults to "kto") — Type of loss to use. Possible values are:

    • "kto": KTO loss from the KTO paper.
    • "apo_zero_unpaired": Unpaired variant of APO-zero loss from the APO paper.
  • desirable_weight (float, optional, defaults to 1.0) — Desirable losses are weighted by this factor to counter an unequal number of desirable and undesirable pairs.
  • undesirable_weight (float, optional, defaults to 1.0) — Undesirable losses are weighted by this factor to counter an unequal number of desirable and undesirable pairs.
  • label_pad_token_id (int, optional, defaults to -100) — Label pad token id. This argument is required if you want to use the default data collator.
  • padding_value (Optional[int], optional, defaults to None) — Padding value to use. If None, the padding value of the tokenizer is used.
  • truncation_mode (str, optional, defaults to "keep_end") — Truncation mode to use when the prompt is too long. Possible values are "keep_end" or "keep_start". This argument is required if you want to use the default data collator.
  • generate_during_eval (bool, optional, defaults to False) — If True, generates and logs completions from both the model and the reference model to W&B during evaluation.
  • is_encoder_decoder (Optional[bool], optional, defaults to None) — When using the model_init argument (callable) to instantiate the model instead of the model argument, you need to specify if the model returned by the callable is an encoder-decoder model.
  • precompute_ref_log_probs (bool, optional, defaults to False) — Whether to precompute reference model log probabilities for training and evaluation datasets. This is useful when training without the reference model to reduce the total GPU memory needed.
  • model_init_kwargs (Optional[Dict[str, Any]], optional, defaults to None) — Keyword arguments to pass to AutoModelForCausalLM.from_pretrained when instantiating the model from a string.
  • ref_model_init_kwargs (Optional[Dict[str, Any]], optional, defaults to None) — Keyword arguments to pass to AutoModelForCausalLM.from_pretrained when instantiating the reference model from a string.
  • dataset_num_proc (Optional[int], optional, defaults to None) — Number of processes to use for processing the dataset.

Configuration class for the KTOTrainer.

Using HfArgumentParser we can turn this class into argparse arguments that can be specified on the command line.

KTOTrainer

class trl.KTOTrainer

< >

( model: Union = None ref_model: Union = None args: KTOConfig = None train_dataset: Optional = None eval_dataset: Union = None tokenizer: Optional = None data_collator: Optional = None model_init: Optional = None callbacks: Optional = None optimizers: Tuple = (None, None) preprocess_logits_for_metrics: Optional = None peft_config: Optional = None compute_metrics: Optional = None model_adapter_name: Optional = None ref_adapter_name: Optional = None )

Parameters

  • model (transformers.PreTrainedModel) — The model to train, preferably an AutoModelForCausalLM.
  • ref_model (PreTrainedModelWrapper) — Hugging Face transformer model with a causal language modelling head. Used for implicit reward computation and loss. If no reference model is provided, the trainer will create a reference model with the same architecture as the model to be optimized.
  • args (KTOConfig) — The arguments to use for training.
  • train_dataset (datasets.Dataset) — The dataset to use for training.
  • eval_dataset (datasets.Dataset) — The dataset to use for evaluation.
  • tokenizer (transformers.PreTrainedTokenizerBase) — The tokenizer to use for training. This argument is required if you want to use the default data collator.
  • data_collator (transformers.DataCollator, optional, defaults to None) — The data collator to use for training. If None is specified, the default data collator (DPODataCollatorWithPadding) will be used which will pad the sequences to the maximum length of the sequences in the batch, given a dataset of paired sequences.
  • model_init (Callable[[], transformers.PreTrainedModel]) — The model initializer to use for training. If None is specified, the default model initializer will be used.
  • callbacks (List[transformers.TrainerCallback]) — The callbacks to use for training.
  • optimizers (Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]) — The optimizer and scheduler to use for training.
  • preprocess_logits_for_metrics (Callable[[torch.Tensor, torch.Tensor], torch.Tensor]) — The function to use to preprocess the logits before computing the metrics.
  • peft_config (Dict, defaults to None) — The PEFT configuration to use for training. If you pass a PEFT configuration, the model will be wrapped in a PEFT model.
  • disable_dropout (bool, defaults to True) — Whether or not to disable dropouts in model and ref_model.
  • compute_metrics (Callable[[EvalPrediction], Dict], optional) — The function to use to compute the metrics. Must take an EvalPrediction and return a dictionary mapping strings to metric values.
  • model_adapter_name (str, defaults to None) — Name of the train target PEFT adapter, when using LoRA with multiple adapters.
  • ref_adapter_name (str, defaults to None) — Name of the reference PEFT adapter, when using LoRA with multiple adapters.

Initialize KTOTrainer.

compute_reference_log_probs

< >

( padded_batch: Dict )

Computes log probabilities of the reference model for a single padded batch of a KTO specific dataset.

evaluation_loop

< >

( dataloader: DataLoader description: str prediction_loss_only: Optional = None ignore_keys: Optional = None metric_key_prefix: str = 'eval' )

Overriding built-in evaluation loop to store metrics for each batch. Prediction/evaluation loop, shared by Trainer.evaluate() and Trainer.predict().

Works both with and without labels.

get_batch_logps

< >

( logits: FloatTensor labels: LongTensor average_log_prob: bool = False label_pad_token_id: int = -100 is_encoder_decoder: bool = False )

Compute the log probabilities of the given labels under the given logits.
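
As a rough illustration of the computation described above, the following pure-Python sketch sums (or averages) the per-token log probabilities of the labels and skips every position equal to label_pad_token_id. It is not TRL's implementation: the helper name sequence_logps is hypothetical, it works on plain lists for a single sequence rather than on batched tensors, and it assumes that logits and labels are already aligned position by position.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of vocabulary scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def sequence_logps(logits, labels, average_log_prob=False, label_pad_token_id=-100):
    """Hypothetical helper: log probability of `labels` under `logits` for one sequence.

    logits: list of per-position lists of vocabulary scores
    labels: list of token ids; positions equal to label_pad_token_id are ignored
    """
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == label_pad_token_id:
            continue  # masked position, e.g. prompt or padding
        total += log_softmax(row)[label]
        count += 1
    if average_log_prob and count > 0:
        return total / count  # mean log probability per unmasked token
    return total  # summed log probability of the unmasked tokens


# Three positions, vocabulary of size two; the first position is masked out.
example_logits = [[0.0, 0.0], [0.0, 0.0], [2.0, 0.0]]
example_labels = [-100, 1, 0]
print(sequence_logps(example_logits, example_labels))
```

With average_log_prob=True the same call returns the mean over the two unmasked positions instead of their sum.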

get_batch_loss_metrics

< >

( model batch: Dict )

Compute the KTO loss and other metrics for the given batch of inputs for train or test.

get_batch_samples

< >

( model batch: Dict )

Generate samples from the model and reference model for the given batch of inputs.

get_eval_dataloader

< >

( eval_dataset: Optional = None )

Parameters

  • eval_dataset (torch.utils.data.Dataset, optional) — If provided, will override self.eval_dataset. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed. It must implement __len__.

Returns the evaluation torch.utils.data.DataLoader.

Overrides transformers.Trainer.get_eval_dataloader to precompute ref_log_probs.

get_train_dataloader

< >

( )

Returns the training torch.utils.data.DataLoader.

Overrides transformers.Trainer.get_train_dataloader to precompute ref_log_probs.

kto_loss

< >

( policy_chosen_logps: FloatTensor policy_rejected_logps: FloatTensor policy_KL_logps: FloatTensor reference_chosen_logps: FloatTensor reference_rejected_logps: FloatTensor reference_KL_logps: FloatTensor ) A tuple of four tensors

Returns

A tuple of four tensors

(losses, chosen_rewards, rejected_rewards, KL). The losses tensor contains the KTO loss for each example in the batch. The chosen_rewards and rejected_rewards tensors contain the rewards for the chosen and rejected responses, respectively. The KL tensor contains the detached KL divergence estimate between the policy and reference models.

Compute the KTO loss for a batch of policy and reference model log probabilities.
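
The sketch below shows one way the four returned values could be computed from the six inputs. It is an illustration under stated assumptions, not the trainer's code: the function name kto_loss_sketch is hypothetical, plain lists of floats stand in for tensors, the KL estimate is taken as the batch mean of the policy-minus-reference log probabilities clamped at zero, and the loss form 1 - sigmoid(beta * (...)) follows the KTO paper's formulation as understood here. The defaults for beta, desirable_weight and undesirable_weight mirror the KTOConfig values listed above.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def kto_loss_sketch(policy_chosen_logps, policy_rejected_logps, policy_KL_logps,
                    reference_chosen_logps, reference_rejected_logps, reference_KL_logps,
                    beta=0.1, desirable_weight=1.0, undesirable_weight=1.0):
    """Hypothetical list-based version; every positional argument is a list of floats."""
    # Assumed KL estimate: batch mean of policy-minus-reference log probs, clamped at zero.
    kl_terms = [p - r for p, r in zip(policy_KL_logps, reference_KL_logps)]
    kl = max(sum(kl_terms) / len(kl_terms), 0.0) if kl_terms else 0.0

    chosen_logratios = [p - r for p, r in zip(policy_chosen_logps, reference_chosen_logps)]
    rejected_logratios = [p - r for p, r in zip(policy_rejected_logps, reference_rejected_logps)]

    # Desirable examples are pushed above the KL baseline, undesirable ones below it.
    chosen_losses = [desirable_weight * (1.0 - sigmoid(beta * (lr - kl)))
                     for lr in chosen_logratios]
    rejected_losses = [undesirable_weight * (1.0 - sigmoid(beta * (kl - lr)))
                       for lr in rejected_logratios]

    # Implicit rewards are the scaled log ratios.
    chosen_rewards = [beta * lr for lr in chosen_logratios]
    rejected_rewards = [beta * lr for lr in rejected_logratios]
    return chosen_losses + rejected_losses, chosen_rewards, rejected_rewards, kl


losses, chosen_rewards, rejected_rewards, kl = kto_loss_sketch(
    [1.0], [-1.0], [-2.0], [0.0], [0.0], [0.0]
)
print(losses, chosen_rewards, rejected_rewards, kl)
```

When the policy and reference agree on every example, each loss in this sketch equals 0.5 times its weight, which makes a convenient sanity check.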

log

< >

( logs: Dict )

Parameters

  • logs (Dict[str, float]) — The values to log.

Log logs on the various objects watching training, including stored metrics.

null_ref_context

< >

( )

Context manager for handling null reference model (that is, peft adapter manipulation).

ORPOConfig

class trl.ORPOConfig

< >

( output_dir: str overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False do_predict: bool = False eval_strategy: Union = 'no' prediction_loss_only: bool = False per_device_train_batch_size: int = 8 per_device_eval_batch_size: int = 8 per_gpu_train_batch_size: Optional = None per_gpu_eval_batch_size: Optional = None gradient_accumulation_steps: int = 1 eval_accumulation_steps: Optional = None eval_delay: Optional = 0 torch_empty_cache_steps: Optional = None learning_rate: float = 5e-05 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: float = 1e-08 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: Union = 'linear' lr_scheduler_kwargs: Union = <factory> warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: Optional = 'passive' log_level_replica: Optional = 'warning' log_on_each_node: bool = True logging_dir: Optional = None logging_strategy: Union = 'steps' logging_first_step: bool = False logging_steps: float = 500 logging_nan_inf_filter: bool = True save_strategy: Union = 'steps' save_steps: float = 500 save_total_limit: Optional = None save_safetensors: Optional = True save_on_each_node: bool = False save_only_model: bool = False restore_callback_states_from_checkpoint: bool = False no_cuda: bool = False use_cpu: bool = False use_mps_device: bool = False seed: int = 42 data_seed: Optional = None jit_mode_eval: bool = False use_ipex: bool = False bf16: bool = False fp16: bool = False fp16_opt_level: str = 'O1' half_precision_backend: str = 'auto' bf16_full_eval: bool = False fp16_full_eval: bool = False tf32: Optional = None local_rank: int = -1 ddp_backend: Optional = None tpu_num_cores: Optional = None tpu_metrics_debug: bool = False debug: Union = '' dataloader_drop_last: bool = False eval_steps: Optional = None dataloader_num_workers: int = 0 dataloader_prefetch_factor: Optional = None past_index: int = -1 run_name: Optional = None disable_tqdm: 
Optional = None remove_unused_columns: Optional = True label_names: Optional = None load_best_model_at_end: Optional = False metric_for_best_model: Optional = None greater_is_better: Optional = None ignore_data_skip: bool = False fsdp: Union = '' fsdp_min_num_params: int = 0 fsdp_config: Union = None fsdp_transformer_layer_cls_to_wrap: Optional = None accelerator_config: Union = None deepspeed: Union = None label_smoothing_factor: float = 0.0 optim: Union = 'adamw_torch' optim_args: Optional = None adafactor: bool = False group_by_length: bool = False length_column_name: Optional = 'length' report_to: Union = None ddp_find_unused_parameters: Optional = None ddp_bucket_cap_mb: Optional = None ddp_broadcast_buffers: Optional = None dataloader_pin_memory: bool = True dataloader_persistent_workers: bool = False skip_memory_metrics: bool = True use_legacy_prediction_loop: bool = False push_to_hub: bool = False resume_from_checkpoint: Optional = None hub_model_id: Optional = None hub_strategy: Union = 'every_save' hub_token: Optional = None hub_private_repo: bool = False hub_always_push: bool = False gradient_checkpointing: bool = False gradient_checkpointing_kwargs: Union = None include_inputs_for_metrics: bool = False eval_do_concat_batches: bool = True fp16_backend: str = 'auto' evaluation_strategy: Union = None push_to_hub_model_id: Optional = None push_to_hub_organization: Optional = None push_to_hub_token: Optional = None mp_parameters: str = '' auto_find_batch_size: bool = False full_determinism: bool = False torchdynamo: Optional = None ray_scope: Optional = 'last' ddp_timeout: Optional = 1800 torch_compile: bool = False torch_compile_backend: Optional = None torch_compile_mode: Optional = None dispatch_batches: Optional = None split_batches: Optional = None include_tokens_per_second: Optional = False include_num_input_tokens_seen: Optional = False neftune_noise_alpha: Optional = None optim_target_modules: Union = None batch_eval_metrics: bool = False 
eval_on_start: bool = False use_liger_kernel: Optional = False eval_use_gather_object: Optional = False max_length: Optional = None max_prompt_length: Optional = None max_completion_length: Optional = None beta: float = 0.1 disable_dropout: bool = True label_pad_token_id: int = -100 padding_value: Optional = None truncation_mode: str = 'keep_end' generate_during_eval: bool = False is_encoder_decoder: Optional = None model_init_kwargs: Optional = None dataset_num_proc: Optional = None )

Parameters

  • max_length (Optional[int], optional, defaults to None) — Maximum length of the sequences (prompt + completion) in the batch. This argument is required if you want to use the default data collator.
  • max_prompt_length (Optional[int], optional, defaults to None) — Maximum length of the prompt. This argument is required if you want to use the default data collator.
  • max_completion_length (Optional[int], optional, defaults to None) — Maximum length of the completion. This argument is required if you want to use the default data collator and your model is an encoder-decoder.
  • beta (float, optional, defaults to 0.1) — Parameter controlling the relative ratio loss weight in the ORPO loss. In the paper, it is denoted by λ. In the code, it is denoted by alpha.
  • disable_dropout (bool, optional, defaults to True) — Whether to disable dropout in the model.
  • label_pad_token_id (int, optional, defaults to -100) — Label pad token id. This argument is required if you want to use the default data collator.
  • padding_value (Optional[int], optional, defaults to None) — Padding value to use. If None, the padding value of the tokenizer is used.
  • truncation_mode (str, optional, defaults to "keep_end") — Truncation mode to use when the prompt is too long. Possible values are "keep_end" or "keep_start". This argument is required if you want to use the default data collator.
  • generate_during_eval (bool, optional, defaults to False) — If True, generates and logs completions from the model to W&B during evaluation.
  • is_encoder_decoder (Optional[bool], optional, defaults to None) — When using the model_init argument (callable) to instantiate the model instead of the model argument, you need to specify if the model returned by the callable is an encoder-decoder model.
  • model_init_kwargs (Optional[Dict[str, Any]], optional, defaults to None) — Keyword arguments to pass to AutoModelForCausalLM.from_pretrained when instantiating the model from a string.
  • dataset_num_proc (Optional[int], optional, defaults to None) — Number of processes to use for processing the dataset.

Configuration class for the ORPOTrainer.

Using HfArgumentParser we can turn this class into argparse arguments that can be specified on the command line.

ORPOTrainer

class trl.ORPOTrainer

< >

( model: Union = None args: Optional = None data_collator: Optional = None train_dataset: Optional = None eval_dataset: Union = None tokenizer: Optional = None model_init: Optional = None callbacks: Optional = None optimizers: Tuple = (None, None) preprocess_logits_for_metrics: Optional = None peft_config: Optional = None compute_metrics: Optional = None )

Parameters

  • model (transformers.PreTrainedModel) — The model to train, preferably an AutoModelForCausalLM.
  • args (ORPOConfig) — The ORPO config arguments to use for training.
  • data_collator (transformers.DataCollator) — The data collator to use for training. If None is specified, the default data collator (DPODataCollatorWithPadding) will be used which will pad the sequences to the maximum length of the sequences in the batch, given a dataset of paired sequences.
  • train_dataset (datasets.Dataset) — The dataset to use for training.
  • eval_dataset (datasets.Dataset) — The dataset to use for evaluation.
  • tokenizer (transformers.PreTrainedTokenizerBase) — The tokenizer to use for training. This argument is required if you want to use the default data collator.
  • model_init (Callable[[], transformers.PreTrainedModel]) — The model initializer to use for training. If None is specified, the default model initializer will be used.
  • callbacks (List[transformers.TrainerCallback]) — The callbacks to use for training.
  • optimizers (Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]) — The optimizer and scheduler to use for training.
  • preprocess_logits_for_metrics (Callable[[torch.Tensor, torch.Tensor], torch.Tensor]) — The function to use to preprocess the logits before computing the metrics.
  • peft_config (Dict, defaults to None) — The PEFT configuration to use for training. If you pass a PEFT configuration, the model will be wrapped in a PEFT model.
  • compute_metrics (Callable[[EvalPrediction], Dict], optional) — The function to use to compute the metrics. Must take an EvalPrediction and return a dictionary mapping strings to metric values.

Initialize ORPOTrainer.

build_tokenized_answer

< >

( prompt answer )

The Llama tokenizer does not satisfy enc(a + b) = enc(a) + enc(b). It does ensure enc(a + b) = enc(a) + enc(a + b)[len(enc(a)):]. Reference: https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257
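
The slicing identity above can be illustrated with a toy tokenizer. Everything in this sketch is hypothetical: toy_encode is not the Llama tokenizer (it only emits a start marker followed by one token per character), and split_prompt_answer is an illustrative helper, not the method's actual body. It also assumes that the prompt tokens form a stable prefix of the tokens of prompt + answer, which this toy tokenizer guarantees.

```python
def toy_encode(text):
    """Hypothetical tokenizer: a start marker, then one token per character."""
    return ["<s>"] + list(text)


def split_prompt_answer(prompt, answer, encode=toy_encode):
    """Tokenize prompt + answer once, then slice off the prompt tokens."""
    full_ids = encode(prompt + answer)
    prompt_ids = encode(prompt)
    # Answer tokens are whatever follows the prompt tokens in the joint encoding.
    answer_ids = full_ids[len(prompt_ids):]
    return prompt_ids, answer_ids


prompt, answer = "Hi ", "there"
prompt_ids, answer_ids = split_prompt_answer(prompt, answer)

# Encoding the two pieces separately adds a second start marker,
# so the naive concatenation differs from the joint encoding.
print(toy_encode(prompt) + toy_encode(answer) == toy_encode(prompt + answer))
# Slicing the joint encoding reproduces it exactly.
print(prompt_ids + answer_ids == toy_encode(prompt + answer))
```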

concatenated_forward

< >

( model: Module batch: Dict )

Run the given model on the given batch of inputs, concatenating the chosen and rejected inputs together.

We do this to avoid doing two forward passes, because it’s faster for FSDP.

concatenated_inputs

< >

( batch: Dict is_encoder_decoder: bool = False label_pad_token_id: int = -100 padding_value: int = 0 device: Optional = None )

Concatenate the chosen and rejected inputs into a single tensor.
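
A minimal list-based sketch of this idea follows. The helper names pad_to_length and concatenate_chosen_rejected are hypothetical, and the sketch handles only token ids (no attention masks, labels or device placement): every sequence is right-padded with padding_value to a common length, and the chosen rows are stacked on top of the rejected rows.

```python
def pad_to_length(ids, length, pad_value):
    """Right-pad a list of token ids to `length` with `pad_value`."""
    return ids + [pad_value] * (length - len(ids))


def concatenate_chosen_rejected(chosen_ids, rejected_ids, padding_value=0):
    """Stack chosen and rejected sequences into one rectangular batch.

    chosen_ids / rejected_ids: lists of token-id lists, one entry per example.
    """
    sequences = chosen_ids + rejected_ids
    max_len = max(len(seq) for seq in sequences)
    return [pad_to_length(seq, max_len, padding_value) for seq in sequences]


batch = concatenate_chosen_rejected([[1, 2, 3]], [[4]])
print(batch)
```

Because the first len(chosen_ids) rows are the chosen sequences, the output of a single forward pass over the stacked batch can be split back into its chosen and rejected halves.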

evaluation_loop

< >

( dataloader: DataLoader description: str prediction_loss_only: Optional = None ignore_keys: Optional = None metric_key_prefix: str = 'eval' )

Overriding built-in evaluation loop to store metrics for each batch. Prediction/evaluation loop, shared by Trainer.evaluate() and Trainer.predict().

Works both with and without labels.

get_batch_logps

< >

( logits: FloatTensor labels: LongTensor average_log_prob: bool = False label_pad_token_id: int = -100 is_encoder_decoder: bool = False )

Compute the log probabilities of the given labels under the given logits.

get_batch_loss_metrics

< >

( model batch: Dict train_eval: Literal = 'train' )

Compute the ORPO loss and other metrics for the given batch of inputs for train or test.

get_batch_samples

< >

( model batch: Dict )

Generate samples from the model for the given batch of inputs.

log

< >

( logs: Dict )

Parameters

  • logs (Dict[str, float]) — The values to log.

Log logs on the various objects watching training, including stored metrics.

odds_ratio_loss

< >

( policy_chosen_logps: FloatTensor policy_rejected_logps: FloatTensor ) A tuple of three tensors

Returns

A tuple of three tensors

(losses, chosen_rewards, rejected_rewards). The losses tensor contains the ORPO loss for each example in the batch. The chosen_rewards and rejected_rewards tensors contain the rewards for the chosen and rejected responses, respectively. Two further values are described for logging purposes: the log odds ratio of the chosen responses over the rejected responses, and log(sigmoid(log_odds_chosen)).

Compute ORPO’s odds ratio (OR) loss for a batch of policy model log probabilities.
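
The odds ratio term can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the trainer's code: the name odds_ratio_term is hypothetical, lists of floats stand in for tensors, each input is assumed to be a log probability strictly below zero (for example a length-averaged sequence log probability), and the sketch does not show how the term is combined with the language-modelling loss.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def odds_ratio_term(policy_chosen_logps, policy_rejected_logps, beta=0.1):
    """Hypothetical list-based helper; inputs are log probabilities strictly below 0."""
    terms, log_odds_ratios = [], []
    for chosen, rejected in zip(policy_chosen_logps, policy_rejected_logps):
        # log odds(p) = log p - log(1 - p), so the log of the odds ratio
        # is the difference between the two log odds.
        log_odds = (chosen - rejected) - (
            math.log1p(-math.exp(chosen)) - math.log1p(-math.exp(rejected))
        )
        terms.append(beta * log_sigmoid(log_odds))
        log_odds_ratios.append(log_odds)
    return terms, log_odds_ratios


# Chosen probability 0.5 (odds 1) against rejected probability 0.25 (odds 1/3).
terms, log_odds_ratios = odds_ratio_term([math.log(0.5)], [math.log(0.25)])
print(terms, log_odds_ratios)
```

In this example the odds ratio is 3, so the log odds ratio is log(3) and the weighted term is beta * log(0.75).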

tokenize_row

< >

( feature model: Union = None )

Tokenize a single row from an ORPO specific dataset.

At this stage, we don’t convert to PyTorch tensors yet; we just handle the truncation in case the prompt + chosen or prompt + rejected responses is/are too long. First we truncate the prompt; if we’re still too long, we truncate the chosen/rejected.

We also create the labels for the chosen/rejected responses, which are of length equal to the sum of the length of the prompt and the chosen/rejected response, with label_pad_token_id for the prompt tokens.
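
The truncation order and label construction described above can be sketched as follows. The helper build_labels is hypothetical and operates on already-tokenized id lists; it assumes max_prompt_length is positive and smaller than max_length, and it leaves out special tokens and attention masks.

```python
def build_labels(prompt_ids, response_ids, max_length, max_prompt_length,
                 truncation_mode="keep_end", label_pad_token_id=-100):
    """Hypothetical helper: truncate, then build input ids and masked labels."""
    # Step 1: if the pair is too long, truncate the prompt first.
    if len(prompt_ids) + len(response_ids) > max_length:
        if truncation_mode == "keep_start":
            prompt_ids = prompt_ids[:max_prompt_length]
        elif truncation_mode == "keep_end":
            prompt_ids = prompt_ids[-max_prompt_length:]
        else:
            raise ValueError(f"Unknown truncation mode: {truncation_mode}")

    # Step 2: if it is still too long, truncate the response.
    if len(prompt_ids) + len(response_ids) > max_length:
        response_ids = response_ids[: max_length - max_prompt_length]

    # Labels cover prompt + response; prompt positions are masked out.
    input_ids = prompt_ids + response_ids
    labels = [label_pad_token_id] * len(prompt_ids) + response_ids
    return input_ids, labels


input_ids, labels = build_labels([1, 2, 3, 4], [5, 6, 7, 8],
                                 max_length=5, max_prompt_length=2)
print(input_ids, labels)
```

The same helper would be applied once to the chosen response and once to the rejected response of each row.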

PPOConfig

class trl.PPOConfig

< >

( exp_name: str = 'doc-buil' seed: int = 0 log_with: Optional = None task_name: Optional = None model_name: str = 'gpt2' query_dataset: str = 'stanfordnlp/imdb' reward_model: str = 'sentiment-analysis:lvwerra/distilbert-imdb' remove_unused_columns: bool = True tracker_kwargs: Annotated = <factory> accelerator_kwargs: Annotated = <factory> project_kwargs: Annotated = <factory> tracker_project_name: str = 'trl' push_to_hub_if_best_kwargs: Annotated = <factory> steps: int = 20000 learning_rate: float = 1.41e-05 adap_kl_ctrl: bool = True init_kl_coef: float = 0.2 kl_penalty: Literal = 'kl' target: float = 6.0 horizon: float = 10000.0 gamma: float = 1.0 lam: float = 0.95 cliprange: float = 0.2 cliprange_value: float = 0.2 vf_coef: float = 0.1 batch_size: int = 128 forward_batch_size: Optional = None mini_batch_size: int = 128 gradient_accumulation_steps: int = 1 world_size: Annotated = None ppo_epochs: int = 4 max_grad_norm: Optional = None optimize_cuda_cache: Optional = None optimize_device_cache: bool = False early_stopping: bool = False target_kl: float = 1.0 compare_steps: int = 1 ratio_threshold: float = 10.0 use_score_scaling: bool = False use_score_norm: bool = False score_clip: Optional = None whiten_rewards: bool = False gradient_checkpointing: bool = False is_encoder_decoder: Optional = None is_peft_model: Optional = None backward_batch_size: Annotated = None global_backward_batch_size: Optional = None global_batch_size: Annotated = None dataset_num_proc: Optional = None )

Parameters

  • exp_name (str, optional, defaults to os.path.basename(__file__)[:-len(".py")]) — Name of this experiment.
  • seed (int, optional, defaults to 0) — Random seed.
  • log_with (Optional[Literal["wandb", "tensorboard"]], optional, defaults to None) — Log with either "wandb" or "tensorboard". Check tracking for more details.
  • task_name (Optional[str], optional, defaults to None) — Name of task to use - used only for tracking purposes.
  • model_name (Optional[str], optional, defaults to "gpt2") — Name of model to use - used only for tracking purposes.
  • query_dataset (Optional[str], optional, defaults to "stanfordnlp/imdb") — Name of dataset to query - used only for tracking purposes.
  • reward_model (Optional[str], optional, defaults to "sentiment-analysis:lvwerra/distilbert-imdb") — Reward model to use - used only for tracking purposes.
  • remove_unused_columns (bool, optional, defaults to True) — Remove unused columns from the dataset.
  • tracker_kwargs (JSONDict, optional, defaults to {}) — Keyword arguments for the tracker (e.g. python ppo.py --tracker_kwargs='{"wandb": {"entity": "my_wandb_entity", "name": "my_exp_name"}}').
  • accelerator_kwargs (JSONDict, optional, defaults to {}) — Keyword arguments for the accelerator.
  • project_kwargs (JSONDict, optional, defaults to {}) — Keyword arguments for the accelerator project config (e.g. logging_dir).
  • tracker_project_name (str, optional, defaults to "trl") — Name of project to use for tracking.
  • push_to_hub_if_best_kwargs (JSONDict, optional, defaults to {}) — Keyword arguments for pushing model to the hub during training (e.g. repo_id).
  • steps (int, optional, defaults to 20000) — Number of training steps.
  • learning_rate (float, optional, defaults to 1.41e-5) — Learning rate for the optimizer.
  • adap_kl_ctrl (bool, optional, defaults to True) — Use adaptive KL control, otherwise linear.
  • init_kl_coef (Optional[float], optional, defaults to 0.2) — Initial KL penalty coefficient (used for adaptive and linear control).
  • kl_penalty (Literal["kl", "abs", "mse", "full"], optional, defaults to "kl") — kl penalty options. Possible values are:

    • "kl": model_logp - ref_logp
    • "abs": abs(kl)
    • "mse": mean squared error mse(kl)
    • "full": the actual kl for all tokens in the distribution.
  • target (float, optional, defaults to 6.0) — Target KL value for adaptive KL control.
  • horizon (float, optional, defaults to 10000.0) — Horizon for adaptive KL control.
  • gamma (float, optional, defaults to 1.0) — Gamma parameter for advantage calculation.
  • lam (float, optional, defaults to 0.95) — Lambda parameter for advantage calculation.
  • cliprange (float, optional, defaults to 0.2) — Range for clipping in PPO policy gradient loss.
  • cliprange_value (float, optional, defaults to 0.2) — Range for clipping values in loss calculation.
  • vf_coef (float, optional, defaults to 0.1) — Scaling factor for value loss.
  • batch_size (int, optional, defaults to 128) — Number of samples per optimisation step.
  • forward_batch_size (Optional[int], optional, defaults to None) — DEPRECATED: use mini_batch_size instead, which does the same thing.
  • mini_batch_size (int, optional, defaults to 128) — Number of samples optimized in each mini batch.
  • gradient_accumulation_steps (int, optional, defaults to 1) — Number of gradient accumulation steps.
  • world_size (Optional[int], optional, defaults to None) — Number of processes to use for distributed training.
  • ppo_epochs (int, optional, defaults to 4) — Number of optimisation epochs per batch of samples.
  • optimize_device_cache (bool, optional, defaults to False) — Optimize device cache for slightly more memory-efficient training.
  • early_stopping (bool, optional, defaults to False) — Whether to stop the PPO optimization loop early if the KL is too high.
  • target_kl (float, optional, defaults to 1.0) — Stop early if we exceed this value by over 50%.
  • compare_steps (int, optional, defaults to 1) — Compare the current step with the previous compare_steps steps.
  • ratio_threshold (float, optional, defaults to 10.0) — Skip mini-batches with high PPO ratios that can cause loss spikes.
  • use_score_scaling (bool, optional, defaults to False) — Use score scaling.
  • use_score_norm (bool, optional, defaults to False) — Use score normalization. Only applicable if use_score_scaling is True.
  • score_clip (Optional[float], optional, defaults to None) — Score clipping.
  • whiten_rewards (bool, optional, defaults to False) — Whiten the rewards before computing advantages.
  • is_encoder_decoder (Optional[bool], optional, defaults to None) — When using the model_init argument (callable) to instantiate the model instead of the model argument, you need to specify if the model returned by the callable is an encoder-decoder model.
  • is_peft_model (Optional[bool], optional, defaults to None) — Whether the model is a PEFT model.
  • backward_batch_size (Optional[int], optional, defaults to None) — Number of samples optimized in an optimizer.step() call.
  • global_backward_batch_size (Optional[int], optional, defaults to None) — Effective backward_batch_size across all processes.
  • global_batch_size (Optional[int], optional, defaults to None) — Effective batch_size across all processes.
  • dataset_num_proc (Optional[int], optional, defaults to None) — Number of processes to use for processing the dataset.
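
To make the roles of init_kl_coef, target and horizon more concrete, here is a small sketch of an adaptive KL controller in the style of the Ziegler et al. paper cited at the top of this page. The class name AdaptiveKLSketch is hypothetical, and the clipping of the proportional error to [-0.2, 0.2] is an assumption taken from that paper's description rather than something stated in the parameter list.

```python
class AdaptiveKLSketch:
    """Hypothetical adaptive controller for the KL penalty coefficient."""

    def __init__(self, init_kl_coef=0.2, target=6.0, horizon=10000.0):
        self.value = init_kl_coef  # current KL penalty coefficient
        self.target = target       # KL value the controller steers towards
        self.horizon = horizon     # larger horizon means slower adaptation

    def update(self, current_kl, n_steps):
        # Relative error between observed and target KL, clipped (assumed bounds).
        proportional_error = min(max(current_kl / self.target - 1.0, -0.2), 0.2)
        # Raise the coefficient when KL is above target, lower it when below.
        self.value *= 1.0 + proportional_error * n_steps / self.horizon
        return self.value


controller = AdaptiveKLSketch()
print(controller.update(current_kl=12.0, n_steps=128))
```

When the observed KL equals the target, the coefficient is left unchanged; a fixed (non-adaptive) controller would simply skip the update.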

Configuration class for the PPOTrainer.

Using HfArgumentParser we can turn this class into argparse arguments that can be specified on the command line.

PPOTrainer

class trl.PPOTrainer

< >

( config: Optional = None model: Optional = None ref_model: Optional = None tokenizer: Optional = None dataset: Union = None optimizer: Optional = None data_collator: Optional = None num_shared_layers: Optional = None lr_scheduler: Optional = None training_data_collator: Optional = None )

Parameters

  • config (PPOConfig) — Configuration object for PPOTrainer. Check the documentation of PPOConfig for more details.
  • model (PreTrainedModelWrapper) — Model to be optimized, Hugging Face transformer model with a value head. Check the documentation of PreTrainedModelWrapper for more details.
  • ref_model (PreTrainedModelWrapper, optional) — Reference model to be used for KL penalty, Hugging Face transformer model with a causal language modelling head. Check the documentation of PreTrainedModelWrapper for more details. If no reference model is provided, the trainer will create a reference model with the same architecture as the model to be optimized with shared layers.
  • tokenizer (PreTrainedTokenizerBase) — Tokenizer to be used for encoding the data. Check the documentation of transformers.PreTrainedTokenizer and transformers.PreTrainedTokenizerFast for more details.
  • dataset (Union[torch.utils.data.Dataset, datasets.Dataset], optional) — PyTorch dataset or Hugging Face dataset. This is used to create a PyTorch dataloader. If no dataset is provided, the dataloader must be created outside the trainer: users need to design their own dataloader and make sure the batch size that is used is the same as the one specified in the configuration object.
  • optimizer (torch.optim.Optimizer, optional) — Optimizer to be used for training. If no optimizer is provided, the trainer will create an Adam optimizer with the learning rate specified in the configuration object.
  • data_collator (DataCollatorForLanguageModeling, optional) — Data collator to be used for training and passed along the dataloader.
  • num_shared_layers (int, optional) — Number of layers to be shared between the model and the reference model, if no reference model is passed. If no number is provided, all the layers will be shared.
  • lr_scheduler (torch.optim.lr_scheduler, optional) — Learning rate scheduler to be used for training.

The PPOTrainer uses Proximal Policy Optimization to optimise language models. This trainer is heavily inspired by the original OpenAI learning-to-summarize work: https://github.com/openai/summarize-from-feedback

batched_forward_pass

( model: PreTrainedModelWrapper queries: Tensor responses: Tensor model_inputs: dict return_logits: bool = False response_masks: Optional = None ) (tuple)

Parameters

  • queries (torch.LongTensor) — List of tensors containing the encoded queries, shape (batch_size, query_length)
  • responses (torch.LongTensor) — List of tensors containing the encoded responses, shape (batch_size, response_length)
  • return_logits (bool, optional, defaults to False) — Whether to return all_logits. Set to False if logits are not needed to reduce memory consumption.

Returns

(tuple)

  • all_logprobs (torch.FloatTensor): Log probabilities of the responses, shape (batch_size, response_length)
  • all_ref_logprobs (torch.FloatTensor): Log probabilities of the responses, shape (batch_size, response_length)
  • all_values (torch.FloatTensor): Values of the responses, shape (batch_size, response_length)

Calculate model outputs in multiple batches.
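
The general pattern of computing outputs in several smaller batches can be sketched as follows. This is a generic illustration in plain Python, not the trainer's code: iter_minibatches, batched_apply and mini_batch_size are hypothetical names, and fn stands in for a model forward pass.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def iter_minibatches(items: Sequence[T], mini_batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most mini_batch_size items."""
    if mini_batch_size <= 0:
        raise ValueError("mini_batch_size must be positive")
    for start in range(0, len(items), mini_batch_size):
        yield list(items[start:start + mini_batch_size])


def batched_apply(
    fn: Callable[[List[T]], List[U]],
    items: Sequence[T],
    mini_batch_size: int,
) -> List[U]:
    """Apply fn chunk by chunk and concatenate the per-chunk outputs."""
    outputs: List[U] = []
    for chunk in iter_minibatches(items, mini_batch_size):
        outputs.extend(fn(chunk))
    return outputs
```

Processing in chunks bounds peak memory by the chunk size rather than the full batch size, at the cost of more forward calls.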

compute_rewards

( scores: FloatTensor logprobs: FloatTensor ref_logprobs: FloatTensor masks: LongTensor ) torch.FloatTensor

Parameters

  • scores (torch.FloatTensor) — Scores from the reward model, shape (batch_size)
  • logprobs (torch.FloatTensor) — Log probabilities of the model, shape (batch_size, response_length)
  • ref_logprobs (torch.FloatTensor) — Log probabilities of the reference model, shape (batch_size, response_length)

Returns

torch.FloatTensor

  • torch.FloatTensor: Per token rewards, shape (batch_size, response_length)
  • torch.FloatTensor: Non score rewards, shape (batch_size, response_length)
  • torch.FloatTensor: KL penalty, shape (batch_size, response_length)

Compute per token rewards from scores and KL-penalty.
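
A minimal sketch of this computation for a single response, in plain Python rather than torch. It assumes the KL term is the simple log-ratio logprob - ref_logprob and that the scalar score is added at the last unmasked response token. per_token_rewards and the kl_coef default are illustrative assumptions, not the trainer's actual signature or settings.

```python
from typing import List, Sequence, Tuple


def per_token_rewards(
    score: float,
    logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    mask: Sequence[int],
    kl_coef: float = 0.2,  # illustrative value, not a documented default
) -> Tuple[List[float], List[float], List[float]]:
    """Return (rewards, non_score_rewards, kls) for one response."""
    # Per-token KL estimate between the policy and the reference model.
    kls = [lp - ref for lp, ref in zip(logprobs, ref_logprobs)]
    # The KL penalty is the only reward most tokens receive.
    non_score_rewards = [-kl_coef * kl for kl in kls]
    rewards = list(non_score_rewards)
    # The reward-model score is credited to the last unmasked token.
    # Raises ValueError if the mask contains no unmasked token.
    last = max(i for i, m in enumerate(mask) if m)
    rewards[last] += score
    return rewards, non_score_rewards, kls
```

Because the score lands on a single token, every other position is shaped only by how far the policy drifts from the reference model.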

create_model_card

( path: str model_name: Optional = 'TRL Model' )

Parameters

  • path (str) — The path to save the model card to.
  • model_name (str, optional) — The name of the model, defaults to TRL Model.

Creates and saves a model card for a TRL model.

gather_stats

( stats ) dict[str, Any]

Parameters

  • stats (dict[str, Any]) — A dictionary of stats to be gathered. The stats should contain torch tensors.

Returns

dict[str, Any]

A dictionary of stats with the tensors gathered.

Gather stats from all processes. Useful in the context of distributed training.

generate

( query_tensor: Union length_sampler: Optional = None batch_size: int = 4 return_prompt: bool = True generate_ref_response: bool = False **generation_kwargs ) torch.LongTensor

Parameters

  • query_tensor (torch.LongTensor) — A tensor of shape (seq_len) containing query tokens or a list of tensors of shape (seq_len).
  • length_sampler (Callable, optional) — Callable that returns the number of newly generated tokens.
  • batch_size (int, optional) — Batch size used for generation, defaults to 4.
  • return_prompt (bool, optional) — If set to False the prompt is not returned but only the newly generated tokens, defaults to True.
  • generate_ref_response (bool, optional) — If set to True the reference response is also generated, defaults to False.
  • generation_kwargs (dict[str, Any]) — Keyword arguments for generation.

Returns

torch.LongTensor

A tensor of shape (batch_size, gen_len) containing response tokens.

Generate a response with the model given the query tensor. This calls the generate method of the model.

log_stats

( stats: dict batch: dict rewards: List columns_to_log: Iterable = ('query', 'response') )

Parameters

  • stats (dict[str, Any]) — A dictionary of training stats.
  • batch (dict[str, Any]) — A dictionary of batch data, this contains the queries and responses.
  • rewards (List[torch.FloatTensor]) — A list of reward tensors.

A function that logs all the training stats. Call it at the end of each epoch.

loss

( old_logprobs: FloatTensor values: FloatTensor logits: FloatTensor vpreds: FloatTensor logprobs: FloatTensor mask: LongTensor advantages: FloatTensor returns: FloatTensor )

Parameters

  • old_logprobs (torch.FloatTensor) — Log probabilities of the model, shape (batch_size, response_length)
  • values (torch.FloatTensor) — Values of the value head, shape (batch_size, response_length)
  • rewards (torch.FloatTensor) — Rewards from the reward model, shape (batch_size, response_length)
  • logits (torch.FloatTensor) — Logits of the model, shape (batch_size, response_length, vocab_size)
  • vpreds (torch.FloatTensor) — Values of the value head, shape (batch_size, response_length)
  • logprobs (torch.FloatTensor) — Log probabilities of the model, shape (batch_size, response_length)

Calculate policy and value losses.
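
For orientation, here is a sketch of the standard clipped PPO policy and value losses over one flattened sequence, written in plain Python. It follows the usual PPO formulation and is an assumption about the general technique, not a copy of this method. ppo_losses, cliprange and cliprange_value are illustrative names and values.

```python
import math
from typing import Sequence, Tuple


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(x, hi))


def masked_mean(values: Sequence[float], mask: Sequence[int]) -> float:
    """Average only the positions where mask is non-zero."""
    total = sum(v for v, m in zip(values, mask) if m)
    return total / sum(1 for m in mask if m)


def ppo_losses(
    old_logprobs: Sequence[float],
    logprobs: Sequence[float],
    advantages: Sequence[float],
    values: Sequence[float],
    vpreds: Sequence[float],
    returns: Sequence[float],
    mask: Sequence[int],
    cliprange: float = 0.2,        # illustrative value
    cliprange_value: float = 0.2,  # illustrative value
) -> Tuple[float, float]:
    """Return (policy_loss, value_loss) using clipped PPO objectives."""
    pg_terms, vf_terms = [], []
    for old_lp, lp, adv, v, vp, ret in zip(
        old_logprobs, logprobs, advantages, values, vpreds, returns
    ):
        # Probability ratio between the new and the old policy.
        ratio = math.exp(lp - old_lp)
        unclipped = -adv * ratio
        clipped = -adv * clamp(ratio, 1.0 - cliprange, 1.0 + cliprange)
        # Pessimistic (larger) loss of the clipped and unclipped terms.
        pg_terms.append(max(unclipped, clipped))
        # Keep the new value prediction close to the old value estimate.
        vp_clipped = clamp(vp, v - cliprange_value, v + cliprange_value)
        vf_terms.append(max((vp - ret) ** 2, (vp_clipped - ret) ** 2))
    return masked_mean(pg_terms, mask), 0.5 * masked_mean(vf_terms, mask)
```

Taking the maximum of the clipped and unclipped terms removes the incentive to move the ratio outside the clipping interval in the direction that would otherwise reduce the loss.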

prepare_dataloader

( dataset: Union data_collator = None ) torch.utils.data.DataLoader

Parameters

  • dataset (Union[torch.utils.data.Dataset, datasets.Dataset]) — PyTorch dataset or Hugging Face dataset. If a Hugging Face dataset is passed, the dataset will be preprocessed by removing the columns that are not used by the model.
  • data_collator (Optional[function]) — Data collator function.

Returns

torch.utils.data.DataLoader

PyTorch dataloader

Prepare the dataloader for training.

record_step_stats

( kl_coef: float **data ) stats (dict)

Parameters

  • kl_coef (float) — KL coefficient
  • data (dict) — Dictionary of training step data

Returns

stats (dict)

Dictionary of training step statistics

Record training step statistics.

step

( queries: List responses: List scores: List response_masks: Optional = None ) dict[str, Any]

Parameters

  • queries (List[torch.LongTensor]) — List of tensors containing the encoded queries of shape (query_length)
  • responses (List[torch.LongTensor]) — List of tensors containing the encoded responses of shape (response_length)
  • scores (List[torch.FloatTensor]) — List of tensors containing the scores.
  • response_masks (List[torch.FloatTensor], optional) — List of tensors containing masks of the response tokens.

Returns

dict[str, Any]

A summary of the training statistics

Run a PPO optimisation step given a list of queries, model responses, and rewards.
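
The sketch below shows how generate, step and log_stats might be chained in a training loop. It only relies on the method signatures documented on this page. run_ppo_epoch, reward_fn and decode are hypothetical helpers, and the batch keys "input_ids" and "query" are assumptions about how the dataloader is set up.

```python
from typing import Any, Callable, Dict, Iterable, List


def run_ppo_epoch(
    trainer: Any,                           # a PPOTrainer-like object
    batches: Iterable[Dict[str, Any]],      # e.g. batches from a dataloader
    reward_fn: Callable[[Any, Any], Any],   # hypothetical: (query, response) -> score
    decode: Callable[[Any], str],           # hypothetical: response tokens -> text
    **generation_kwargs: Any,
) -> List[Dict[str, Any]]:
    """Generate, score, optimise and log once per batch."""
    all_stats = []
    for batch in batches:
        queries = batch["input_ids"]  # assumed key holding the query tensors
        # Only the newly generated tokens are needed as responses.
        responses = trainer.generate(
            queries, return_prompt=False, **generation_kwargs
        )
        # log_stats logs the "query" and "response" columns by default,
        # so the batch is assumed to hold "query" and gets "response" here.
        batch["response"] = [decode(r) for r in responses]
        scores = [reward_fn(q, r) for q, r in zip(queries, responses)]
        stats = trainer.step(queries, responses, scores)
        trainer.log_stats(stats, batch, scores)
        all_stats.append(stats)
    return all_stats
```

Logging after every step is a choice made for this sketch. The batch size of each step is expected to match the one in the configuration object.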

train_minibatch

( old_logprobs: FloatTensor values: FloatTensor logprobs: FloatTensor logits: FloatTensor vpreds: FloatTensor mask: LongTensor advantages: FloatTensor returns: FloatTensor ) train_stats (dict[str, torch.Tensor])

Parameters

  • logprobs (torch.FloatTensor) — Log probabilities of the model, shape [mini_batch_size, response_length]
  • values (torch.FloatTensor) — Values of the value head, shape [mini_batch_size, response_length]
  • query (torch.LongTensor) — Encoded queries, shape [mini_batch_size, query_length]
  • response (torch.LongTensor) — Encoded responses, shape [mini_batch_size, response_length]
  • model_input (torch.LongTensor) — Concatenated queries and responses, shape [mini_batch_size, query_length+response_length]

Returns

train_stats (dict[str, torch.Tensor])

Dictionary of training statistics

Train one PPO minibatch.

RewardConfig

class trl.RewardConfig

( output_dir: str overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False do_predict: bool = False eval_strategy: Union = 'no' prediction_loss_only: bool = False per_device_train_batch_size: int = 8 per_device_eval_batch_size: int = 8 per_gpu_train_batch_size: Optional = None per_gpu_eval_batch_size: Optional = None gradient_accumulation_steps: int = 1 eval_accumulation_steps: Optional = None eval_delay: Optional = 0 torch_empty_cache_steps: Optional = None learning_rate: float = 5e-05 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: float = 1e-08 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: Union = 'linear' lr_scheduler_kwargs: Union = <factory> warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: Optional = 'passive' log_level_replica: Optional = 'warning' log_on_each_node: bool = True logging_dir: Optional = None logging_strategy: Union = 'steps' logging_first_step: bool = False logging_steps: float = 500 logging_nan_inf_filter: bool = True save_strategy: Union = 'steps' save_steps: float = 500 save_total_limit: Optional = None save_safetensors: Optional = True save_on_each_node: bool = False save_only_model: bool = False restore_callback_states_from_checkpoint: bool = False no_cuda: bool = False use_cpu: bool = False use_mps_device: bool = False seed: int = 42 data_seed: Optional = None jit_mode_eval: bool = False use_ipex: bool = False bf16: bool = False fp16: bool = False fp16_opt_level: str = 'O1' half_precision_backend: str = 'auto' bf16_full_eval: bool = False fp16_full_eval: bool = False tf32: Optional = None local_rank: int = -1 ddp_backend: Optional = None tpu_num_cores: Optional = None tpu_metrics_debug: bool = False debug: Union = '' dataloader_drop_last: bool = False eval_steps: Optional = None dataloader_num_workers: int = 0 dataloader_prefetch_factor: Optional = None past_index: int = -1 run_name: Optional = None disable_tqdm: 
Optional = None remove_unused_columns: Optional = True label_names: Optional = None load_best_model_at_end: Optional = False metric_for_best_model: Optional = None greater_is_better: Optional = None ignore_data_skip: bool = False fsdp: Union = '' fsdp_min_num_params: int = 0 fsdp_config: Union = None fsdp_transformer_layer_cls_to_wrap: Optional = None accelerator_config: Union = None deepspeed: Union = None label_smoothing_factor: float = 0.0 optim: Union = 'adamw_torch' optim_args: Optional = None adafactor: bool = False group_by_length: bool = False length_column_name: Optional = 'length' report_to: Union = None ddp_find_unused_parameters: Optional = None ddp_bucket_cap_mb: Optional = None ddp_broadcast_buffers: Optional = None dataloader_pin_memory: bool = True dataloader_persistent_workers: bool = False skip_memory_metrics: bool = True use_legacy_prediction_loop: bool = False push_to_hub: bool = False resume_from_checkpoint: Optional = None hub_model_id: Optional = None hub_strategy: Union = 'every_save' hub_token: Optional = None hub_private_repo: bool = False hub_always_push: bool = False gradient_checkpointing: bool = False gradient_checkpointing_kwargs: Union = None include_inputs_for_metrics: bool = False eval_do_concat_batches: bool = True fp16_backend: str = 'auto' evaluation_strategy: Union = None push_to_hub_model_id: Optional = None push_to_hub_organization: Optional = None push_to_hub_token: Optional = None mp_parameters: str = '' auto_find_batch_size: bool = False full_determinism: bool = False torchdynamo: Optional = None ray_scope: Optional = 'last' ddp_timeout: Optional = 1800 torch_compile: bool = False torch_compile_backend: Optional = None torch_compile_mode: Optional = None dispatch_batches: Optional = None split_batches: Optional = None include_tokens_per_second: Optional = False include_num_input_tokens_seen: Optional = False neftune_noise_alpha: Optional = None optim_target_modules: Union = None batch_eval_metrics: bool = False 
eval_on_start: bool = False use_liger_kernel: Optional = False eval_use_gather_object: Optional = False max_length: Optional = None dataset_num_proc: Optional = None center_rewards_coefficient: Optional = None )

Parameters

  • max_length (Optional[int], optional, defaults to None) — Maximum length of the sequences (prompt + completion) in the batch. This argument is required if you want to use the default data collator.
  • dataset_num_proc (int, optional, defaults to None) — Number of processes to use for processing the dataset.
  • center_rewards_coefficient (float, optional, defaults to None) — Coefficient to incentivize the reward model to output mean-zero rewards (proposed by https://huggingface.co/papers/2312.09244, Eq. 2). Recommended value: 0.01.
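
To make the effect of center_rewards_coefficient concrete, the sketch below adds a centering penalty to a pairwise reward loss. It assumes the penalty is the mean squared sum of the chosen and rejected rewards, following Eq. 2 of the linked paper. The trainer's exact implementation may differ, and centered_pairwise_loss is a hypothetical name.

```python
import math
from typing import Optional, Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def centered_pairwise_loss(
    rewards_chosen: Sequence[float],
    rewards_rejected: Sequence[float],
    center_rewards_coefficient: Optional[float] = None,
) -> float:
    """Pairwise preference loss with an optional mean-zero penalty."""
    n = len(rewards_chosen)
    pairs = list(zip(rewards_chosen, rewards_rejected))
    # Base term: prefer the chosen sequence over the rejected one.
    loss = -sum(log_sigmoid(c - r) for c, r in pairs) / n
    if center_rewards_coefficient is not None:
        # The penalty grows when the two rewards do not sum to zero.
        penalty = sum((c + r) ** 2 for c, r in pairs) / n
        loss += center_rewards_coefficient * penalty
    return loss
```

The base term depends only on the difference between the two rewards, so without the penalty the model is free to shift all rewards by a constant.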

Configuration class for the RewardTrainer.

Using HfArgumentParser we can turn this class into argparse arguments that can be specified on the command line.

RewardTrainer

class trl.RewardTrainer

( model: Union = None args: Optional = None data_collator: Optional = None train_dataset: Optional = None eval_dataset: Union = None tokenizer: Optional = None model_init: Optional = None compute_metrics: Optional = None callbacks: Optional = None optimizers: Tuple = (None, None) preprocess_logits_for_metrics: Optional = None max_length: Optional = None peft_config: Optional = None )

The RewardTrainer can be used to train your custom Reward Model. It is a subclass of the transformers.Trainer class and inherits all of its attributes and methods. It is recommended to use an AutoModelForSequenceClassification as the reward model. The reward model should be trained on a dataset of paired examples, where each example is a tuple of two sequences. The reward model should be trained to predict which example in the pair is more relevant to the task at hand.

The reward trainer expects a very specific format for the dataset. The dataset should contain at least the following 4 entries if you don’t use the default RewardDataCollatorWithPadding data collator. The entries should be named:

  • input_ids_chosen
  • attention_mask_chosen
  • input_ids_rejected
  • attention_mask_rejected

Optionally, you can also pass a margin entry to the dataset. This entry should contain the margin used to modulate the loss of the reward model as outlined in https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/. If you don’t pass a margin, no margin will be used.
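
As an illustration of the expected entries and of how a margin can modulate the loss, here is a small plain-Python sketch. The token ids are made up, check_example and margin_pairwise_loss are hypothetical helpers, and the loss form (the margin subtracted from the reward difference inside the log-sigmoid) is an assumption based on the Llama 2 formulation referenced above.

```python
import math
from typing import Any, Dict

REQUIRED_KEYS = (
    "input_ids_chosen",
    "attention_mask_chosen",
    "input_ids_rejected",
    "attention_mask_rejected",
)


def check_example(example: Dict[str, Any]) -> None:
    """Raise KeyError if one of the four required entries is missing."""
    missing = [key for key in REQUIRED_KEYS if key not in example]
    if missing:
        raise KeyError(f"missing entries: {missing}")


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def margin_pairwise_loss(
    reward_chosen: float, reward_rejected: float, margin: float = 0.0
) -> float:
    """A larger margin demands a larger gap between the two rewards."""
    return -log_sigmoid(reward_chosen - reward_rejected - margin)


# One paired example with made-up token ids and an optional margin entry.
example = {
    "input_ids_chosen": [101, 2023, 102],
    "attention_mask_chosen": [1, 1, 1],
    "input_ids_rejected": [101, 2008, 102],
    "attention_mask_rejected": [1, 1, 1],
    "margin": 0.5,
}
check_example(example)
```

With a margin of 0 the loss reduces to the plain pairwise form, which matches the behaviour described when no margin is passed.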

visualize_samples

( num_print_samples: int )

Parameters

  • num_print_samples (int, defaults to 4) — The number of samples to print. Set to -1 to print all samples.

Visualize the reward model logits prediction.

SFTTrainer

class trl.SFTTrainer

( model: Union = None args: Optional = None data_collator: Optional = None train_dataset: Optional = None eval_dataset: Union = None tokenizer: Optional = None model_init: Optional = None compute_metrics: Optional = None callbacks: Optional = None optimizers: Tuple = (None, None) preprocess_logits_for_metrics: Optional = None peft_config: Optional = None dataset_text_field: Optional = None packing: Optional = False formatting_func: Optional = None max_seq_length: Optional = None infinite: Optional = None num_of_sequences: Optional = None chars_per_token: Optional = None dataset_num_proc: Optional = None dataset_batch_size: Optional = None neftune_noise_alpha: Optional = None model_init_kwargs: Optional = None dataset_kwargs: Optional = None eval_packing: Optional = None )

Parameters

  • model (Union[transformers.PreTrainedModel, nn.Module, str]) — The model to train, can be a PreTrainedModel, a torch.nn.Module or a string with the model name to load from cache or download. The model can be also converted to a PeftModel if a PeftConfig object is passed to the peft_config argument.
  • args (Optional[SFTConfig]) — The arguments to tweak for training. Will default to a basic instance of SFTConfig with the output_dir set to a directory named tmp_trainer in the current directory if not provided.
  • data_collator (Optional[transformers.DataCollator]) — The data collator to use for training.
  • train_dataset (Optional[datasets.Dataset]) — The dataset to use for training. We recommend users to use trl.trainer.ConstantLengthDataset to create their dataset.
  • eval_dataset (Optional[Union[datasets.Dataset, Dict[str, datasets.Dataset]]]) — The dataset to use for evaluation. We recommend users to use trl.trainer.ConstantLengthDataset to create their dataset.
  • tokenizer (Optional[transformers.PreTrainedTokenizer]) — The tokenizer to use for training. If not specified, the tokenizer associated with the model will be used.
  • model_init (Callable[[], transformers.PreTrainedModel]) — The model initializer to use for training. If None is specified, the default model initializer will be used.
  • compute_metrics (Callable[[transformers.EvalPrediction], Dict], optional, defaults to None) — The function used to compute metrics during evaluation. It should return a dictionary mapping metric names to metric values. If not specified, only the loss will be computed during evaluation.
  • callbacks (List[transformers.TrainerCallback]) — The callbacks to use for training.
  • optimizers (Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]) — The optimizer and scheduler to use for training.
  • preprocess_logits_for_metrics (Callable[[torch.Tensor, torch.Tensor], torch.Tensor]) — The function to use to preprocess the logits before computing the metrics.
  • peft_config (Optional[PeftConfig]) — The PeftConfig object to use to initialize the PeftModel.
  • formatting_func (Optional[Callable]) — The formatting function to be used for creating the ConstantLengthDataset.

Class definition of the Supervised Finetuning Trainer (SFT Trainer). This class is a wrapper around the transformers.Trainer class and inherits all of its attributes and methods. The trainer takes care of properly initializing the PeftModel in case a user passes a PeftConfig object.
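
A formatting_func could look like the sketch below. It assumes the function receives a single example as a dict and returns one training string. The exact contract (single example or batch) should be checked against the SFTTrainer guide, and the field names "question" and "answer" as well as the prompt template are hypothetical.

```python
from typing import Dict


def format_example(example: Dict[str, str]) -> str:
    """Join the fields of one dataset row into a single training text."""
    # "question" and "answer" are hypothetical column names.
    return (
        f"### Question: {example['question']}\n"
        f"### Answer: {example['answer']}"
    )
```

A function of this kind would be passed as formatting_func when the dataset does not already contain a single text column (dataset_text_field).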

set_seed

trl.set_seed

( seed: int )

Parameters

  • seed (int) — The seed to set.

Helper function for reproducible behavior to set the seed in random, numpy, and torch.
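
What such a helper typically does can be sketched as follows. This is not the source of trl.set_seed: set_seed_sketch is a hypothetical name, and numpy and torch are treated as optional so the snippet runs with the standard library alone.

```python
import random


def set_seed_sketch(seed: int) -> None:
    """Seed the random number generators of random, numpy and torch."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # numpy is not installed; nothing to seed
    try:
        import torch

        torch.manual_seed(seed)
        if torch.cuda.is_available():
            # Seed every visible GPU as well.
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # torch is not installed; nothing to seed


if __name__ == "__main__":
    set_seed_sketch(42)
    first = random.random()
    set_seed_sketch(42)
    # The same seed reproduces the same draw.
    assert first == random.random()
```

Seeding makes the random draws repeatable, but some GPU kernels remain non-deterministic unless additional settings are enabled.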
