TRL documentation

Trainer

At TRL we support PPO (Proximal Policy Optimisation) with an implementation that largely follows the structure introduced in the paper “Fine-Tuning Language Models from Human Preferences” by D. Ziegler et al. [paper, code]. The trainer and model classes are largely inspired by the transformers.Trainer and transformers.AutoModel classes and adapted for RL. We also support a RewardTrainer that can be used to train a reward model.

PPOConfig

class trl.PPOConfig

( task_name: typing.Optional[str] = None model_name: typing.Optional[str] = None steps: typing.Optional[int] = 20000 learning_rate: typing.Optional[float] = 1e-05 adap_kl_ctrl: typing.Optional[bool] = True init_kl_coef: typing.Optional[float] = 0.2 kl_penalty: typing.Optional[str] = 'kl' target: typing.Optional[float] = 6 horizon: typing.Optional[float] = 10000 gamma: typing.Optional[float] = 1 lam: typing.Optional[float] = 0.95 cliprange: typing.Optional[float] = 0.2 cliprange_value: typing.Optional[float] = 0.2 vf_coef: typing.Optional[float] = 0.1 batch_size: typing.Optional[int] = 256 forward_batch_size: typing.Optional[int] = None mini_batch_size: typing.Optional[int] = 1 gradient_accumulation_steps: typing.Optional[int] = 1 ppo_epochs: typing.Optional[int] = 4 remove_unused_columns: typing.Optional[bool] = True log_with: typing.Optional[str] = None tracker_kwargs: typing.Optional[dict] = <factory> accelerator_kwargs: typing.Optional[dict] = <factory> project_kwargs: typing.Optional[dict] = <factory> tracker_project_name: typing.Optional[str] = 'trl' max_grad_norm: typing.Optional[float] = None seed: typing.Optional[int] = 0 optimize_cuda_cache: typing.Optional[bool] = False early_stopping: typing.Optional[bool] = False target_kl: typing.Optional[float] = 0.1 push_to_hub_if_best_kwargs: typing.Optional[dict] = <factory> compare_steps: typing.Optional[int] = 1 ratio_threshold: typing.Optional[float] = 10.0 )

Configuration class for PPOTrainer
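
For illustration only, a minimal configuration might be built as follows; the model name and hyperparameter values are arbitrary examples rather than recommendations.

from trl import PPOConfig

# All keyword arguments below appear in the signature above; the values are illustrative.
config = PPOConfig(
    model_name="gpt2",
    learning_rate=1.41e-5,
    batch_size=32,
    mini_batch_size=4,
    ppo_epochs=4,
    log_with=None,  # e.g. "wandb" or "tensorboard" to enable experiment tracking
)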

PPOTrainer

class trl.PPOTrainer

( config: PPOConfig = None model: PreTrainedModelWrapper = None ref_model: typing.Optional[trl.models.modeling_base.PreTrainedModelWrapper] = None tokenizer: PreTrainedTokenizerBase = None dataset: typing.Union[torch.utils.data.dataset.Dataset, datasets.arrow_dataset.Dataset, NoneType] = None optimizer: typing.Optional[torch.optim.optimizer.Optimizer] = None data_collator: typing.Optional[typing.Callable] = None num_shared_layers: typing.Optional[int] = None lr_scheduler: typing.Optional[torch.optim.lr_scheduler._LRScheduler] = None )

Parameters

  • **config** (PPOConfig) — Configuration object for PPOTrainer. Check the documentation of PPOConfig for more details.
  • **model** (PreTrainedModelWrapper) — Model to be optimized, a Hugging Face transformer model with a value head. Check the documentation of PreTrainedModelWrapper for more details.
  • **ref_model** (PreTrainedModelWrapper, optional) — Reference model used for the KL penalty, a Hugging Face transformer model with a causal language modelling head. Check the documentation of PreTrainedModelWrapper for more details. If no reference model is provided, the trainer creates a reference model with the same architecture as the model to be optimized, with shared layers.
  • **tokenizer** (PreTrainedTokenizerBase) — Tokenizer used for encoding the data. Check the documentation of transformers.PreTrainedTokenizer and transformers.PreTrainedTokenizerFast for more details.
  • **dataset** (Union[torch.utils.data.Dataset, datasets.Dataset], optional) — PyTorch dataset or Hugging Face dataset, used to create a PyTorch dataloader. If no dataset is provided, users need to build their own dataloader outside the trainer and make sure the batch size it uses matches the one specified in the configuration object.
  • **optimizer** (torch.optim.Optimizer, optional) — Optimizer used for training. If no optimizer is provided, the trainer creates an Adam optimizer with the learning rate specified in the configuration object.
  • **data_collator** (DataCollatorForLanguageModeling, optional) — Data collator used for training and passed along to the dataloader.
  • **num_shared_layers** (int, optional) — Number of layers shared between the model and the reference model when no reference model is passed. If no number is provided, all layers are shared.
  • **lr_scheduler** (torch.optim.lr_scheduler, optional) — Learning rate scheduler used for training.

The PPOTrainer uses Proximal Policy Optimization to optimise language models. Note that this trainer is heavily inspired by the original OpenAI learning-to-summarize work: https://github.com/openai/summarize-from-feedback
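
A rough sketch of how the trainer might be instantiated, assuming gpt2, the config built above, a tiny toy dataset, and the identity-style collator used in the TRL example scripts; if ref_model is omitted, the trainer builds a shared-layer reference model itself.

from datasets import Dataset
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOTrainer

model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

# Tiny toy dataset with a "query" text column and pre-tokenized "input_ids".
dataset = Dataset.from_dict({"query": ["Hello, my name is", "The weather today is"]})
dataset = dataset.map(lambda sample: {"input_ids": tokenizer.encode(sample["query"])})
dataset.set_format(type="torch")

# Collator that regroups a list of samples into a dict of lists, as in the TRL example scripts.
def collator(data):
    return {key: [d[key] for d in data] for key in data[0]}

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator)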

batched_forward_pass

( model: PreTrainedModelWrapper queries: Tensor responses: Tensor model_inputs: dict return_logits: bool = False ) → (tuple)

Parameters

  • queries (torch.LongTensor) — List of tensors containing the encoded queries, shape (batch_size, query_length)
  • responses (torch.LongTensor) — List of tensors containing the encoded responses, shape (batch_size, response_length)
  • return_logits (bool, optional, defaults to False) — Whether to return all_logits. Set to False if logits are not needed to reduce memory consumption.

Returns

(tuple)

  • all_logprobs (torch.FloatTensor): Log probabilities of the responses under the model, shape (batch_size, response_length)
  • all_ref_logprobs (torch.FloatTensor): Log probabilities of the responses under the reference model, shape (batch_size, response_length)
  • all_values (torch.FloatTensor): Values of the responses, shape (batch_size, response_length)

Calculate model outputs in multiple batches.

compute_rewards

( scores: FloatTensor logprobs: FloatTensor ref_logprobs: FloatTensor masks: LongTensor )

Parameters

  • scores (torch.FloatTensor) — Scores from the reward model, shape (batch_size)
  • logprobs (torch.FloatTensor) — Log probabilities of the model, shape (batch_size, response_length)
  • ref_logprobs (torch.FloatTensor) — Log probabilities of the reference model, shape (batch_size, response_length)
  • masks (torch.LongTensor) — Masks for the response tokens, shape (batch_size, response_length)

Compute per-token rewards from the reward-model scores and the KL penalty.
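
Conceptually, the method combines a per-token KL penalty between the model and the reference model with the scalar reward-model score. The snippet below is only a rough sketch of that idea, not the actual implementation, which also respects the response masks and the configured kl_penalty variant.

import torch

def compute_rewards_sketch(scores, logprobs, ref_logprobs, kl_coef):
    # Per-token KL estimate between the model and the reference model.
    kl = logprobs - ref_logprobs           # shape (batch_size, response_length)
    rewards = -kl_coef * kl                # KL penalty applied to every response token
    rewards[:, -1] += scores               # reward-model score added at the final token
    return rewards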

create_model_card

( path: str model_name: typing.Optional[str] = 'TRL Model' )

Parameters

  • path (str) — The path to save the model card to.
  • model_name (str, optional) — The name of the model, defaults to TRL Model.

Creates and saves a model card for a TRL model.
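
Assuming a ppo_trainer instance like the one sketched above, a call might look like this; the path and model name are illustrative.

ppo_trainer.create_model_card("./ppo_model", model_name="gpt2-ppo-example")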

gather_stats

( stats ) → dict[str, Any]

Parameters

  • stats (dict[str, Any]) — A dictionary of stats to be gathered. The stats should contain torch tensors.

Returns

dict[str, Any]

A dictionary of stats with the tensors gathered.

Gather stats from all processes. Useful in the context of distributed training.

generate

( query_tensor: typing.Union[torch.Tensor, typing.List[torch.Tensor]] length_sampler: typing.Callable = None batch_size: int = 4 return_prompt: bool = True **generation_kwargs ) → torch.LongTensor

Parameters

  • query_tensor (torch.LongTensor) — A tensor of shape (seq_len) containing query tokens, or a list of such tensors.
  • generation_kwargs (dict[str, Any]) — Keyword arguments for generation.
  • length_sampler (Callable, optional) — Callable that returns the number of newly generated tokens.
  • batch_size (int, optional) — Batch size used for generation, defaults to 4.
  • return_prompt (bool, optional) — If set to False, only the newly generated tokens are returned. Defaults to True.

Returns

torch.LongTensor

A tensor of shape (batch_size, gen_len) containing response tokens.

Generate a response with the model given the query tensor, by calling the generate method of the model.
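
A typical pattern, borrowed from the TRL example scripts, generates responses for a list of query tensors; the generation_kwargs are ordinary transformers generate() arguments, the values are illustrative, and ppo_trainer and tokenizer are assumed to come from the sketch above.

generation_kwargs = {
    "do_sample": True,
    "top_k": 0,
    "top_p": 1.0,
    "max_new_tokens": 32,
    "pad_token_id": tokenizer.eos_token_id,
}

# query_tensors is a list of 1-D LongTensors, e.g. batch["input_ids"] from the dataloader.
response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, **generation_kwargs)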

log_stats

( stats: dict batch: dict rewards: typing.List[torch.FloatTensor] )

Parameters

  • stats (dict[str, Any]) — A dictionary of training stats.
  • batch (dict[str, Any]) — A dictionary of batch data; it contains the queries and responses.
  • rewards (List[torch.FloatTensor]) — A list of reward tensors.

A function that logs all the training stats. Call it at the end of each epoch.
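
In the example scripts this is called once per PPO step; the batch dictionary is assumed to hold the decoded texts under the "query" and "response" keys, and stats is the dictionary returned by step().

batch["response"] = [tokenizer.decode(r) for r in response_tensors]  # decoded responses for logging
ppo_trainer.log_stats(stats, batch, rewards)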

loss

( old_logprobs: FloatTensor values: FloatTensor rewards: FloatTensor logits: FloatTensor vpreds: FloatTensor logprobs: FloatTensor mask: LongTensor )

Parameters

  • old_logprobs (torch.FloatTensor) — Log probabilities of the model at rollout time, shape (batch_size, response_length)
  • values (torch.FloatTensor) — Values of the value head at rollout time, shape (batch_size, response_length)
  • rewards (torch.FloatTensor) — Per-token rewards computed by compute_rewards, shape (batch_size, response_length)
  • logits (torch.FloatTensor) — Logits of the model, shape (batch_size, response_length, vocab_size)
  • vpreds (torch.FloatTensor) — Current values of the value head, shape (batch_size, response_length)
  • logprobs (torch.FloatTensor) — Current log probabilities of the model, shape (batch_size, response_length)
  • mask (torch.LongTensor) — Mask for the response tokens, shape (batch_size, response_length)

Calculate policy and value losses.
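
The policy part follows the standard clipped PPO surrogate objective. A rough sketch of that term, omitting masking and the clipped value-function loss weighted by vf_coef, could look like this:

import torch

def clipped_policy_loss_sketch(logprobs, old_logprobs, advantages, cliprange):
    # Probability ratio between the current policy and the rollout policy.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = -advantages * ratio
    clipped = -advantages * torch.clamp(ratio, 1.0 - cliprange, 1.0 + cliprange)
    # PPO takes the pessimistic (larger) of the two losses, averaged over tokens.
    return torch.max(unclipped, clipped).mean()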

prepare_dataloader

( dataset: typing.Union[torch.utils.data.dataset.Dataset, datasets.arrow_dataset.Dataset] data_collator = None ) → torch.utils.data.DataLoader

Parameters

  • dataset (Union[torch.utils.data.Dataset, datasets.Dataset]) — PyTorch dataset or Hugging Face dataset. If a Hugging Face dataset is passed, the dataset will be preprocessed by removing the columns that are not used by the model.
  • data_collator (Optional[function]) — Data collator function.

Returns

torch.utils.data.DataLoader

PyTorch dataloader

Prepare the dataloader for training.
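
When a dataset is passed to the constructor this is called internally; it can also be invoked directly, assuming the dataset and collator from the PPOTrainer sketch above.

dataloader = ppo_trainer.prepare_dataloader(dataset, data_collator=collator)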

record_step_stats

( kl_coef: float **data ) → stats (dict)

Parameters

  • kl_coef (float) — KL coefficient
  • data (dict) — Dictionary of training step data

Returns

stats (dict)

Dictionary of training step statistics

Record training step statistics.

step

( queries: typing.List[torch.LongTensor] responses: typing.List[torch.LongTensor] scores: typing.List[torch.FloatTensor] ) → dict[str, Any]

Parameters

  • queries (List[torch.LongTensor]) — List of tensors containing the encoded queries of shape (query_length)
  • responses (List[torch.LongTensor]) — List of tensors containing the encoded responses of shape (response_length)
  • scores (List[torch.FloatTensor]) — List of tensors containing the scores, one scalar tensor per query/response pair.

Returns

dict[str, Any]

A summary of the training statistics

Run a PPO optimisation step given a list of queries, model responses, and rewards.
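
Putting the pieces together, one pass over the dataloader in the style of the TRL example scripts looks roughly like this; reward_model_fn is a hypothetical placeholder for whatever produces a scalar score per query/response pair, and generation_kwargs comes from the generate() sketch above.

import torch

for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]

    # Generate one response per query (see generate() above).
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, **generation_kwargs)
    batch["response"] = [tokenizer.decode(r) for r in response_tensors]

    # Score each query/response pair; reward_model_fn is a placeholder for your own reward model.
    rewards = [
        torch.tensor(reward_model_fn(q, r), dtype=torch.float)
        for q, r in zip(batch["query"], batch["response"])
    ]

    # One PPO optimisation step, followed by logging.
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)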

train_minibatch

( old_logprobs: FloatTensor values: FloatTensor rewards: FloatTensor logprobs: FloatTensor logits: FloatTensor vpreds: FloatTensor mask: LongTensor ) → train_stats (dict[str, torch.Tensor])

Parameters

  • old_logprobs (torch.FloatTensor) — Log probabilities of the model at rollout time, shape [batch_size, response_length]
  • values (torch.FloatTensor) — Values of the value head at rollout time, shape [batch_size, response_length]
  • rewards (torch.FloatTensor) — Per-token rewards computed by compute_rewards, shape [batch_size, response_length]
  • logprobs (torch.FloatTensor) — Current log probabilities of the model, shape [batch_size, response_length]
  • logits (torch.FloatTensor) — Logits of the model, shape [batch_size, response_length, vocab_size]
  • vpreds (torch.FloatTensor) — Current values of the value head, shape [batch_size, response_length]
  • mask (torch.LongTensor) — Mask for the response tokens, shape [batch_size, response_length]

Returns

train_stats (dict[str, torch.Tensor])

Dictionary of training statistics

Train one PPO minibatch.

RewardTrainer

class trl.RewardTrainer

( model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module] = None args: TrainingArguments = None data_collator: typing.Optional[DataCollator] = None train_dataset: typing.Optional[datasets.arrow_dataset.Dataset] = None eval_dataset: typing.Union[datasets.arrow_dataset.Dataset, typing.Dict[str, datasets.arrow_dataset.Dataset], NoneType] = None tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None model_init: typing.Union[typing.Callable[[], transformers.modeling_utils.PreTrainedModel], NoneType] = None compute_metrics: typing.Union[typing.Callable[[transformers.trainer_utils.EvalPrediction], typing.Dict], NoneType] = None callbacks: typing.Optional[typing.List[transformers.trainer_callback.TrainerCallback]] = None optimizers: typing.Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None) preprocess_logits_for_metrics: typing.Union[typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor], NoneType] = None max_length: typing.Optional[int] = None peft_config: typing.Optional[typing.Dict] = None )

The RewardTrainer can be used to train your custom Reward Model. It is a subclass of the transformers.Trainer class and inherits all of its attributes and methods. It is recommended to use an AutoModelForSequenceClassification as the reward model. The reward model is trained on a dataset of paired examples, where each example is a tuple of two sequences, and it learns to predict which sequence in the pair is more relevant to the task at hand.

The reward trainer expects a very specific format for the dataset. When using the default RewardDataCollatorWithPadding data collator, the dataset should contain at least the following four entries (a usage sketch follows the list below):

  • input_ids_chosen
  • attention_mask_chosen
  • input_ids_rejected
  • attention_mask_rejected
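
A minimal, hedged sketch that produces those four columns from a toy two-example dataset and trains a gpt2-based reward model; all names and hyperparameters are illustrative.

from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from trl import RewardTrainer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

# Toy paired dataset: each row has a preferred ("chosen") and a less preferred ("rejected") answer.
pairs = {
    "chosen": ["The capital of France is Paris.", "2 + 2 equals 4."],
    "rejected": ["The capital of France is Rome.", "2 + 2 equals 5."],
}

def tokenize_pairs(example):
    chosen = tokenizer(example["chosen"], truncation=True)
    rejected = tokenizer(example["rejected"], truncation=True)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

train_dataset = Dataset.from_dict(pairs).map(tokenize_pairs, remove_columns=["chosen", "rejected"])

trainer = RewardTrainer(
    model=model,
    args=TrainingArguments(output_dir="reward_model", per_device_train_batch_size=2, remove_unused_columns=False),
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    max_length=512,
)
trainer.train()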

set_seed

trl.set_seed

( seed: int )

Parameters

  • seed (int) — The seed to set.

Helper function for reproducible behavior to set the seed in random, numpy, and torch.
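
For example:

from trl import set_seed

set_seed(42)  # makes random, numpy and torch draws reproducible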