At TRL we support PPO (Proximal Policy Optimisation) with an implementation that largely follows the structure introduced in the paper “Fine-Tuning Language Models from Human Preferences” by D. Ziegler et al. [paper, code]. The Trainer and model classes are largely inspired by the transformers.Trainer and transformers.AutoModel classes and adapted for RL.
PPOConfig( model_name: typing.Optional[str] = None, steps: typing.Optional[int] = 20000, learning_rate: typing.Optional[float] = 1e-05, adap_kl_ctrl: typing.Optional[bool] = True, init_kl_coef: typing.Optional[float] = 0.2, target: typing.Optional[float] = 6, horizon: typing.Optional[float] = 10000, gamma: typing.Optional[float] = 1, lam: typing.Optional[float] = 0.95, cliprange: typing.Optional[float] = 0.2, cliprange_value: typing.Optional[float] = 0.2, vf_coef: typing.Optional[float] = 0.1, batch_size: typing.Optional[int] = 256, forward_batch_size: typing.Optional[int] = 1, ppo_epochs: typing.Optional[int] = 4, remove_unused_columns: typing.Optional[bool] = True, log_with: typing.Optional[str] = None, tracker_kwargs: typing.Optional[dict] = {}, accelerator_kwargs: typing.Optional[dict] = {}, tracker_project_name: typing.Optional[str] = 'trl' )
Parameters
model_name (str, optional, defaults to None) — Name of the model to use; used only for tracking purposes.
steps (int, optional, defaults to 20000) — Number of training steps.
learning_rate (float, optional, defaults to 1e-5) — Adam learning rate.
adap_kl_ctrl (bool, optional, defaults to True) — Use adaptive KL control, otherwise linear.
init_kl_coef (float, optional, defaults to 0.2) — Initial KL penalty coefficient (used for adaptive and linear control).
target (float, optional, defaults to 6) — Target KL value for adaptive KL control.
horizon (float, optional, defaults to 10000) — Horizon for adaptive KL control.
gamma (float, optional, defaults to 1) — Gamma parameter for advantage calculation.
lam (float, optional, defaults to 0.95) — Lambda parameter for advantage calculation.
cliprange (float, optional, defaults to 0.2) — Range for clipping in the PPO policy gradient loss.
cliprange_value (float, optional, defaults to 0.2) — Range for clipping values in the loss calculation.
vf_coef (float, optional, defaults to 0.1) — Scaling factor for the value loss.
batch_size (int, optional, defaults to 256) — Number of samples per optimisation step.
forward_batch_size (int, optional, defaults to 1) — Number of samples forward passed through the model at a time.
ppo_epochs (int, optional, defaults to 4) — Number of optimisation epochs per batch of samples.
remove_unused_columns (bool, optional, defaults to True) — Remove unused columns from the dataset if a datasets.Dataset is used.
log_with (str, optional, defaults to None) — Log with either “wandb” or “tensorboard”; check https://huggingface.co/docs/accelerate/usage_guides/tracking for more details.
accelerator_kwargs (dict, optional, defaults to {}) — Keyword arguments for the accelerator (e.g. logging_dir).
tracker_kwargs (dict, optional, defaults to {}) — Keyword arguments for the tracker (e.g. wandb_project).
tracker_project_name (str, optional, defaults to “trl”) — Name of the project to use for tracking.
Configuration class for PPOTrainer
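For example, a configuration can be created by overriding only the fields you need, with everything else falling back to the defaults listed above (the values below are illustrative, not recommendations):

from trl import PPOConfig

config = PPOConfig(
    model_name="gpt2",      # used only for tracking purposes
    learning_rate=1e-5,
    batch_size=256,         # samples per optimisation step
    forward_batch_size=16,  # samples per forward pass
    ppo_epochs=4,
    log_with=None,          # or "wandb" / "tensorboard"
)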
PPOTrainer( config: PPOConfig = None, model: PreTrainedModelWrapper = None, ref_model: PreTrainedModelWrapper = None, tokenizer: typing.Union[transformers.tokenization_utils.PreTrainedTokenizer, transformers.tokenization_utils_fast.PreTrainedTokenizerFast] = None, dataset: typing.Union[torch.utils.data.dataset.Dataset, datasets.arrow_dataset.Dataset, NoneType] = None, optimizer: typing.Optional[torch.optim.optimizer.Optimizer] = None, data_collator = None, num_shared_layers: typing.Optional[int] = None, lr_scheduler: typing.Optional[torch.optim.lr_scheduler._LRScheduler] = None )
Parameters
config (PPOConfig) — Configuration object for PPOTrainer. Check the documentation of PPOConfig for more details.
model (PreTrainedModelWrapper) — Model to be optimized, a Hugging Face transformer model with a value head. Check the documentation of PreTrainedModelWrapper for more details.
ref_model (PreTrainedModelWrapper, optional) — Reference model to be used for the KL penalty, a Hugging Face transformer model with a causal language modelling head. Check the documentation of PreTrainedModelWrapper for more details. If no reference model is provided, the trainer will create a reference model with the same architecture as the model to be optimized, with shared layers.
tokenizer (Union[PreTrainedTokenizer, PreTrainedTokenizerFast]) — Tokenizer to be used for encoding the data. Check the documentation of transformers.PreTrainedTokenizer and transformers.PreTrainedTokenizerFast for more details.
dataset (Union[torch.utils.data.Dataset, datasets.Dataset], optional) — PyTorch dataset or Hugging Face dataset. This is used to create a PyTorch dataloader. If no dataset is provided, users need to design their own dataloader and make sure the batch size used is the same as the one specified in the configuration object.
optimizer (torch.optim.Optimizer, optional) — Optimizer to be used for training. If no optimizer is provided, the trainer will create an Adam optimizer with the learning rate specified in the configuration object.
lr_scheduler (torch.optim.lr_scheduler, optional) — Learning rate scheduler to be used for training.
The PPOTrainer uses Proximal Policy Optimization to optimise language models.
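A minimal sketch of constructing a trainer, assuming trl’s AutoModelForCausalLMWithValueHead wrapper (a PreTrainedModelWrapper with a value head) and a GPT-2 tokenizer; the model name and hyperparameters are illustrative only:

from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2", batch_size=16, forward_batch_size=4)

model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# ref_model=None: the trainer creates a reference model with shared layers itself
ppo_trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer)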
batched_forward_pass( queries: Tensor, responses: Tensor ) → (tuple)
Parameters
queries (torch.LongTensor) — List of tensors containing the encoded queries, shape (batch_size, query_length)
responses (torch.LongTensor) — List of tensors containing the encoded responses, shape (batch_size, response_length)
Returns
(tuple)
torch.FloatTensor: Log probabilities of the responses under the model, shape (batch_size, response_length)
torch.FloatTensor: Log probabilities of the responses under the reference model, shape (batch_size, response_length)
torch.FloatTensor: Values of the responses, shape (batch_size, response_length)
Calculate model outputs in multiple batches.
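This method is normally called internally by step, but if invoked directly the returned tuple unpacks in the order listed above. A sketch, reusing the ppo_trainer from the earlier snippet and assuming query_tensors / response_tensors are batches of encoded token ids:

# logprobs of the trained model, logprobs of the reference model, value estimates
logprobs, ref_logprobs, values = ppo_trainer.batched_forward_pass(query_tensors, response_tensors)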
compute_rewards( scores: FloatTensor, logprobs: FloatTensor, ref_logprobs: FloatTensor )
Compute per-token rewards from scores and KL-penalty.
create_model_card( path: str, model_name: typing.Optional[str] = 'TRL Model' )
Creates and saves a model card for a TRL model.
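For example, after training you might write a model card into an output directory (the path and name below are placeholders):

import os

output_dir = "my-ppo-model"           # hypothetical output directory
os.makedirs(output_dir, exist_ok=True)
ppo_trainer.create_model_card(output_dir, model_name="My PPO Model")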
gather_stats( stats ) → dict[str, Any]
Gather stats from all processes. Useful in the context of distributed training.
generate( query_tensor: Tensor, **generation_kwargs ) → torch.LongTensor
Generate a response with the model given the query tensor; calls the generate method of the model.
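A sketch of generating a single response, continuing the earlier snippet; the prompt and generation settings are placeholders, and the assumption that the output contains the query followed by the generated tokens is noted in the comments:

query_txt = "This morning I went to the"
query_tensor = tokenizer.encode(query_txt, return_tensors="pt").squeeze()  # 1-D tensor of token ids

generation_kwargs = {
    "max_new_tokens": 20,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}
response_tensor = ppo_trainer.generate(query_tensor, **generation_kwargs)

# Assumption: the output contains the query followed by the generated tokens,
# so keep only the newly generated part before decoding.
response_txt = tokenizer.decode(response_tensor.squeeze()[-20:])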
log_stats( stats: dict, batch: dict, rewards: typing.List[torch.FloatTensor] )
A function that logs all the training stats. Call it at the end of each epoch.
loss( old_logprobs: FloatTensor, values: FloatTensor, rewards: FloatTensor, query: LongTensor, response: LongTensor, model_input: LongTensor )
Parameters
old_logprobs (torch.FloatTensor) — Log probabilities of the model, shape (batch_size, response_length)
values (torch.FloatTensor) — Values of the value head, shape (batch_size, hidden_dim)
rewards (torch.FloatTensor) — Rewards from the reward model, shape (batch_size)
query (torch.LongTensor) — Encoded queries, shape (batch_size, query_length)
response (torch.LongTensor) — Encoded responses, shape (batch_size, response_length)
model_input (torch.LongTensor) — Concatenated queries and responses, shape (batch_size, query_length+response_length)
Calculate policy and value losses.
prepare_dataloader( dataset: typing.Union[torch.utils.data.dataset.Dataset, datasets.arrow_dataset.Dataset], data_collator = None ) → torch.utils.data.DataLoader
Parameters
dataset (Union[torch.utils.data.Dataset, datasets.Dataset]) — PyTorch dataset or Hugging Face dataset. If a Hugging Face dataset is passed, the dataset will be preprocessed by removing the columns that are not used by the model.
Returns
torch.utils.data.DataLoader — PyTorch dataloader
Prepare the dataloader for training.
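A sketch of building a dataloader from a Hugging Face dataset, continuing the earlier snippet; the dataset choice, truncation length and list-of-dicts collator are illustrative assumptions:

from datasets import load_dataset

# Hypothetical toy dataset: tokenise a text column into "input_ids" queries.
dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(lambda x: {"input_ids": tokenizer.encode(x["text"])[:8]})

def collator(data):
    # keep each column as a plain list so queries can stay variable-length
    return {key: [d[key] for d in data] for key in data[0]}

dataloader = ppo_trainer.prepare_dataloader(dataset, data_collator=collator)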
record_step_stats( kl_coef: float, **data ) → stats (dict)
Record training step statistics.
step( queries: typing.List[torch.LongTensor], responses: typing.List[torch.LongTensor], scores: typing.List[torch.FloatTensor] ) → dict[str, Any]
Parameters
queries (List[torch.LongTensor]) — List of tensors containing the encoded queries of shape (query_length)
responses (List[torch.LongTensor]) — List of tensors containing the encoded responses of shape (response_length)
scores (List[torch.FloatTensor]) — List of tensors containing the scores.
Returns
dict[str, Any] — A summary of the training statistics
Run a PPO optimisation step given a list of queries, model responses, and rewards.
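Putting the pieces together, a simplified rollout-score-optimise loop might look like the sketch below, continuing the earlier snippets; the constant reward stands in for a real reward model, and the assumption that generate returns the query followed by the response is carried over from above:

import torch

gen_len = 20
generation_kwargs = {"max_new_tokens": gen_len, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}

for batch in dataloader:
    query_tensors = [torch.tensor(ids) for ids in batch["input_ids"]]

    # Roll out: generate one response per query.
    response_tensors = []
    for query in query_tensors:
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze()[-gen_len:])  # keep only the new tokens

    # Decoded text is handy for logging.
    batch["query"] = [tokenizer.decode(q) for q in query_tensors]
    batch["response"] = [tokenizer.decode(r) for r in response_tensors]

    # Score: one scalar reward per sample; a real setup would query a reward model.
    rewards = [torch.tensor(1.0) for _ in response_tensors]

    # PPO optimisation step and logging.
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)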
train_minibatch( logprobs: FloatTensor, values: FloatTensor, rewards: FloatTensor, query: LongTensor, response: LongTensor, model_input: LongTensor ) → train_stats (dict[str, torch.Tensor])
Parameters
logprobs (torch.FloatTensor) — Log probabilities of the model, shape [batch_size, response_length]
values (torch.FloatTensor) — Values of the value head, shape [batch_size, response_length]
rewards (torch.FloatTensor) — Rewards from the reward model, shape [batch_size, response_length]
query (torch.LongTensor) — Encoded queries, shape [batch_size, query_length]
response (torch.LongTensor) — Encoded responses, shape [batch_size, response_length]
model_input (torch.LongTensor) — Concatenated queries and responses, shape [batch_size, query_length+response_length]
Returns
train_stats (dict[str, torch.Tensor]) — Dictionary of training statistics
Train one PPO minibatch.