Reward Modeling
TRL supports custom reward modeling, so anyone can train a reward model on their own dataset and model.
Expected dataset format
The reward trainer expects a very specific format for the dataset. Since the model will be trained to predict which of two sentences is the more relevant one, the dataset must consist of paired examples; the Anthropic/hh-rlhf dataset is one example of such a paired dataset.
Therefore, the final dataset object should contain at least these 4 entries if you use the default RewardDataCollatorWithPadding
data collator. The entries should be named:
input_ids_chosen
attention_mask_chosen
input_ids_rejected
attention_mask_rejected
The chosen
and rejected
suffixes are used to denote the two sentences in the paired dataset.
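As an illustration, a preprocessing step along the following lines produces these four entries. This is only a sketch: it assumes the chosen and rejected text columns of the Anthropic/hh-rlhf dataset and a GPT-2 tokenizer, both of which you would swap for your own data and model.
from datasets import load_dataset
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
dataset = load_dataset("Anthropic/hh-rlhf", split="train")
def preprocess(examples):
    # Tokenize the preferred ("chosen") and dispreferred ("rejected") completions
    # separately; the data collator pads and batches them later.
    chosen = tokenizer(examples["chosen"], truncation=True)
    rejected = tokenizer(examples["rejected"], truncation=True)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }
dataset = dataset.map(preprocess, batched=True)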
Using the RewardTrainer
After standardizing your dataset, you can use the RewardTrainer
as a classic Hugging Face Trainer.
You should pass an AutoModelForSequenceClassification model to the RewardTrainer.
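A minimal setup could then look like the following sketch; the model name, output directory, and batch size are illustrative, and dataset is assumed to have been preprocessed as described above.
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from trl import RewardTrainer
# A sequence-classification model with a single label produces one scalar score per sequence.
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 has no pad token by default; the padding collator needs one.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id
training_args = TrainingArguments(output_dir="reward_model", per_device_train_batch_size=8)
trainer = RewardTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=dataset,
)
trainer.train()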
Leveraging the peft library to train a reward model
Just pass a peft_config in the keyword arguments of RewardTrainer, and the trainer should automatically take care of converting the model into a PEFT model!
from peft import LoraConfig, TaskType
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from trl import RewardTrainer
model = AutoModelForSequenceClassification.from_pretrained("gpt2")
peft_config = LoraConfig(
task_type=TaskType.SEQ_CLS,
inference_mode=False,
r=8,
lora_alpha=32,
lora_dropout=0.1,
)
...
trainer = RewardTrainer(
model=model,
args=training_args,
tokenizer=tokenizer,
train_dataset=dataset,
peft_config=peft_config,
)
trainer.train()
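Once the model is wrapped with PEFT, only the lightweight LoRA adapters (and the classification head) are trained while the rest of the base model stays frozen, which considerably reduces memory usage; after training, you can save the result with the inherited trainer.save_model() method.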
RewardTrainer
class trl.RewardTrainer
(
  model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module] = None,
  args: TrainingArguments = None,
  data_collator: typing.Optional[DataCollator] = None,
  train_dataset: typing.Optional[datasets.arrow_dataset.Dataset] = None,
  eval_dataset: typing.Union[datasets.arrow_dataset.Dataset, typing.Dict[str, datasets.arrow_dataset.Dataset], NoneType] = None,
  tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None,
  model_init: typing.Union[typing.Callable[[], transformers.modeling_utils.PreTrainedModel], NoneType] = None,
  compute_metrics: typing.Union[typing.Callable[[transformers.trainer_utils.EvalPrediction], typing.Dict], NoneType] = None,
  callbacks: typing.Optional[typing.List[transformers.trainer_callback.TrainerCallback]] = None,
  optimizers: typing.Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None),
  preprocess_logits_for_metrics: typing.Union[typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor], NoneType] = None,
  max_length: typing.Optional[int] = None,
  peft_config: typing.Optional[typing.Dict] = None
)
The RewardTrainer can be used to train your custom Reward Model. It is a subclass of the
transformers.Trainer
class and inherits all of its attributes and methods. It is recommended to use
an AutoModelForSequenceClassification
as the reward model. The reward model should be trained on a dataset
of paired examples, where each example is a tuple of two sequences. The reward model should be trained to
predict which example in the pair is more relevant to the task at hand.
The reward trainer expects a very specific format for the dataset. The dataset should contain at least these 4 entries
if you use the default RewardDataCollatorWithPadding
data collator. The entries should be named
input_ids_chosen
attention_mask_chosen
input_ids_rejected
attention_mask_rejected