
Logging

As reinforcement learning algorithms are historically challenging to debug, it’s important to pay careful attention to logging. By default, the TRL PPOTrainer logs a lot of relevant information to wandb or tensorboard.

Upon initialization, pass one of these two options to the PPOConfig:

from trl import PPOConfig

config = PPOConfig(
    model_name=args.model_name,
    log_with="wandb",  # or "tensorboard"
)

If you want to log with tensorboard, add the kwarg project_kwargs={"logging_dir": PATH_TO_LOGS} to the PPOConfig.
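For example, a tensorboard setup might look like the following sketch; the model name and logging directory are placeholders, not values required by the library:

from trl import PPOConfig

config = PPOConfig(
    model_name="gpt2",                          # placeholder: any causal LM checkpoint
    log_with="tensorboard",
    project_kwargs={"logging_dir": "./ppo_logs"},  # placeholder path for the tensorboard logs
)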

PPO Logging

Crucial values

During training, many values are logged; here are the most important ones (a sketch of how they are produced follows the list):
  1. env/reward_mean, env/reward_std, env/reward_dist: the properties of the reward distribution from the “environment”.
  2. ppo/mean_scores: The mean scores directly out of the reward model.
  3. ppo/mean_non_score_reward: The mean negated KL penalty during training (shows the divergence between the new policy and the reference model, averaged over the batch in the step).
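These statistics come out of the PPO step and are sent to the tracker via PPOTrainer.log_stats. A minimal sketch of one training iteration, assuming you already have a PPOTrainer (ppo_trainer), query/response tensors, a batch dict of decoded text, and rewards from your own reward pipeline:

# query_tensors, response_tensors and rewards come from your own generation and
# reward-model code; batch holds the corresponding decoded "query" and "response" text.
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

# Pushes the env/*, ppo/* and objective/* metrics described above to wandb or tensorboard.
ppo_trainer.log_stats(stats, batch, rewards)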

Training stability parameters:

Here are some parameters that are useful to monitor for stability (when these diverge or collapse to 0, try tuning the training variables):
  1. ppo/loss/value: The value function loss; it will spike or go to NaN when training is not going well.
  2. ppo/val/clipfrac: The fraction of clipped values in the value function loss. This is typically between 0.3 and 0.6.
  3. objective/kl_coef: The KL penalty coefficient, adapted by the AdaptiveKLController. It often increases before numerical instabilities appear; the config sketch below shows the related settings.
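If objective/kl_coef keeps growing, the KL controller settings in PPOConfig are the usual place to look. A hedged sketch, assuming the adaptive controller and its PPOConfig fields (adap_kl_ctrl, init_kl_coef, target) as exposed in this version:

from trl import PPOConfig

config = PPOConfig(
    model_name=args.model_name,
    log_with="wandb",
    adap_kl_ctrl=True,   # use the AdaptiveKLController instead of a fixed coefficient
    init_kl_coef=0.2,    # starting value of objective/kl_coef
    target=6.0,          # target KL the controller adapts the coefficient towards
)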