
Logging

As reinforcement learning algorithms are historically challenging to debug, it’s important to pay careful attention to logging. By default, the TRL PPOTrainer logs a lot of relevant information to wandb or tensorboard.

Upon initialization, pass one of these two options to the PPOConfig:

from trl import PPOConfig

config = PPOConfig(
    model_name=args.model_name,
    log_with="wandb",  # or "tensorboard"
)

If you want to log with tensorboard, add the kwarg project_kwargs={"logging_dir": PATH_TO_LOGS} to the PPOConfig.
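For example, a tensorboard setup might look like the following sketch; the model name and logging directory are placeholders, not values required by the library:

from trl import PPOConfig

config = PPOConfig(
    model_name="gpt2",                          # placeholder: any causal LM checkpoint
    log_with="tensorboard",
    project_kwargs={"logging_dir": "./ppo_logs"},  # placeholder path for the tensorboard logs
)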

PPO Logging

Crucial values

During training, many values are logged; here are the most important ones (a sketch of how they are produced follows the list):
  1. env/reward_mean, env/reward_std, env/reward_dist: the properties of the reward distribution from the “environment”.
  2. ppo/mean_scores: The mean scores directly out of the reward model.
  3. ppo/mean_non_score_reward: The mean negated KL penalty during training (shows the divergence between the new policy and the reference model, averaged over the batch in the step).
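These statistics come out of the PPO step and are sent to the tracker via PPOTrainer.log_stats. A minimal sketch of one training iteration, assuming you already have a PPOTrainer (ppo_trainer), query/response tensors, a batch dict of decoded text, and rewards from your own reward pipeline:

# query_tensors, response_tensors and rewards come from your own generation and
# reward-model code; batch holds the corresponding decoded "query" and "response" text.
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

# Pushes the env/*, ppo/* and objective/* metrics described above to wandb or tensorboard.
ppo_trainer.log_stats(stats, batch, rewards)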

Training stability parameters:

Here are some parameters that are useful to monitor for stability (when these diverge or collapse to 0, try tuning the training variables):
  1. ppo/loss/value: The value function loss; it will spike or go to NaN when training is not going well.
  2. ppo/val/clipfrac: The fraction of clipped values in the value function loss. This is typically between 0.3 and 0.6.
  3. objective/kl_coef: The KL penalty coefficient, adapted by the AdaptiveKLController. It often increases before numerical instabilities appear; the config sketch below shows the related settings.
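If objective/kl_coef keeps growing, the KL controller settings in PPOConfig are the usual place to look. A hedged sketch, assuming the adaptive controller and its PPOConfig fields (adap_kl_ctrl, init_kl_coef, target) as exposed in this version:

from trl import PPOConfig

config = PPOConfig(
    model_name=args.model_name,
    log_with="wandb",
    adap_kl_ctrl=True,   # use the AdaptiveKLController instead of a fixed coefficient
    init_kl_coef=0.2,    # starting value of objective/kl_coef
    target=6.0,          # target KL the controller adapts the coefficient towards
)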