TRL documentation
Logging
You are viewing v0.4.7 version.
			
				A newer version
					v0.24.0 is available.
Logging
As reinforcement learning algorithms are historically challenging to debug, it’s important to pay careful attention to logging.
By default, the TRL PPOTrainer saves a lot of relevant information to wandb or tensorboard.
Upon initialization, pass one of these two options to the PPOConfig:
config = PPOConfig(
    model_name=args.model_name,
    log_with=`wandb`, # or `tensorboard`
)If you want to log with tensorboard, add the kwarg project_kwargs={"logging_dir": PATH_TO_LOGS} to the PPOConfig.
PPO Logging
Crucial values
During training, many values are logged, here are the most important ones:- env/reward_mean,- env/reward_std,- env/reward_dist: the properties of the reward distribution from the “environment”.
- ppo/mean_scores: The mean scores directly out of the reward model.
- ppo/mean_non_score_reward: The mean negated KL penalty during training (shows the delta between the reference model and the new policy over the batch in the step)
Training stability parameters:
Here are some parameters that are useful to monitor for stability (when these diverge or collapse to 0, try tuning variables):- ppo/loss/value: The value function loss — will spike / NaN when not going well.
- ppo/val/clipfrac: The fraction of clipped values in the value function loss. This is often from 0.3 to 0.6.
- objective/kl_coef: The target coefficient with- AdaptiveKLController. Often increases before numerical instabilities.