The intuition behind PPO

The idea with Proximal Policy Optimization (PPO) is to improve the training stability of the policy by limiting the change we make to it at each training epoch: we want to avoid having too large of a policy update.

We do this for two reasons:

  • We know empirically that smaller policy updates during training are more likely to converge to an optimal solution.
  • A step that is too big can make the policy fall “off the cliff” (end up with a bad policy), from which it can take a long time, or even be impossible, to recover.
Figure: taking smaller policy updates improves training stability (modified version from RL — Proximal Policy Optimization (PPO) Explained by Jonathan Hui).

So with PPO, we update the policy conservatively. To do so, we need to measure how much the current policy changed compared to the former one, using a ratio between the current and former policy. We then clip this ratio to the range $[1 - \epsilon, 1 + \epsilon]$, which removes the incentive for the current policy to move too far from the old one (hence the term proximal policy).
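
To make the clipping concrete, here is a minimal sketch of the clipped surrogate objective in PyTorch. The names (`log_probs_new`, `log_probs_old`, `advantages`, `epsilon`) are illustrative assumptions rather than code from the course: they stand for the per-action log-probabilities under the current and former policies, the estimated advantages, and the clipping range.

```python
import torch


def ppo_clipped_objective(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    # Ratio between the current and former policy: r = pi_new(a|s) / pi_old(a|s),
    # computed in log-space for numerical stability.
    ratio = torch.exp(log_probs_new - log_probs_old)

    # Unclipped surrogate objective.
    unclipped = ratio * advantages

    # Clipped surrogate: the ratio is constrained to [1 - epsilon, 1 + epsilon],
    # removing the incentive to move the policy too far from the old one.
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages

    # PPO maximizes the minimum of the two surrogates; returning the negative
    # mean gives a loss that can be minimized with gradient descent.
    return -torch.min(unclipped, clipped).mean()
```

Computing the ratio as the exponential of a log-probability difference is a common implementation choice, since policy networks typically output log-probabilities directly.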