Deep RL Course documentation




Unit 8

In Unit 6, we learned about Advantage Actor Critic (A2C), a hybrid architecture combining value-based and policy-based methods that helps to stabilize the training by reducing the variance with:

  • An Actor that controls how our agent behaves (policy-based method).
  • A Critic that measures how good the action taken is (value-based method).

Today we’ll learn about Proximal Policy Optimization (PPO), an architecture that improves our agent’s training stability by avoiding policy updates that are too large. To do that, we use a ratio that indicates the difference between our current and old policy and clip this ratio to a specific range $[1 - \epsilon, 1 + \epsilon]$.

Doing this will ensure that our policy update will not be too large and that the training is more stable.
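To make the clipping idea concrete, here is a minimal sketch in plain Python. It is not the CleanRL implementation: the function names, the log-probability inputs, and the default `epsilon=0.2` are assumptions chosen for illustration. The second function follows the clipped surrogate objective as it is usually written for PPO (the minimum of the unclipped and clipped terms, each multiplied by the advantage).

```python
import math


def clipped_ratio(new_log_prob, old_log_prob, epsilon=0.2):
    """Return (ratio, clipped_ratio) for a single action.

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    epsilon=0.2 is an illustrative default, not a value from this unit.
    """
    ratio = math.exp(new_log_prob - old_log_prob)
    # Keep the ratio inside [1 - epsilon, 1 + epsilon].
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return ratio, clipped


def clipped_surrogate(new_log_prob, old_log_prob, advantage, epsilon=0.2):
    """Clipped surrogate objective for one sample (to be maximized)."""
    ratio, clipped = clipped_ratio(new_log_prob, old_log_prob, epsilon)
    # Taking the minimum removes the incentive to move the ratio
    # further outside the clipping range.
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # The new policy makes the action much more likely than the old one did.
    ratio, clipped = clipped_ratio(math.log(0.9), math.log(0.5))
    print(f"ratio={ratio:.2f}, clipped={clipped:.2f}")
    print(clipped_surrogate(math.log(0.9), math.log(0.5), advantage=1.0))
```

With a positive advantage, a ratio above `1 + epsilon` stops increasing the objective, so the update gains nothing from pushing the new policy even further from the old one.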

This Unit is in two parts:

  • In this first part, you’ll learn the theory behind PPO and code your PPO agent from scratch using the CleanRL implementation. To test its robustness, you’ll use LunarLander-v2, the first environment you used when you started this course. At that time, you didn’t know how PPO worked, and now you can code it from scratch and train it. How incredible is that? 🤩
  • In the second part, we’ll go deeper into PPO optimization by using Sample-Factory to train an agent that plays ViZDoom (an open-source version of Doom).

These are the environments you're going to use to train your agents:

[Image: VizDoom environments]

Sound exciting? Let’s get started! 🚀