In Unit 6, we learned about Advantage Actor Critic (A2C), a hybrid architecture combining value-based and policy-based methods that helps to stabilize the training by reducing the variance with:
- An Actor that controls how our agent behaves (policy-based method).
- A Critic that measures how good the action taken is (value-based method).
Today we’ll learn about Proximal Policy Optimization (PPO), an architecture that improves our agent’s training stability by avoiding policy updates that are too large. To do that, we use a ratio that indicates the difference between our current and old policy and clip this ratio to a specific range [1 - ε, 1 + ε].
Doing this will ensure that our policy update will not be too large and that the training is more stable.
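To make the clipping idea concrete, here is a minimal sketch of the clipped objective for a single sample. It is not the CleanRL implementation: the function name `clipped_surrogate` is hypothetical, and `clip_eps=0.2` is an assumed value for ε, chosen only for illustration.

```python
import math


def clipped_surrogate(new_logprob, old_logprob, advantage, clip_eps=0.2):
    """Per-sample clipped objective (a value we want to maximize).

    Hypothetical helper for illustration; clip_eps=0.2 is an assumed value.
    """
    # Ratio between the current policy and the old policy for the taken action.
    ratio = math.exp(new_logprob - old_logprob)
    # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Use the more pessimistic of the unclipped and clipped estimates.
    return min(ratio * advantage, clipped_ratio * advantage)


# The new policy makes a good action (advantage > 0) 1.5x more likely:
# the objective is capped as if the ratio were only 1 + clip_eps.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)

# The new policy makes a bad action (advantage < 0) 1.5x more likely:
# the penalty is not clipped, so the update can still correct the mistake.
penalized = clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0)
```

In the first case the objective stops growing once the ratio leaves the range, so there is no incentive to push the policy further away from the old one in a single update.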
This Unit is in two parts:
- In this first part, you’ll learn the theory behind PPO and code your PPO agent from scratch using the CleanRL implementation. To test its robustness, you’ll use LunarLander-v2. LunarLander-v2 is the first environment you used when you started this course. At that time, you didn’t know how PPO worked, and now you can code it from scratch and train it. How incredible is that? 🤩
- In the second part, we’ll get deeper into PPO optimization by using Sample-Factory to train an agent that plays ViZDoom (an open source version of Doom).
Sound exciting? Let’s get started! 🚀