# Introduction

In the last unit, we learned about Deep Q-Learning. In this value-based deep reinforcement learning algorithm, we **used a deep neural network to approximate the Q-value of each possible action at a state.**

Since the beginning of the course, we have only studied value-based methods, **where we estimate a value function as an intermediate step towards finding an optimal policy.**

In value-based methods, the policy **(π) only exists because of the action-value estimates: the policy is just a function** (for instance, a greedy policy) that selects the action with the highest value in a given state.

With policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**
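
To make the contrast concrete, here is a minimal sketch in plain Python. The function names (`greedy_policy`, `softmax_policy`, `sample_action`) and the numbers are illustrative assumptions, not part of the course code. The first function derives an action from Q-value estimates, while the second represents the policy directly as a probability distribution over actions computed from its own parameters (called `preferences` here).

```python
import math
import random

def greedy_policy(q_values):
    """Value-based: the policy is derived from Q-value estimates (pick the argmax)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax_policy(preferences):
    """Policy-based: parameters are mapped directly to action probabilities."""
    m = max(preferences)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng=random):
    """Sample an action index according to the policy's probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical numbers for a state with two actions (e.g. left / right).
q_values = [0.2, 0.7]      # assumed Q-value estimates
preferences = [0.0, 1.0]   # assumed policy parameters

greedy_action = greedy_policy(q_values)   # always the highest-valued action
probs = softmax_policy(preferences)       # a distribution over actions
sampled_action = sample_action(probs)     # stochastic action choice
```

In a policy-based method, the quantities playing the role of `preferences` would typically come from a neural network whose weights are adjusted directly, rather than being read off a learned value function.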

So today, **we’ll learn about policy-based methods and study a subset of these methods called policy gradient**. Then we’ll implement our first policy gradient algorithm, Monte Carlo **Reinforce**, from scratch using PyTorch, and test its robustness using the CartPole-v1 and PixelCopter environments.

You’ll then be able to iterate and improve this implementation for more advanced environments.

Let’s get started!