Mid-way Recap

Before diving into Q-Learning, let’s summarize what we just learned.

We have two types of value-based functions:

  • State-value function: outputs the expected return if the agent starts at a given state and acts according to the policy forever after.
  • Action-value function: outputs the expected return if the agent starts in a given state, takes a given action at that state, and then acts according to the policy forever after.

In value-based methods, rather than learning the policy, we define the policy by hand and learn a value function. If we have an optimal value function, we will have an optimal policy, since we can simply act greedily with respect to the values (see the sketch below).
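
Here is a minimal sketch of that idea, assuming a hypothetical, already-learned Q-table for a tiny environment with 4 states and 2 actions (the numbers are made up for illustration). The policy itself is never learned: it is defined directly from the action-value function by always picking the highest-value action.

```python
import numpy as np

# Hypothetical action-value table:
# Q[state, action] = expected return if we take `action` in `state`
# and then act according to the policy forever after.
Q = np.array([
    [1.0, 0.5],
    [0.2, 0.9],
    [0.7, 0.7],
    [0.0, 0.3],
])

def greedy_policy(state: int) -> int:
    """Define the policy by hand: act greedily with respect to Q,
    i.e. pick the action with the highest estimated return."""
    return int(np.argmax(Q[state]))

for state in range(Q.shape[0]):
    print(f"state {state} -> action {greedy_policy(state)}")
```

If the Q-values are optimal, this greedy policy is optimal too, which is exactly why learning the value function is enough.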

There are two methods for learning this value function:

  • With the Monte Carlo method, we update the value function from a complete episode, so we use the actual, accurate discounted return of that episode.
  • With the TD Learning method, we update the value function from a single step, replacing $G_t$, which we don't have yet, with an estimated return called the TD target. Both update rules are sketched below.
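
The following is a minimal sketch contrasting the two updates on a state-value table `V` (not the course's exact implementation); the learning rate `alpha`, discount factor `gamma`, and the tiny example episode are assumed values for illustration.

```python
# V maps each state to its estimated value, defaulting to 0.0.
alpha, gamma = 0.1, 0.99
V = {}

def monte_carlo_update(episode):
    """Update from a COMPLETE episode: compute the actual discounted
    return G_t backwards from the recorded (state, reward) pairs."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G  # discounted return from this step onwards
        V[state] = V.get(state, 0.0) + alpha * (G - V.get(state, 0.0))

def td_update(state, reward, next_state):
    """Update from a SINGLE step: G_t is unknown, so replace it with
    the TD target r + gamma * V(next_state)."""
    td_target = reward + gamma * V.get(next_state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * (td_target - V.get(state, 0.0))

# Example usage: one tiny episode of (state, reward) pairs,
# then a single-step TD update.
monte_carlo_update([("s0", 0.0), ("s1", 0.0), ("s2", 1.0)])
td_update("s0", 0.0, "s1")
print(V)
```

Note the trade-off: Monte Carlo must wait until the episode ends but uses the true return, while TD Learning updates immediately after each step but relies on an estimate.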
[Figure: summary of value-based methods]