Mid-way Recap
Before diving into Q-Learning, let’s summarize what we just learned.
We have two types of value functions:
- State-value function: outputs the expected return if the agent starts at a given state and acts according to the policy forever after.
- Action-value function: outputs the expected return if the agent starts in a given state, takes a given action at that state, and then acts according to the policy forever after.
- In value-based methods, rather than learning the policy directly, we define the policy by hand (for example, a greedy policy) and we learn a value function. If we have an optimal value function, we will have an optimal policy (see the sketch after this list).
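To make this concrete, here is a minimal sketch of deriving a greedy policy from a learned value function. The `q_table` values and the `greedy_policy` name are illustrative assumptions, not part of the course code; in practice the Q-values would come from a learning algorithm such as Q-Learning.

```python
import numpy as np

# Hypothetical Q-table: one row per state, one column per action.
# These values are made up for illustration.
q_table = np.array([
    [0.1, 0.5],  # state 0: action 1 has the higher estimated value
    [0.7, 0.2],  # state 1: action 0 has the higher estimated value
])

def greedy_policy(q_table, state):
    """The policy we define by hand: always pick the action with the highest Q-value."""
    return int(np.argmax(q_table[state]))

print(greedy_policy(q_table, state=0))  # -> 1
print(greedy_policy(q_table, state=1))  # -> 0
```

The policy itself is never trained: once the value function is optimal, acting greedily with respect to it gives the optimal policy.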
There are two types of methods to learn a value function:
- With the Monte Carlo method, we update the value function from a complete episode, so we use the actual, accurate discounted return of this episode.
- With the TD Learning method, we update the value function after a single step, so we replace the actual return, which we don't have yet, with an estimated return called the TD target: the immediate reward plus the discounted value estimate of the next state (see the sketch below).
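As a rough sketch of the difference between the two update rules, assuming a state-value table `V`, and with `gamma` (discount) and `alpha` (learning rate) chosen as illustrative defaults rather than course-specified values:

```python
def monte_carlo_update(V, episode, gamma=0.99, alpha=0.1):
    """Monte Carlo: wait until the episode ends, then use the actual
    discounted return observed from each step onward.
    `episode` is a list of (state, reward) pairs from a complete episode."""
    G = 0.0
    # Walk the episode backwards to accumulate the discounted return.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] = V[state] + alpha * (G - V[state])
    return V

def td_update(V, state, reward, next_state, gamma=0.99, alpha=0.1):
    """TD(0): after a single step, replace the unknown return with the
    TD target: immediate reward plus discounted value of the next state."""
    td_target = reward + gamma * V[next_state]
    V[state] = V[state] + alpha * (td_target - V[state])
    return V
```

Monte Carlo needs the whole episode before it can update; TD Learning can update at every step by bootstrapping from its own current estimate.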
