Two main approaches for solving RL problems

Now that we have learned the RL framework, how do we solve the RL problem?

In other words, how do we build an RL agent that can select the actions that maximize its expected cumulative reward?

The Policy π: the agent’s brain

The Policy π is the brain of our Agent: it’s the function that tells us what action to take given the state we are in. It defines the agent’s behavior at a given time.

Think of the policy as the brain of our agent: the function that tells us which action to take given a state.

This Policy is the function we want to learn. Our goal is to find the optimal policy π*, the policy that maximizes expected return when the agent acts according to it. We find this π* through training.
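Written out (a standard formulation, where γ is the discount factor and R_{t+1}, R_{t+2}, … are the rewards collected while following π), the optimal policy is the one that maximizes the expected discounted return:

$$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\big[R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots\big]$$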

There are two approaches to train our agent to find this optimal policy π*:

  • Directly, by teaching the agent to learn which action to take, given the current state: Policy-Based Methods.
  • Indirectly, by teaching the agent to learn which state is more valuable and then take the action that leads to the more valuable states: Value-Based Methods.

Policy-Based Methods

In Policy-Based methods, we learn a policy function directly.

This function will define a mapping from each state to the best corresponding action. Alternatively, it could define a probability distribution over the set of possible actions at that state.

As we can see here, the (deterministic) policy directly indicates the action to take at each step.

We have two types of policies:

  • Deterministic: a policy at a given state will always return the same action.

    action = policy(state)

  • Stochastic: outputs a probability distribution over actions (both types are sketched in code after this list).

    policy(action | state) = probability distribution over the set of actions given the current state

Given an initial state, our stochastic policy will output a probability distribution over the possible actions at that state.
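To make the distinction concrete, here is a minimal sketch (not the course’s code): a toy deterministic policy and a toy stochastic policy over three discrete actions. The lookup tables and names (DETERMINISTIC_TABLE, STOCHASTIC_TABLE) are illustrative stand-ins for what a learned policy would produce.

```python
# A minimal sketch (not the course's code): a deterministic and a stochastic
# policy over three discrete actions, for a toy state space {"s0", "s1"}.
# All names and numbers are illustrative, not taken from the course.
import numpy as np

ACTIONS = [0, 1, 2]

# Deterministic policy: a fixed mapping state -> action.
DETERMINISTIC_TABLE = {"s0": 2, "s1": 0}

def deterministic_policy(state):
    # action = policy(state): the same state always gives the same action.
    return DETERMINISTIC_TABLE[state]

# Stochastic policy: a mapping state -> probability distribution over actions.
STOCHASTIC_TABLE = {"s0": [0.7, 0.2, 0.1], "s1": [0.1, 0.3, 0.6]}

def stochastic_policy(state, rng):
    probs = STOCHASTIC_TABLE[state]       # policy(action | state)
    return rng.choice(ACTIONS, p=probs)   # sample an action from that distribution

rng = np.random.default_rng(0)
print(deterministic_policy("s0"))    # always 2
print(stochastic_policy("s0", rng))  # usually 0, sometimes 1 or 2
```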

If we recap: in Policy-Based methods, we learn the policy function directly, as a mapping from each state to an action (or to a probability distribution over actions at that state).

Value-Based Methods

In value-based methods, instead of learning a policy function, we learn a value function that maps a state to the expected value of being at that state.

The value of a state is the expected discounted return the agent can get if it starts in that state, and then acts according to our policy.
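Written as an equation (a standard formulation; γ is the discount factor and R_{t+1}, R_{t+2}, … are the future rewards), the value of state s under policy π is:

$$V_{\pi}(s) = \mathbb{E}_{\pi}\big[R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots \mid S_t = s\big]$$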

“Act according to our policy” here simply means that our policy is “go to the state with the highest value”.

Here we see that our value function defines a value for each possible state.

Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function: -7, then -6, then -5 (and so on) to attain the goal.
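Here is a minimal sketch (not the course’s implementation) of that idea: acting greedily with respect to a state-value function on a tiny grid. The STATE_VALUES table and the reachable_states helper are hypothetical stand-ins for a learned value function and the environment’s transition model.

```python
# A minimal sketch (not the course's implementation): acting greedily with
# respect to a state-value function on a tiny grid.

# Value of each state, e.g. negative number of steps remaining to the goal.
STATE_VALUES = {
    (0, 0): -7, (0, 1): -6, (0, 2): -5,
    (1, 0): -8, (1, 2): -4, (2, 2): -3,
}

def reachable_states(state):
    """Hypothetical helper: states reachable in one step (up/down/left/right)."""
    x, y = state
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [s for s in candidates if s in STATE_VALUES]

def greedy_policy(state):
    # The "policy" of a value-based method: move toward the reachable state
    # with the highest value.
    return max(reachable_states(state), key=lambda s: STATE_VALUES[s])

print(greedy_policy((0, 0)))  # (0, 1): its value -6 beats (1, 0) at -8
```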

If we recap: in Value-Based methods, instead of training the policy directly, we learn a value function that maps each state to its expected value, and the policy simply takes the action that leads to the state with the highest value.