Deep RL Course documentation

Monte Carlo vs Temporal Difference Learning


The last thing we need to discuss before diving into Q-Learning is the two learning strategies.

Remember that an RL agent learns by interacting with its environment. The idea is that given the experience and the received reward, the agent will update its value function or policy.

Monte Carlo and Temporal Difference Learning are two different strategies on how to train our value function or our policy function. Both of them use experience to solve the RL problem.

On one hand, Monte Carlo uses an entire episode of experience before learning. On the other hand, Temporal Difference uses only a single step (S_t, A_t, R_{t+1}, S_{t+1}) to learn.

We’ll explain both of them using a value-based method example.

Monte Carlo: learning at the end of the episode

Monte Carlo waits until the end of the episode, calculates the return G_t, and uses it as a target for updating V(S_t).

So it requires a complete episode of interaction before updating our value function.


If we take an example:

  • We always start the episode at the same starting point.

  • The agent takes actions using the policy. For instance, using an Epsilon Greedy Strategy, a policy that alternates between exploration (random actions) and exploitation.

  • We get the reward and the next state.

  • We terminate the episode if the cat eats the mouse or if the mouse takes more than 10 steps.

  • At the end of the episode, we have a list of (State, Action, Reward, Next State) tuples. For instance: [[State tile 3 bottom, Go Left, +1, State tile 2 bottom], [State tile 2 bottom, Go Left, +0, State tile 1 bottom]…]

  • The agent will sum the total rewards G_t (to see how well it did).

  • It will then update V(S_t) based on the formula:

    V(S_t) ← V(S_t) + lr * [G_t − V(S_t)]

  • Then it starts a new game with this new knowledge.

By running more and more episodes, the agent will learn to play better and better.
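The loop above can be sketched in Python. This is a minimal sketch, not the course's implementation: `env`, `policy`, and the `(state, action, reward, next_state)` layout are hypothetical stand-ins.

```python
def run_episode(env, policy, max_steps=10):
    """Collect one episode of (state, action, reward, next_state) tuples."""
    episode = []
    state = env.reset()  # we always start at the same starting point
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        episode.append((state, action, reward, next_state))
        if done:  # e.g. the cat ate the mouse
            break
        state = next_state
    return episode


def monte_carlo_update(V, episode, lr=0.1, gamma=1.0):
    """After the episode ends, move each V(S_t) toward the return G_t."""
    G = 0.0
    # Walk the episode backwards so the return can be accumulated step by step:
    # G_t = R_{t+1} + gamma * G_{t+1}
    for state, _action, reward, _next_state in reversed(episode):
        G = reward + gamma * G
        V[state] = V[state] + lr * (G - V[state])
    return V
```

Note that `monte_carlo_update` only runs once the whole episode has been collected: that is the defining trait of Monte Carlo learning.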


For instance, if we train a state-value function using Monte Carlo:

  • We just started to train our value function, so it returns a value of 0 for each state.

  • Our learning rate (lr) is 0.1 and our discount rate is 1 (= no discount)

  • Our mouse explores the environment and takes random actions

  • The mouse took more than 10 steps, so the episode ends.

  • We have a list of (state, action, reward, next state) tuples; we need to calculate the return G_t.

  • G_t = R_{t+1} + R_{t+2} + R_{t+3} + … (for simplicity, we don't discount the rewards)

  • G_t = 1 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 0

  • G_t = 3

  • We can now update V(S_0):

  • New V(S_0) = V(S_0) + lr * [G_t − V(S_0)]

  • New V(S_0) = 0 + 0.1 * [3 − 0]

  • New V(S_0) = 0.3
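We can verify this arithmetic with a few lines of Python (the rewards are the ones from the episode above):

```python
# Rewards collected during the episode above
rewards = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]

lr = 0.1  # learning rate
# The discount rate is 1, so the return is simply the sum of the rewards.

G = sum(rewards)
print(G)  # 3

# Monte Carlo update of V(S_0), which started at 0
V_s0 = 0.0
V_s0 = V_s0 + lr * (G - V_s0)
print(round(V_s0, 2))  # 0.3
```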


Temporal Difference Learning: learning at each step

Temporal Difference, on the other hand, waits for only one interaction (one step) S_{t+1} to form a TD target, and updates V(S_t) using R_{t+1} and γ * V(S_{t+1}).

The idea with TD is to update V(S_t) at each step.

But because we didn't experience an entire episode, we don't have G_t (the expected return). Instead, we estimate G_t by adding R_{t+1} and the discounted value of the next state.

This is called bootstrapping, because TD bases part of its update on an existing estimate V(S_{t+1}) and not on a complete sample G_t.

V(S_t) ← V(S_t) + lr * [R_{t+1} + γ * V(S_{t+1}) − V(S_t)]

This method is called TD(0) or one-step TD (update the value function after any individual step).
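A one-step TD(0) update can be sketched as follows. This is a minimal sketch; the dict-based value table `V` is an assumption, not the course's code:

```python
def td0_update(V, state, reward, next_state, lr=0.1, gamma=1.0):
    """One-step TD(0): move V(S_t) toward the TD target R_{t+1} + gamma * V(S_{t+1})."""
    td_target = reward + gamma * V[next_state]  # bootstrapped estimate of G_t
    td_error = td_target - V[state]
    V[state] = V[state] + lr * td_error
    return V
```

Unlike the Monte Carlo version, this function can be called after every single step, using only the transition just observed.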


If we take the same example,

  • We just started to train our value function, so it returns a value of 0 for each state.

  • Our learning rate (lr) is 0.1, and our discount rate is 1 (no discount).

  • Our mouse explores the environment and takes a random action: going to the left.

  • It gets a reward R_{t+1} = 1 since it eats a piece of cheese.


We can now update V(S_0):

New V(S_0) = V(S_0) + lr * [R_1 + γ * V(S_1) − V(S_0)]

New V(S_0) = 0 + 0.1 * [1 + 1 * 0 − 0]

New V(S_0) = 0.1

So we just updated our value function for State 0.
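Checking the numbers from this update in Python:

```python
lr, gamma = 0.1, 1.0

V_s0 = 0.0  # value of State 0 before the update
V_s1 = 0.0  # value of State 1 (still 0: we just started training)
R_1 = 1     # reward for eating the piece of cheese

td_target = R_1 + gamma * V_s1         # 1 + 1 * 0 = 1
V_s0 = V_s0 + lr * (td_target - V_s0)  # 0 + 0.1 * (1 - 0)
print(V_s0)  # 0.1
```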

Now we continue to interact with this environment with our updated value function.


If we summarize:

  • With Monte Carlo, we update the value function from a complete episode, and so we use the actual accurate discounted return of this episode.

  • With TD Learning, we update the value function from a single step, so we replace G_t, which we don't have, with an estimated return called the TD target.