# Monte Carlo vs Temporal Difference Learning

The last thing we need to discuss before diving into Q-Learning is the two learning strategies.

Remember that an RL agent **learns by interacting with its environment.** The idea is that **given the experience and the received reward, the agent will update its value function or policy.**

Monte Carlo and Temporal Difference Learning are two different **strategies on how to train our value function or our policy function.** Both of them **use experience to solve the RL problem.**

On one hand, Monte Carlo uses **an entire episode of experience before learning.** On the other hand, Temporal Difference uses **only a step ( $S_t, A_t, R_{t+1}, S_{t+1}$ ) to learn.**

We’ll explain both of them **using a value-based method example.**

## Monte Carlo: learning at the end of the episode

Monte Carlo waits until the end of the episode, calculates $G_t$ (return) and uses it as **a target for updating $V(S_t)$.**

So it requires a **complete episode of interaction before updating our value function.**

If we take an example:

- We always start the episode **at the same starting point.**
- **The agent takes actions using the policy.** For instance, using an Epsilon Greedy Strategy, a policy that alternates between exploration (random actions) and exploitation.
- We get **the reward and the next state.**
- We terminate the episode if the cat eats the mouse or if the mouse moves > 10 steps.
- At the end of the episode, **we have a list of State, Actions, Rewards, and Next States tuples.** For instance [[State tile 3 bottom, Go Left, +1, State tile 2 bottom], [State tile 2 bottom, Go Left, +0, State tile 1 bottom]…]
- **The agent will sum the total rewards $G_t$** (to see how well it did).
- It will then **update $V(S_t)$ based on the formula** New $V(S_t) = V(S_t) + lr * [G_t - V(S_t)]$, where lr is the learning rate.
- Then **start a new game with this new knowledge.**

By running more and more episodes, **the agent will learn to play better and better.**
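The episode loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `epsilon_greedy`, `collect_episode`, and the `env_step` callback are hypothetical names, and the action values passed to `epsilon_greedy` are an assumption made only to show how exploration and exploitation alternate.

```python
import random

def epsilon_greedy(action_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the best-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))  # explore
    return max(range(len(action_values)), key=lambda a: action_values[a])  # exploit

def collect_episode(env_step, start_state, choose_action, max_steps=10):
    """Run one episode and return a list of (state, action, reward, next_state).

    env_step is a hypothetical callback: (state, action) -> (reward, next_state, done).
    The episode stops when done is True or after max_steps steps.
    """
    episode = []
    state = start_state
    for _ in range(max_steps):
        action = choose_action(state)
        reward, next_state, done = env_step(state, action)
        episode.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return episode
```

Only once `collect_episode` has returned does Monte Carlo have everything it needs to compute the return and update the value function.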

For instance, if we train a state-value function using Monte Carlo:

- We just started to train our value function, **so it returns 0 value for each state.**
- Our learning rate (lr) is 0.1 and our discount rate is 1 (= no discount).
- Our mouse **explores the environment and takes random actions.**
- The mouse made more than 10 steps, so the episode ends.
- We have a list of state, action, rewards, next_state, so **we need to calculate the return $G_t$:**

$G_t = R_{t+1} + R_{t+2} + R_{t+3} ...$ (for simplicity we don’t discount the rewards).

$G_t = 1 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 0$

$G_t = 3$
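The same sum can be checked in code. The sketch below uses a hypothetical helper, `compute_return`, that also accepts a discount rate `gamma`; with `gamma=1.0` it reduces to the plain sum used in this example.

```python
def compute_return(rewards, gamma=1.0):
    """Return G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    # Accumulate from the last reward backwards so each step applies one discount.
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g

rewards = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]  # rewards from the episode above
print(compute_return(rewards))  # 3.0
```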

We can now update $V(S_0)$:

New $V(S_0) = V(S_0) + lr * [G_t - V(S_0)]$

New $V(S_0) = 0 + 0.1 * [3 - 0]$

New $V(S_0) = 0.3$
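As a quick check of this arithmetic, here is a minimal sketch of the Monte Carlo update. It assumes the value function is stored in a plain dictionary; `mc_update` and the state label `"S0"` are illustrative names.

```python
def mc_update(value, state, g, lr=0.1):
    """Move V(state) a fraction lr of the way toward the observed return g."""
    value[state] = value[state] + lr * (g - value[state])
    return value[state]

V = {"S0": 0.0}  # freshly initialised value function
new_v = mc_update(V, "S0", g=3.0, lr=0.1)
print(round(new_v, 4))  # 0.3
```

The result is rounded before printing because floating-point arithmetic may not give exactly 0.3.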

## Temporal Difference Learning: learning at each step

**Temporal Difference, on the other hand, waits for only one interaction (one step) $S_{t+1}$** to form a TD target and update $V(S_t)$ using $R_{t+1}$ and $\gamma * V(S_{t+1})$.

The idea with **TD is to update the $V(S_t)$ at each step.**

But because we didn’t experience an entire episode, we don’t have $G_t$ (the return). Instead, **we estimate $G_t$ by adding $R_{t+1}$ and the discounted value of the next state.**

This is called bootstrapping. It’s called this **because TD bases its update in part on an existing estimate $V(S_{t+1})$ and not a complete sample $G_t$.**

This method is called TD(0) or **one-step TD (update the value function after any individual step).**
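Written out with the same notation as the Monte Carlo update (lr for the learning rate, $\gamma$ for the discount rate), the TD target and the TD(0) update are:

TD target $= R_{t+1} + \gamma * V(S_{t+1})$

New $V(S_t) = V(S_t) + lr * [R_{t+1} + \gamma * V(S_{t+1}) - V(S_t)]$

This is the Monte Carlo update with $G_t$ replaced by the TD target.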

If we take the same example:

- We just started to train our value function, so it returns 0 value for each state.
- Our learning rate (lr) is 0.1, and our discount rate is 1 (no discount).
- Our mouse explores the environment and takes a random action: **going to the left.**
- It gets a reward $R_{t+1} = 1$ since **it eats a piece of cheese.**

We can now update $V(S_0)$:

New $V(S_0) = V(S_0) + lr * [R_1 + \gamma * V(S_1) - V(S_0)]$

New $V(S_0) = 0 + 0.1 * [1 + 1 * 0 - 0]$

New $V(S_0) = 0.1$
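The same single-step update can be sketched in Python. As before, this assumes a dictionary-backed value function; `td0_update` and the state labels `"S0"` and `"S1"` are illustrative names.

```python
def td0_update(value, state, reward, next_state, lr=0.1, gamma=1.0):
    """One TD(0) update of V(state) from a single (S_t, R_{t+1}, S_{t+1}) step."""
    td_target = reward + gamma * value[next_state]  # bootstrapped estimate of G_t
    td_error = td_target - value[state]
    value[state] = value[state] + lr * td_error
    return value[state]

V = {"S0": 0.0, "S1": 0.0}  # freshly initialised value function
new_v = td0_update(V, "S0", reward=1.0, next_state="S1")
print(new_v)  # 0.1
```

Unlike the Monte Carlo update, this function needs only one transition, so it can be called after every step rather than at the end of the episode.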

So we just updated our value function for State 0.

Now we **continue to interact with this environment with our updated value function.**

If we summarize:

- With *Monte Carlo*, we update the value function from a complete episode, and so we **use the actual accurate discounted return of this episode.**
- With *TD Learning*, we update the value function from a step, so we replace $G_t$ that we don’t have with **an estimated return called TD target.**