
(Optional) the Policy Gradient Theorem

In this optional section, we're going to study how we differentiate the objective function that we will use to approximate the policy gradient.

Let’s first recap our different formulas:

  1. The objective function: $J(\theta) = E_{\tau \sim \pi_\theta}[R(\tau)] = \sum_{\tau} P(\tau;\theta)R(\tau)$
  2. The probability of a trajectory (given that actions come from $\pi_\theta$; a small numeric sketch of both quantities follows this list): $P(\tau;\theta) = \mu(s_0) \prod_{t=0}^{H} P(s_{t+1}|s_{t}, a_{t}) \pi_\theta(a_{t}|s_{t})$
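
To make these two quantities concrete, here is a minimal numeric sketch, assuming a toy MDP with 2 states and 2 actions and hypothetical values for $\mu$, $P$ and $\pi_\theta$ (none of these numbers come from the course):

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions.
mu = np.array([0.7, 0.3])                      # mu(s0): initial state distribution
P = np.array([[[0.9, 0.1], [0.2, 0.8]],        # P[s, a, s']: transition dynamics
              [[0.5, 0.5], [0.3, 0.7]]])
pi = np.array([[0.6, 0.4],                     # pi[s, a] = pi_theta(a|s): the policy
               [0.1, 0.9]])

def trajectory_probability(states, actions):
    """P(tau; theta) = mu(s0) * prod_t pi_theta(a_t|s_t) * P(s_{t+1}|s_t, a_t)."""
    prob = mu[states[0]]
    for t, a in enumerate(actions):
        prob *= pi[states[t], a] * P[states[t], a, states[t + 1]]
    return prob

# Example trajectory: s0=0, a0=1, s1=1, a1=0, s2=0
print(trajectory_probability(states=[0, 1, 0], actions=[1, 0]))
```

The objective $J(\theta)$ is then the sum of $P(\tau;\theta)R(\tau)$ over all possible trajectories, which is exactly why we will need a sample-based estimate: enumerating every trajectory is intractable in practice.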

So we have: $\nabla_\theta J(\theta) = \nabla_\theta \sum_{\tau}P(\tau;\theta)R(\tau)$

We can rewrite the gradient of the sum as the sum of the gradients: $= \sum_{\tau} \nabla_\theta (P(\tau;\theta)R(\tau)) = \sum_{\tau} \nabla_\theta P(\tau;\theta)R(\tau)$, as $R(\tau)$ is not dependent on $\theta$.

We then multiply every term in the sum by $\frac{P(\tau;\theta)}{P(\tau;\theta)}$ (which is possible since it's equal to 1): $= \sum_{\tau} \frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta)R(\tau)$

We can simplify this further since $\frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta) = P(\tau;\theta)\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}$.

Thus we can rewrite the sum as: $= \sum_{\tau} P(\tau;\theta) \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}R(\tau)$

We can then use the derivative log trick (also called likelihood ratio trick or REINFORCE trick), a simple rule in calculus that implies that: $\nabla_x \log f(x) = \frac{\nabla_x f(x)}{f(x)}$
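
As a quick sanity check, here is a small sketch that verifies this identity numerically with PyTorch autograd; the function $f(x) = x^3 + 1$ is just a hypothetical example, any positive differentiable function works:

```python
import torch

# Check that grad_x log f(x) == grad_x f(x) / f(x) at a sample point.
x = torch.tensor(2.0, requires_grad=True)
f = x ** 3 + 1.0

# Left-hand side: grad_x log f(x)
(lhs,) = torch.autograd.grad(torch.log(f), x, retain_graph=True)

# Right-hand side: grad_x f(x) / f(x)
(grad_f,) = torch.autograd.grad(f, x)
rhs = grad_f / f

print(lhs.item(), rhs.item())  # both equal 12/9 ≈ 1.333
```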

So given we have $\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}$, we transform it into $\nabla_\theta \log P(\tau;\theta)$

So this is our likelihood policy gradient: $\nabla_\theta J(\theta) = \sum_{\tau} P(\tau;\theta) \nabla_\theta \log P(\tau;\theta) R(\tau)$

Thanks to this new formula, we can estimate the gradient using trajectory samples (we can approximate the likelihood ratio policy gradient with a sample-based estimate, if you prefer): $\nabla_\theta J(\theta) = \frac{1}{m} \sum^{m}_{i=1} \nabla_\theta \log P(\tau^{(i)};\theta)R(\tau^{(i)})$, where each $\tau^{(i)}$ is a sampled trajectory.
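
In code, this estimator is just an average over sampled trajectories. Below is a minimal sketch assuming we somehow already have $\nabla_\theta \log P(\tau^{(i)};\theta)$ for each sample (the numbers and the 3-dimensional $\theta$ are made up for illustration); the rest of this section shows how to compute that gradient from the policy alone:

```python
import numpy as np

# Hypothetical gradients of log P(tau^(i); theta) for m=3 sampled trajectories,
# with a 3-dimensional parameter vector theta.
grad_log_probs = np.array([[ 0.2, -0.1,  0.05],
                           [-0.3,  0.4,  0.10],
                           [ 0.1,  0.0, -0.20]])
returns = np.array([1.0, 0.5, -1.0])   # R(tau^(i)) for each sampled trajectory

# grad J(theta) ≈ (1/m) * sum_i grad_theta log P(tau^(i); theta) * R(tau^(i))
grad_estimate = (grad_log_probs * returns[:, None]).mean(axis=0)
print(grad_estimate)
```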

But we still have some mathematical work to do: we need to simplify $\nabla_\theta \log P(\tau;\theta)$

We know that: $\nabla_\theta \log P(\tau^{(i)};\theta)= \nabla_\theta \log\left[ \mu(s_0) \prod_{t=0}^{H} P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})\right]$

Where $\mu(s_0)$ is the initial state distribution and $P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)})$ is the state transition dynamics of the MDP.

We know that the log of a product is equal to the sum of the logs: $\nabla_\theta \log P(\tau^{(i)};\theta)= \nabla_\theta \left[\log \mu(s_0) + \sum\limits_{t=0}^{H}\log P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) + \sum\limits_{t=0}^{H}\log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})\right]$

We also know that the gradient of a sum is equal to the sum of the gradients: $\nabla_\theta \log P(\tau^{(i)};\theta)=\nabla_\theta \log\mu(s_0) + \nabla_\theta \sum\limits_{t=0}^{H} \log P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) + \nabla_\theta \sum\limits_{t=0}^{H} \log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})$

Since neither the initial state distribution nor the state transition dynamics of the MDP depend on $\theta$, the derivative of both terms is 0. So we can remove them:

Since $\nabla_\theta \sum_{t=0}^{H} \log P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) = 0$ and $\nabla_\theta \log \mu(s_0) = 0$, we have: $\nabla_\theta \log P(\tau^{(i)};\theta) = \nabla_\theta \sum_{t=0}^{H} \log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})$

We can rewrite the gradient of the sum as the sum of the gradients: $\nabla_\theta \log P(\tau^{(i)};\theta)= \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})$

So, the final formula for estimating the policy gradient is: $\nabla_{\theta} J(\theta) = \hat{g} = \frac{1}{m} \sum^{m}_{i=1} \sum^{H}_{t=0} \nabla_\theta \log \pi_\theta(a^{(i)}_{t} | s_{t}^{(i)})R(\tau^{(i)})$
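
Here is a minimal PyTorch sketch of this estimator $\hat{g}$, assuming a small categorical policy network and a hypothetical batch of pre-collected trajectories (the network size, state dimension, and data below are placeholder assumptions, not the course's implementation):

```python
import torch
from torch import nn

# Hypothetical policy: maps a 4-dimensional state to logits over 2 actions.
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))

def policy_gradient_loss(trajectories):
    """Negative of (1/m) sum_i sum_t log pi_theta(a_t|s_t) * R(tau^(i)),
    so that minimizing the loss follows the policy gradient g_hat."""
    losses = []
    for states, actions, total_return in trajectories:
        logits = policy(states)                              # (H+1, n_actions)
        log_probs = torch.log_softmax(logits, dim=-1)
        taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        losses.append(-taken.sum() * total_return)
    return torch.stack(losses).mean()

# Hypothetical batch of m=2 trajectories of length H+1=3.
trajectories = [
    (torch.randn(3, 4), torch.tensor([0, 1, 1]), torch.tensor(1.0)),
    (torch.randn(3, 4), torch.tensor([1, 0, 0]), torch.tensor(-0.5)),
]
loss = policy_gradient_loss(trajectories)
loss.backward()   # parameter gradients now hold -g_hat
```

Minimizing this loss with an optimizer is therefore equivalent to doing gradient ascent on $J(\theta)$ along $\hat{g}$.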
