\begin{aligned} \nabla \hat{R} _{\theta} &= \hat{ \mathbb{E}} _{\tau \sim \pi _{\theta}(\tau)} \bigg[ R(\tau) \nabla \log \pi _{\theta} (\tau) \bigg] \end{aligned}

where

• $\tau$ is a trajectory
• $\pi_{\theta}$ is a stochastic policy
• $\nabla$ is the gradient operator with respect to the policy parameters $\theta$
• $\hat{\mathbb{E}} _{\tau \sim \pi _{\theta}(\tau)}[\dots]$ is the empirical average expectation over a finite batch of samples in an algorithm that alternates between sampling and optimization
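The score-function estimator above can be illustrated with a minimal Monte-Carlo sketch. This is a hypothetical one-step setting (not an RL environment): the "policy" is a Gaussian $N(\theta, 1)$ over a single action $x$, the "return" is $R(x) = x$, and $\nabla_\theta \log p(x) = x - \theta$, so the true gradient of $\mathbb{E}[R]$ is exactly 1.

```python
import random

random.seed(0)

# Hypothetical one-step setting: "policy" is N(theta, 1) over a scalar
# action x, and the "return" is R(x) = x.  The score function is
# d/dtheta log N(x; theta, 1) = (x - theta), so the estimator is
# mean[ R(x) * (x - theta) ], whose expectation is exactly 1.

theta = 0.3
n = 100_000
samples = [random.gauss(theta, 1.0) for _ in range(n)]

grad_est = sum(x * (x - theta) for x in samples) / n
print(grad_est)  # should be close to the true gradient, 1.0
```

The estimator never differentiates through the environment or the return; it only needs $\nabla_\theta \log p$ of the sampling distribution, which is what makes the sampling/optimization loop above workable.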

In this setting, we have to:

• Use $\pi _{\theta}$ to collect data. Whenever the parameter $\theta$ is updated, we must sample the training data again.
• This is called on-policy training.

With Importance Sampling, we can alter the equation to

\begin{aligned} \nabla \hat{R} _{\theta} &= \hat{ \mathbb{E}} _{\tau \sim \pi _{\theta'} (\tau)} \bigg[ \frac{ \pi _{\theta}(\tau)} {\pi _{\theta'} (\tau)} R(\tau) \nabla \log \pi _{\theta} (\tau) \bigg] \end{aligned}

In this setting, we can:

• Use the samples drawn from a fixed distribution $\pi_{\theta'}$ to train $\theta$ ($\theta'$ and $\theta$ are different).
• We can sample data once and use it to train $\theta$ many times.
• From time to time, we can update $\pi_{\theta'}$, sample new data, and use the new data to train $\theta$.
• This is called off-policy training.
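The importance-sampling identity $\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}[\frac{p(x)}{q(x)} f(x)]$ that justifies this reweighting can be checked with a quick Monte-Carlo sketch. This is a hypothetical example using two Gaussians in place of $\pi_{\theta}$ and $\pi_{\theta'}$, not an actual policy:

```python
import math
import random

random.seed(0)

def normal_pdf(x, mu):
    # density of N(mu, 1)
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

# Goal: estimate E_{x~p}[x^2] for p = N(0, 1), whose true value is 1,
# using samples drawn only from q = N(1, 1) (the "behavior" distribution,
# analogous to sampling from pi_{theta'} to evaluate pi_{theta}).
n = 200_000
samples = [random.gauss(1.0, 1.0) for _ in range(n)]  # x ~ q

# Importance-weighted estimate: E_p[f] = E_q[(p/q) * f]
est = sum(normal_pdf(x, 0.0) / normal_pdf(x, 1.0) * x ** 2
          for x in samples) / n
print(est)  # should be close to 1
```

Note that the variance of this estimator grows quickly as $p$ and $q$ move apart, which is why $\pi_{\theta'}$ still has to be refreshed from time to time rather than fixed forever.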

We learned from Policy Gradient Methods that we should use an advantage function $A^\theta$ in place of the raw return $R(\tau)$. Since the data are now sampled from $\pi_{\theta'}$, the advantage is estimated under $\theta'$, and the gradient becomes

\begin{aligned} &= \hat{\mathbb{E}} _{(s_t, a_t) \sim \pi _{\theta'} (\tau)} \bigg[ \frac{ \pi _{\theta} (s_t, a_t)}{\pi _{\theta'}(s_t, a_t)} A^{\theta'} (s_t, a_t) \nabla \log \pi _{\theta} (a^n_t | s^n_t) \bigg] \\ &= \hat{\mathbb{E}} _{(s_t, a_t) \sim \pi _{\theta'} (\tau)} \bigg[ \frac{ \pi _{\theta} (a_t | s_t)} {\pi _{\theta'}(a_t | s_t)} \frac{\pi _{\theta}(s_t)}{\pi _{\theta'} (s_t)} A^{\theta'} (s_t, a_t) \nabla \log \pi _{\theta}(a^n_t | s^n_t) \bigg] \\ &= \hat{\mathbb{E}} _{(s_t, a_t) \sim \pi _{\theta'}(\tau)} \bigg[ \frac{\pi _{\theta} (a_t | s_t)}{\pi _{\theta'} (a_t | s_t)} A^{\theta'} (s_t, a_t) \nabla \log \pi _{\theta} (a^n_t | s^n_t) \bigg] \\ \end{aligned}

Note that:

• $(s_t, a_t)$ are training data drawn from a fixed distribution $\pi_{\theta'}$
• The calculation of the advantage function $A^{\theta'}$ is based on $\theta'$
• From line 2 to line 3, we assume that $\pi_{\theta}(s_t)$ equals $\pi_{\theta'}(s_t)$, and therefore eliminate $\frac{\pi_{\theta}(s_t)}{\pi_{\theta'}(s_t)}$. This hypothesis assumes that the probability of encountering a state $s_t$ depends on the environment rather than on $\theta$.

$$\nabla f(x) = f(x) \nabla \log f(x)$$
Let $f(x) = \pi_{\theta}(a^n_t | s^n_t)$. From the identity above, combining $\pi_{\theta}(a^n_t | s^n_t) \nabla \log \pi_{\theta}(a^n_t | s^n_t)$ gives $\nabla \pi_{\theta}(a^n_t | s^n_t)$. Hence
\begin{aligned} &= \hat{\mathbb{E}} _{(s_t, a_t) \sim \pi _{\theta'}(\tau)} \bigg[ \frac{ \nabla \pi _{\theta}(a_t | s_t)} {\pi _{\theta'}(a_t | s_t)} A^{\theta'}(s_t, a_t) \bigg] \end{aligned}
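The log-derivative identity $\nabla f(x) = f(x) \nabla \log f(x)$ used here can be verified numerically. This is a quick finite-difference sketch with an arbitrary positive $f$ (not a policy, just an illustration of the identity):

```python
import math

# Check the log-derivative trick:  f'(x) = f(x) * (log f)'(x)
# for f(x) = exp(-x^2); any positive differentiable f works.

def f(x):
    return math.exp(-x * x)

def grad(g, x, h=1e-6):
    # central finite difference approximation of g'(x)
    return (g(x + h) - g(x - h)) / (2 * h)

x = 0.7
lhs = grad(f, x)                                 # f'(x)
rhs = f(x) * grad(lambda t: math.log(f(t)), x)   # f(x) * d/dx log f(x)
print(abs(lhs - rhs))  # agreement up to finite-difference error
```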
The surrogate objective whose gradient is the expression above is
\begin{aligned} J ^{\theta'} (\theta) &= \hat{\mathbb{E}} _{(s_t, a_t) \sim \pi _{\theta'}(\tau)}\bigg[ \frac{\pi _{\theta}(a_t | s_t)}{\pi _{\theta'}(a_t | s_t)} A^{\theta'}(s_t, a_t) \bigg] \end{aligned}
This means we use data sampled from $\pi_{\theta'}$ to update $\theta$.
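As a sanity check, the gradient of this surrogate objective should coincide with the importance-weighted policy gradient derived above. A minimal sketch, assuming a hypothetical two-action softmax policy with a scalar parameter (the names `pi`, `data`, and `dlogpi` are illustrative, not from any library):

```python
import math

def pi(theta, a):
    # softmax policy over actions {0, 1} with a single scalar parameter:
    # logits are [0, theta]
    z = math.exp(0.0) + math.exp(theta)
    return math.exp(theta if a == 1 else 0.0) / z

theta_old, theta = 0.0, 0.5
# (action, advantage) pairs, assumed sampled under theta_old
data = [(0, 1.0), (1, -0.5), (1, 2.0)]

def J(th):
    # surrogate objective: mean of ratio * advantage
    return sum(pi(th, a) / pi(theta_old, a) * adv for a, adv in data) / len(data)

# gradient of J by central finite difference
h = 1e-6
grad_fd = (J(theta + h) - J(theta - h)) / (2 * h)

def dlogpi(th, a):
    # d/dtheta log pi(a | theta) for this softmax: 1{a=1} - pi(theta, 1)
    return a - pi(th, 1)

# importance-weighted policy gradient:
# mean[ (pi_theta / pi_theta') * A * d/dtheta log pi_theta ]
grad_pg = sum(pi(theta, a) / pi(theta_old, a) * adv * dlogpi(theta, a)
              for a, adv in data) / len(data)

print(abs(grad_fd - grad_pg))  # the two gradients should agree
```

Note the advantages here are fixed numbers, matching the derivation: $A^{\theta'}$ is treated as data computed under $\theta'$, so it is not differentiated with respect to $\theta$.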