# Off-Policy Gradient Methods

Recall from Policy Gradient Methods that the gradient is calculated as

$$

\begin{aligned}

\nabla \hat{R} _{\theta} &= \hat{ \mathbb{E}} _{\tau \sim \pi _{\theta}(\tau)} \bigg[ R(\tau) \nabla \log \pi _{\theta} (\tau) \bigg]

\end{aligned}

$$

where

- $\tau$ is a trajectory
- $\pi_{\theta}$ is a stochastic policy
- $\nabla$ is the gradient operator with respect to the policy parameters $\theta$
- $\hat{\mathbb{E}} _{\tau \sim \pi _{\theta}(\tau)}[\dots]$ is the empirical average over a finite batch of sampled trajectories, in an algorithm that alternates between sampling and optimization
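
As a minimal sketch, this estimator can be computed as follows, assuming a hypothetical `policy` network that maps states to logits over discrete actions and a batch of sampled `trajectories`; these names are illustrative and not from the text above.

```python
import torch
from torch.distributions import Categorical

# Empirical policy-gradient estimate:
#   grad ≈ mean over trajectories of R(tau) * grad log pi_theta(tau)
# Each trajectory is a (states, actions, total_return) tuple collected by
# running pi_theta in the environment.
def policy_gradient_loss(policy, trajectories):
    losses = []
    for states, actions, total_return in trajectories:
        logits = policy(states)                                    # (T, num_actions)
        log_probs = Categorical(logits=logits).log_prob(actions)   # (T,)
        # log pi_theta(tau) = sum_t log pi_theta(a_t | s_t); the dynamics
        # terms do not depend on theta and drop out of the gradient.
        losses.append(-total_return * log_probs.sum())
    # Minimizing this loss performs gradient ascent on the expected return.
    return torch.stack(losses).mean()
```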

In this setting, we have to:

- Use $\pi _{\theta}$ to collect data. When the parameter $\theta$ is updated, we must sample the training data again.
- This is called *on-policy* training.
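
A rough sketch of the resulting on-policy loop, reusing `policy_gradient_loss` from above and assuming hypothetical `collect_trajectories` and `num_iterations`:

```python
import torch

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for _ in range(num_iterations):
    # Every batch must come from the CURRENT pi_theta.
    trajectories = collect_trajectories(policy)
    loss = policy_gradient_loss(policy, trajectories)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # After this update, `trajectories` no longer follows pi_theta and
    # cannot be reused; the next iteration samples fresh data.
```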

With Importance Sampling, we can rewrite the gradient as

$$

\begin{aligned}

&= \hat{ \mathbb{E}} _{\tau \sim \pi _{\theta'} (\tau)} \bigg[ \frac{ \pi _{\theta}(\tau)} {\pi _{\theta'} (\tau)} R(\tau) \nabla \log \pi _{\theta} (\tau) \bigg]

\end{aligned}

$$
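
The rewrite rests on the standard importance-sampling identity: an expectation under one distribution can be estimated with samples from another by reweighting,

$$

\mathbb{E} _{x \sim p} \big[ f(x) \big] = \int f(x)\, p(x)\, dx = \int \frac{p(x)}{q(x)} f(x)\, q(x)\, dx = \mathbb{E} _{x \sim q} \bigg[ \frac{p(x)}{q(x)} f(x) \bigg]

$$

Here $p = \pi_{\theta}(\tau)$, $q = \pi_{\theta'}(\tau)$, and $f(\tau) = R(\tau) \nabla \log \pi_{\theta}(\tau)$.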

In this setting, we can:

- Use samples drawn from a *fixed* distribution $\pi_{\theta'}$ to train $\theta$ ($\theta'$ and $\theta$ are different).
- We can sample data once and use it to train $\theta$ many times (see the sketch after this list).
- To a certain extent, we can then update $\pi_{\theta'}$, sample new data, and use the new data to train $\theta$.
- This is called *off-policy* training.
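
Before the advantage function is introduced below, here is a minimal sketch of this off-policy reuse, with the same hypothetical `policy`, `collect_trajectories`, `optimizer`, and `num_reuse_steps` as in the earlier sketches, plus an `old_policy` playing the role of the fixed $\pi_{\theta'}$. The environment dynamics cancel in $\frac{\pi_{\theta}(\tau)}{\pi_{\theta'}(\tau)}$, so the ratio only involves per-step action probabilities.

```python
import torch
from torch.distributions import Categorical

# Sample ONCE from the fixed behaviour policy pi_theta' ...
trajectories = collect_trajectories(old_policy)

# ... then reuse the same batch for several updates of theta.
for _ in range(num_reuse_steps):
    losses = []
    for states, actions, total_return in trajectories:
        log_probs = Categorical(logits=policy(states)).log_prob(actions)
        with torch.no_grad():  # theta' stays fixed
            old_log_probs = Categorical(logits=old_policy(states)).log_prob(actions)
        # ratio = pi_theta(tau) / pi_theta'(tau); dynamics terms cancel.
        ratio = torch.exp(log_probs.sum() - old_log_probs.sum())
        losses.append(-ratio * total_return)
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```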

We learned from Policy Gradient Methods that $R(\tau)$ can be replaced by an advantage function $A^{\theta}(s_t, a_t)$, with the expectation taken over the state-action pairs $(s_t, a_t)$ visited by the policy. Applying importance sampling to this per-step expectation (and estimating the advantage from data collected with $\theta'$), the gradient becomes

$$

\begin{aligned}

&= \hat{\mathbb{E}} _{(s_t, a_t) \sim \pi _{\theta'}} \bigg[ \frac{ \pi _{\theta} (s_t, a_t)}{\pi _{\theta'}(s_t, a_t)} A^{\theta'} (s_t, a_t) \nabla \log \pi _{\theta} (a_t | s_t) \bigg] \\

&= \hat{\mathbb{E}} _{(s_t, a_t) \sim \pi _{\theta'}} \bigg[ \frac{ \pi _{\theta} (a_t | s_t)} {\pi _{\theta'}(a_t | s_t)} \frac{\pi _{\theta}(s_t)}{\pi _{\theta'} (s_t)} A^{\theta'} (s_t, a_t) \nabla \log \pi _{\theta}(a_t | s_t) \bigg] \\

&= \hat{\mathbb{E}} _{(s_t, a_t) \sim \pi _{\theta'}} \bigg[ \frac{\pi _{\theta} (a_t | s_t)}{\pi _{\theta'} (a_t | s_t)} A^{\theta'} (s_t, a_t) \nabla \log \pi _{\theta} (a_t | s_t) \bigg]

\end{aligned}

$$

Note that:

- $(s_t, a_t)$ are training data drawn from the fixed distribution $\pi_{\theta'}$
- The advantage function $A^{\theta'}$ is estimated from data collected with $\theta'$
- From line 2 to line 3, we assume that $\pi_{\theta}(s_t)$ is approximately equal to $\pi_{\theta'}(s_t)$, and therefore cancel the factor $\frac{\pi_{\theta}(s_t)}{\pi_{\theta'}(s_t)}$. This assumes that the probability of visiting a state $s_t$ depends mainly on the environment rather than on $\theta$.

Recall the log-derivative identity:

$$

\nabla f(x) = f(x) \nabla \log f(x)

$$
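
This identity follows from $\nabla \log f(x) = \frac{\nabla f(x)}{f(x)}$ and is easy to check numerically; below is a small, hypothetical verification with PyTorch autograd using $f(x) = x^2 + 1$.

```python
import torch

# Check that ∇f(x) equals f(x) ∇log f(x) at x = 2 for f(x) = x^2 + 1.
x = torch.tensor(2.0, requires_grad=True)
f = x ** 2 + 1

# Left-hand side: ∇f(x)
(grad_f,) = torch.autograd.grad(f, x, retain_graph=True)

# Right-hand side: f(x) ∇log f(x)
(grad_log_f,) = torch.autograd.grad(torch.log(f), x)
rhs = f.detach() * grad_log_f

print(grad_f.item(), rhs.item())  # both print 4.0
```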

Let $f = \pi_{\theta}(a_t | s_t)$. Combining $\pi_{\theta}(a_t | s_t) \nabla \log \pi_{\theta}(a_t | s_t)$ in the expression above then gives $\nabla \pi_{\theta}(a_t | s_t)$. Hence

$$

\begin{aligned}

&= \hat{\mathbb{E}} _{(s_t, a_t) \sim \pi _{\theta'}} \bigg[ \frac{ \nabla \pi _{\theta}(a_t | s_t)} {\pi _{\theta'}(a_t | s_t)} A^{\theta'}(s_t, a_t) \bigg]

\end{aligned}

$$

This last expression is itself the gradient of a function of $\theta$. Removing the gradient operator gives the objective function we optimize:

$$

\begin{aligned}

J ^{\theta'} (\theta) &= \hat{\mathbb{E}} _{(s_t, a_t) \sim \pi _{\theta'}}\bigg[ \frac{\pi _{\theta}(a_t | s_t)}{\pi _{\theta'}(a_t | s_t)} A^{\theta'}(s_t, a_t) \bigg]

\end{aligned}

$$

This means we use data sampled from $\pi_{\theta'}$ to update $\theta$.
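
For concreteness, here is a minimal sketch of this objective, assuming hypothetical `policy` ($\theta$) and `old_policy` ($\theta'$) networks over discrete actions, plus per-step `states`, `actions`, and `advantages` estimated from data collected with $\pi_{\theta'}$; none of these names come from the text above.

```python
import torch
from torch.distributions import Categorical

def surrogate_objective(policy, old_policy, states, actions, advantages):
    """Empirical estimate of J^{theta'}(theta) on one batch of (s_t, a_t)."""
    log_probs = Categorical(logits=policy(states)).log_prob(actions)
    with torch.no_grad():  # theta' is fixed; no gradient flows through it
        old_log_probs = Categorical(logits=old_policy(states)).log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)   # pi_theta / pi_theta'
    return (ratio * advantages).mean()

# Usage sketch: ascend J^{theta'}(theta) by minimizing its negative.
# loss = -surrogate_objective(policy, old_policy, states, actions, advantages)
# loss.backward(); optimizer.step()
```

Differentiating this objective with respect to $\theta$ reproduces the gradient derived above, and the same batch drawn from $\pi_{\theta'}$ can be reused for several updates of $\theta$.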