Off-Policy Gradient Methods
Recall from Policy Gradient Methods that the gradient is computed as
$$
\begin{aligned}
\nabla \hat{R} _{\theta} &= \hat{ \mathbb{E}} _{\tau \sim \pi _{\theta}(\tau)} \bigg[ R(\tau) \nabla \log \pi _{\theta} (\tau) \bigg]
\end{aligned}
$$
where
- $\tau$ is a trajectory
- $\pi_{\theta}$ is a stochastic policy
- $\nabla$ is the gradient operator with respect to the policy parameters $\theta$
- $\hat{\mathbb{E}} _{\tau \sim \pi _{\theta}(\tau)}[\dots]$ is the empirical average expectation over a finite batch of samples in an algorithm that alternates between sampling and optimization
In this setting, we have to:
- Use $\pi _{\theta}$ to collect data. Once the parameter $\theta$ is updated, we must sample new training data, because the old samples no longer come from the current policy (see the sketch after this list).
- This is called on-policy training.
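As a concrete illustration, here is a minimal on-policy update in PyTorch on toy tensors. The network, shapes, and return value are assumptions made for illustration, not part of the original derivation:

```python
import torch
from torch.distributions import Categorical

# Toy stand-ins: a linear policy over 4-dim states and 2 discrete actions.
policy_net = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

states  = torch.randn(10, 4)             # one sampled trajectory: 10 states
actions = torch.randint(0, 2, (10,))     # actions taken by pi_theta
R_tau   = torch.tensor(3.0)              # total return R(tau) of this trajectory

dist = Categorical(logits=policy_net(states))
log_pi_tau = dist.log_prob(actions).sum()    # log pi_theta(tau) = sum_t log pi_theta(a_t | s_t)

loss = -R_tau * log_pi_tau               # grad of -loss is R(tau) * grad log pi_theta(tau)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# After this step the old trajectory no longer follows pi_theta,
# so on-policy training must sample fresh trajectories before the next update.
```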
With Importance Sampling, we can rewrite the gradient as
$$
\begin{aligned}
&= \hat{ \mathbb{E}} _{\tau \sim \pi _{\theta'} (\tau)} \bigg[ \frac{ \pi _{\theta}(\tau)} {\pi _{\theta'} (\tau)} R(\tau) \nabla \log \pi _{\theta} (\tau) \bigg]
\end{aligned}
$$
In this setting, we can:
- Use samples drawn from a fixed distribution $\pi_{\theta'}$ to train $\theta$ (note that $\theta'$ and $\theta$ are different).
- We can sample data once to train $\theta$ many times.
- From time to time, we can update $\pi_{\theta'}$, sample new data, and use the new data to keep training $\theta$.
- This is called off-policy training (a toy importance-sampling example is sketched below).
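To see why the reweighting works, here is a small numerical sketch of importance sampling on a toy problem, where two Gaussians stand in for $\pi_{\theta}$ and $\pi_{\theta'}$; the distributions and the function $f$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

f = lambda x: x ** 2                      # any function of interest, e.g. a return
p = lambda x: normal_pdf(x, 0.0)          # "target" distribution, standing in for pi_theta
q = lambda x: normal_pdf(x, 1.0)          # "behavior" distribution, standing in for pi_theta'

x_q = rng.normal(1.0, 1.0, size=200_000)  # samples drawn from q only
weights = p(x_q) / q(x_q)                 # importance weights p(x) / q(x)
print(np.mean(weights * f(x_q)))          # ~= E_p[f] = 1.0, without ever sampling from p
```

The estimate matches the expectation under $p$ even though every sample came from $q$; the price is higher variance when the two distributions differ a lot.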
We learned from Policy Gradient Methods that we use an advantage function $A^{\theta}(s_t, a_t)$ in place of the whole-trajectory return $R(\tau)$. The gradient therefore becomes
$$
\begin{aligned}
&= \hat{\mathbb{E}} _{(s_t, a_t) \sim \pi _{\theta'} (\tau)} \bigg[ \frac{ \pi _{\theta} (s_t, a_t)}{\pi _{\theta'}(s_t, a_t)} A^{\theta'} (s_t, a_t) \nabla \log \pi _{\theta} (a_t | s_t) \bigg] \\
&= \hat{\mathbb{E}} _{(s_t, a_t) \sim \pi _{\theta'} (\tau)} \bigg[ \frac{ \pi _{\theta} (a_t | s_t)} {\pi _{\theta'}(a_t | s_t)} \frac{\pi _{\theta}(s_t)}{\pi _{\theta'} (s_t)} A^{\theta'} (s_t, a_t) \nabla \log \pi _{\theta}(a_t | s_t) \bigg] \\
&= \hat{\mathbb{E}} _{(s_t, a_t) \sim \pi _{\theta'}(\tau)} \bigg[ \frac{\pi _{\theta} (a_t | s_t)}{\pi _{\theta'} (a_t | s_t)} A^{\theta'} (s_t, a_t) \nabla \log \pi _{\theta} (a_t | s_t) \bigg]
\end{aligned}
$$
Note that:
- $(s_t, a_t)$ are training data drawn from the fixed distribution $\pi_{\theta'}$
- The advantage function $A^{\theta'}$ is computed from data collected with $\theta'$
- From line 2 to line 3, we assume $\pi_{\theta}(s_t) \approx \pi_{\theta'}(s_t)$, so the factor $\frac{\pi_{\theta}(s_t)}{\pi_{\theta'}(s_t)}$ cancels. This assumes the probability of visiting a state $s_t$ depends mainly on the environment and only weakly on $\theta$.
Recall the log-derivative identity:
$$
\nabla f(x) = f(x) \nabla \log f(x)
$$
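This identity can be checked numerically with autograd. The following sketch uses a softmax over three actions as a stand-in for $\pi_{\theta}$ (an assumption for illustration only):

```python
import torch

# Numerical check of the identity  grad f = f * grad(log f)
# with f = pi_theta(a) given by a softmax over three "actions".
theta = torch.randn(3, requires_grad=True)
a = 1                                            # an arbitrary action index

f = torch.softmax(theta, dim=0)[a]               # f = pi_theta(a)
grad_f, = torch.autograd.grad(f, theta)          # left-hand side: grad pi_theta(a)

log_f = torch.log_softmax(theta, dim=0)[a]       # log pi_theta(a)
grad_log_f, = torch.autograd.grad(log_f, theta)  # grad log pi_theta(a)

print(torch.allclose(grad_f, f.detach() * grad_log_f))  # True
```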
Let $f = \pi_{\theta}(a_t | s_t)$. By this identity, combining $\pi_{\theta}(a_t | s_t) \nabla \log \pi_{\theta}(a_t | s_t)$ gives $\nabla \pi_{\theta}(a_t | s_t)$. Hence
$$
\begin{aligned}
&= \hat{\mathbb{E}} _{(s_t, a_t) \sim \pi _{\theta'}(\tau)} \bigg[ \frac{ \nabla \pi _{\theta}(a_t | s_t)} {\pi _{\theta'}(a_t | s_t)} A^{\theta'}(s_t, a_t) \bigg]
\end{aligned}
$$
Removing the gradient operator gives the objective function whose gradient we have been computing:
$$
\begin{aligned}
J ^{\theta'} (\theta) &= \hat{\mathbb{E}} _{(s_t, a_t) \sim \pi _{\theta'}(\tau)}\bigg[ \frac{\pi _{\theta}(a_t | s_t)}{\pi _{\theta'}(a_t | s_t)} A^{\theta'}(s_t, a_t) \bigg]
\end{aligned}
$$
This means we use data sampled with $\pi_{\theta'}$ to update $\theta$: maximizing $J^{\theta'}(\theta)$ by gradient ascent yields exactly the gradient derived above.
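A minimal sketch of this objective in PyTorch, assuming the batch (states, actions, advantages, and the old log-probabilities) was recorded while sampling with $\pi_{\theta'}$; the function and variable names are illustrative, not from the notes:

```python
import torch
from torch.distributions import Categorical

def surrogate_objective(policy_net, states, actions, logp_old, advantages):
    """J^{theta'}(theta) estimated on one batch collected with the old policy."""
    dist = Categorical(logits=policy_net(states))
    logp_new = dist.log_prob(actions)            # log pi_theta(a_t | s_t)
    ratio = torch.exp(logp_new - logp_old)       # pi_theta / pi_theta' (importance weight)
    return (ratio * advantages).mean()           # A^{theta'} is treated as a fixed number

# Because the expectation is taken over pi_theta', the same batch can be reused
# for several gradient ascent steps on theta before re-sampling with a new theta':
#   loss = -surrogate_objective(policy_net, states, actions, logp_old, advantages)
#   loss.backward(); optimizer.step()
```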