This is Post 4 in the field guide to reinforcement learning for control and robotics. The previous posts covered the MDP formalism and Bellman equations, then how temporal-difference learning actually estimates value functions from data. Both of those were about learning values. This post is about learning the policy directly.
The shift of perspective matters. Value-based methods — Q-learning, DQN — learn $Q^\star$ and read off a policy as $\arg\max_a Q(s, a)$. That works brilliantly with discrete actions and fails immediately with continuous ones, because you can’t enumerate the max over $\mathbb{R}^n$. Policy-gradient methods sidestep the problem entirely: parameterize the policy directly as $\pi_\theta(a \mid s)$, and do gradient ascent on expected return. Continuous actions come for free. Stochastic policies — the kind that enable principled exploration — come for free. And the same machinery generalizes to image observations, recurrent policies, and pretty much anything else you can put a gradient through.
The catch, as with everything in RL, is variance. The policy gradient estimator is technically unbiased and practically unusable until you tame it. This post is the tour of the three tricks that do.
The policy gradient theorem
The objective is the expected return of the policy, $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[G_0]$. Its gradient, remarkably, has a clean form that doesn’t require differentiating through the environment dynamics:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \Psi_t \right]\]where $\Psi_t$ is any of: the return $G_t$, the Q-value $Q^\pi(s_t, a_t)$, or — in practice, the best choice — the advantage $A^\pi(s_t, a_t)$.
The geometric meaning is exactly what you’d want: push up the log-probability of actions whose weight $\Psi_t$ is high, push down the ones whose $\Psi_t$ is low. The score function $\nabla_\theta \log \pi_\theta$ is the direction in parameter space that increases the probability of the sampled action. Multiply it by how good that action turned out to be, sum over the trajectory, and you have the gradient.
This is REINFORCE when $\Psi_t = G_t$. It works in theory. It’s unbiased. It also has variance so extreme that you can’t realistically train anything harder than CartPole with it. The rest of this post is about why, and what to do about it.
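As a rough illustration of the $\Psi_t = G_t$ case, here is a minimal sketch in plain Python. It assumes a one-dimensional Gaussian policy whose mean `mu` does not depend on the state, which is a simplification made here and not something the theorem requires. The function names are hypothetical.

```python
def rewards_to_go(rewards, gamma):
    """Return G_t = sum_k gamma^k * r_{t+k} for every t, computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def gaussian_score_mu(action, mu, sigma):
    """Score function: d/d(mu) of log N(action | mu, sigma^2)."""
    return (action - mu) / sigma ** 2


def reinforce_gradient(actions, rewards, mu, sigma, gamma):
    """Single-trajectory REINFORCE estimate of dJ/d(mu), using Psi_t = G_t."""
    returns = rewards_to_go(rewards, gamma)
    return sum(gaussian_score_mu(a, mu, sigma) * g
               for a, g in zip(actions, returns))


# Toy trajectory (made-up numbers, not from a real environment).
print(rewards_to_go([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
print(reinforce_gradient([1.0, 0.0, -1.0], [1.0, 0.0, 2.0],
                         mu=0.0, sigma=1.0, gamma=0.5))
```

Every action’s score is multiplied by the whole downstream return, so one noisy reward late in the episode changes the weight on every earlier action.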
Why variance is ruinous by default
There are two separate variance problems in REINFORCE.
First — the return itself is noisy. $G_t$ sums many random rewards over many random transitions. Even for a good policy in a stable environment, different rollouts produce wildly different returns. The gradient estimator inherits all of that noise.
Second — good actions can get low weight and bad actions can get high weight, purely by chance. Imagine a policy that takes a great action at step 3 of an episode but then gets unlucky later and earns a terrible return. REINFORCE will punish that great action, because $G_3$ captures everything downstream. The algorithm blames each action for the entire remainder of the trajectory, regardless of whether it was actually responsible.
Modern policy gradients fix both problems with three ideas that compound nicely.
Trick 1 — baselines
You can subtract any state-dependent function $b(s_t)$ from $\Psi_t$ without changing the expectation:
\[\mathbb{E}_\pi\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot b(s_t) \right] = 0\]Why? Because $b$ doesn’t depend on the action, so it factors out of the action expectation, and then $\mathbb{E}_{\pi}[\nabla \log \pi] = 0$ by the standard log-trick identity. The expected gradient is unchanged for any such $b$, and a well-chosen $b$ reduces the variance. It’s a free lunch.
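Spelled out for a single state $s$ and a continuous action space (sums replace the integral in the discrete case), and assuming the density is regular enough to swap gradient and integral, the argument is one line:

\[\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, b(s) \right] = b(s) \int \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}\, da = b(s)\, \nabla_\theta \int \pi_\theta(a \mid s)\, da = b(s)\, \nabla_\theta 1 = 0\]

The step that matters is the first equality: $b(s)$ can only be pulled outside the integral because it has no dependence on $a$.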
The natural choice is $b(s) = V^\pi(s)$, which turns $\Psi_t$ from the raw return into something much more meaningful:
\[A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)\]The advantage function. It measures how much better than average the action was, removing the baseline “everyone gets this much just for being in this state.” Now a great action that gets unlucky downstream gets its credit — because the baseline absorbs the general unluckiness of the state.
Trick 2 — actor-critic
In REINFORCE, the weight $\Psi_t$ is a Monte Carlo return. The actor-critic idea: use a learned $V_\phi(s)$ as the baseline and estimate the advantage by bootstrapping, just like TD did for value functions.
The simplest version uses the one-step TD error directly as the advantage estimate:
\[\hat A_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) = \delta_t\]That’s the exact same TD error from the previous post. Low variance (one-step bootstrap), some bias (because $V_\phi$ is learned). The actor — the policy $\pi_\theta$ — is updated by policy gradient using $\hat A_t$. The critic — the value function $V_\phi$ — is updated by regression to its TD target. Two networks, one shared rollout, training together.
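A minimal sketch of one such update, assuming a tabular critic stored in a dict and a 1D Gaussian actor with a state-independent mean. Both are simplifications chosen here for brevity, and the function names are hypothetical. The terminal-state handling (no bootstrap when the episode ends) is also an addition that the formula above leaves implicit.

```python
def td_error(reward, value, next_value, gamma, done):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); no bootstrap at episode end."""
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value


def critic_update(values, state, delta, lr):
    """Tabular critic step: move V(state) toward its TD target by lr * delta."""
    values[state] += lr * delta
    return values


def actor_update(mu, action, sigma, delta, lr):
    """Policy-gradient step on the Gaussian mean, using delta as the advantage."""
    return mu + lr * delta * (action - mu) / sigma ** 2


# One transition s0 -> s1 with made-up numbers.
values = {"s0": 1.0, "s1": 2.0}
delta = td_error(reward=0.5, value=values["s0"], next_value=values["s1"],
                 gamma=0.9, done=False)
values = critic_update(values, "s0", delta, lr=0.1)
mu = actor_update(mu=0.0, action=1.0, sigma=1.0, delta=delta, lr=0.1)
print(delta, values["s0"], mu)
```

The same `delta` drives both updates: it is the regression error for the critic and the advantage estimate for the actor.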
Trick 3 — GAE, the interpolation
One-step TD has low variance but high bias. Monte Carlo has zero bias but enormous variance. The $n$-step return from the previous post sits somewhere between. Generalized Advantage Estimation parameterizes the full spectrum with a single knob $\lambda$:
\[\hat A_t^{\text{GAE}(\gamma,\lambda)} = \sum_{k=0}^{\infty} (\gamma\lambda)^k\, \delta_{t+k}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\]GAE is an exponentially-weighted sum of TD errors over the rest of the rollout.
- $\lambda = 0$ recovers the one-step TD actor-critic.
- $\lambda = 1$ recovers the Monte Carlo advantage (with a learned baseline).
- $\lambda = 0.95$ is the near-universal choice — low enough to keep variance reasonable, high enough to propagate credit meaningfully.
GAE is the default advantage estimator in PPO, A2C, and just about every modern on-policy method. The $\lambda$ knob essentially never gets tuned; 0.95 just works.
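In code, the infinite sum is usually truncated at the end of the rollout and computed with a single backward pass. The sketch below assumes one rollout segment with no episode boundary inside it, plus a bootstrap value `last_value` for the state after the final step. The function and argument names are hypothetical.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Truncated GAE for one rollout segment.

    values[t] is V(s_t); last_value is V(s_T), the state after the final step.
    Uses the recursion A_t = delta_t + gamma * lam * A_{t+1}.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
# lam = 0: each advantage is just the one-step TD error.
print(gae_advantages(rewards, values, 0.0, gamma=1.0, lam=0.0))  # [1.0, 1.0, 0.5]
# lam = 1: Monte Carlo return minus the value baseline.
print(gae_advantages(rewards, values, 0.0, gamma=1.0, lam=1.0))  # [2.5, 1.5, 0.5]
```

With $\gamma = 1$ the returns from each step are 3, 2, and 1, so the $\lambda = 1$ row is exactly $G_t - V(s_t)$, matching the second bullet above.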
Watching it happen
Below is a minimal 1D policy-gradient demo. The agent picks a single continuous action $a \in [-5, 5]$. The environment rewards actions near some hidden target $a^*$: $r = -(a - a^*)^2$, plus a bit of noise. The policy is Gaussian with mean $\mu_\theta$ (what the curve’s peak sits at) and fixed standard deviation $\sigma$ (how wide the curve is).
Click Step to sample an action, observe the reward, and update $\mu_\theta$ via the policy-gradient rule. Watch $\mu$ drift toward $a^*$ as rewarded actions (green dots) pull the distribution right and punished actions (orange) push it left. The baseline toggle switches between REINFORCE (raw reward) and an advantage-style update (reward minus a running average). The difference in variance is visible in the update sizes.
Three things worth noticing as you play:
- Reducing σ mid-training accelerates convergence but increases the chance of getting stuck. That’s the exploration-exploitation trade-off in continuous form. Too-wide a policy explores forever; too-narrow commits early. This is why modern methods either learn $\sigma$ (REINFORCE-style) or regularize it with entropy (SAC-style).
- REINFORCE’s updates are much larger and noisier than the baseline version. At the same learning rate, REINFORCE will oscillate; with the baseline, the policy walks cleanly to the target. Same gradient direction; different variance.
- Cranking α up makes REINFORCE unstable long before it breaks the baseline version. That’s the whole practical value of variance reduction: it lets you use a larger step size safely. And larger step sizes are what makes training fast.
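For readers who want to poke at the same setup offline, here is a rough stand-in in plain Python. It is a sketch under stated assumptions, not the code behind the demo: the learning rate, starting mean, noise level, and the 0.1 decay of the running-average baseline are all picked here for illustration, and actions are not clipped to $[-5, 5]$.

```python
import random


def run_demo(use_baseline, steps=3000, lr=0.01, sigma=1.0,
             target=2.0, mu_start=-2.0, seed=0):
    """1D Gaussian policy gradient on r = -(a - target)^2 + noise.

    Returns the final mean and the average absolute update size.
    """
    rng = random.Random(seed)
    mu = mu_start
    baseline = 0.0          # running average of observed rewards
    total_abs_step = 0.0
    for _ in range(steps):
        action = rng.gauss(mu, sigma)
        reward = -(action - target) ** 2 + rng.gauss(0.0, 0.1)
        # REINFORCE weights by the raw reward; the baseline version
        # weights by reward minus the running average.
        weight = reward - baseline if use_baseline else reward
        step = lr * weight * (action - mu) / sigma ** 2
        mu += step
        total_abs_step += abs(step)
        # Update the baseline after using it, so it never sees the
        # current sample's reward.
        baseline += 0.1 * (reward - baseline)
    return mu, total_abs_step / steps


for flag in (False, True):
    final_mu, avg_step = run_demo(use_baseline=flag)
    print(f"baseline={flag}: final mu={final_mu:.2f}, mean |update|={avg_step:.4f}")
```

Both variants follow the same expected gradient. The thing to compare is the mean update size, and how quickly the raw version starts to misbehave as `lr` is raised.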
Stochastic vs deterministic policies
For continuous actions, $\pi_\theta(a \mid s)$ is usually a Gaussian — the network outputs mean $\mu(s)$ and log-standard-deviation $\log\sigma(s)$. Sampling is $a = \mu + \sigma \cdot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. This is what PPO, A2C, and SAC use. The log-probability of the sampled action is differentiable, so the policy gradient works cleanly.
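For a scalar action, sampling and the log-probability fit in a few lines. This is a minimal sketch with hypothetical function names; for a multi-dimensional diagonal Gaussian the per-dimension log-probabilities would be summed.

```python
import math
import random


def sample_action(mu, log_sigma, rng):
    """Reparameterized sample: a = mu + sigma * eps, with eps ~ N(0, 1)."""
    sigma = math.exp(log_sigma)
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps


def log_prob(action, mu, log_sigma):
    """Log-density of N(mu, sigma^2) evaluated at `action`."""
    sigma = math.exp(log_sigma)
    z = (action - mu) / sigma
    return -0.5 * z * z - log_sigma - 0.5 * math.log(2.0 * math.pi)


rng = random.Random(0)
a = sample_action(mu=0.3, log_sigma=-0.5, rng=rng)
print(a, log_prob(a, mu=0.3, log_sigma=-0.5))
```

Working with $\log\sigma$ instead of $\sigma$ keeps the standard deviation positive without any constraint on the network output.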
The alternative is a deterministic policy $a = \mu_\theta(s)$ with noise added externally for exploration during training. This is what DDPG and TD3 use, and it leads to a different update rule — the deterministic policy gradient theorem — and different stability properties. We’ll unpack those in the next post.
Terminology alert. The entropy of a stochastic policy, $H(\pi(\cdot \mid s)) = -\mathbb{E}[\log \pi]$, measures how random it is. Adding $\beta H$ to the objective encourages exploration — the policy is rewarded for staying uncertain. This is the core idea of maximum-entropy RL and the reason SAC is so robust. Rather than picking an exploration schedule and hoping it’s right, SAC treats entropy as a first-class part of the objective and lets the algorithm decide how much uncertainty to keep.
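For the Gaussian policies discussed above, the entropy has a closed form, $H = \tfrac{1}{2}\log(2\pi e) + \log\sigma$ per action dimension, so the bonus is cheap to compute. A minimal sketch with hypothetical function names and a scalar action:

```python
import math


def gaussian_entropy(log_sigma):
    """Differential entropy of N(mu, sigma^2); it does not depend on mu."""
    return 0.5 * math.log(2.0 * math.pi * math.e) + log_sigma


def entropy_regularized_objective(expected_return, log_sigma, beta):
    """Objective with the beta * H bonus added to the expected return."""
    return expected_return + beta * gaussian_entropy(log_sigma)


print(gaussian_entropy(0.0))    # entropy at sigma = 1
print(gaussian_entropy(-1.0))   # a narrower policy has lower entropy
```

Because the entropy is linear in $\log\sigma$, the bonus pushes directly against the policy collapsing to a narrow distribution too early.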
Trust regions, clipping, and PPO
REINFORCE with a baseline and GAE gets you A2C. One more problem remains: policy-gradient updates can take steps that are too big, landing you at a new policy that’s worse than the one you started with. The gradient is a first-order signal; far from the sampled data, it lies.
TRPO (Trust Region Policy Optimization) addresses this by enforcing a hard KL-divergence constraint between the old and new policies on every update. It works well but involves a constrained optimization at every step — conjugate gradients, Fisher-vector products — which is a lot of machinery.
PPO (Proximal Policy Optimization) is the simplification that ate the world. Instead of a hard constraint, PPO clips the probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ to stay within $[1-\epsilon, 1+\epsilon]$, typically with $\epsilon = 0.2$. The clipped surrogate objective is:
\[L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\Big( r_t(\theta)\, \hat A_t,\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\, \hat A_t \Big) \right]\]If the new policy drifts too far on an advantageous action, the gradient through the clip branch goes to zero — the update is capped. PPO is embarrassingly simple to implement (a single torch.clamp call), robust to hyperparameters, and has become the de facto standard for on-policy deep RL. The algorithm selector in the field guide recommends PPO for most sim-heavy continuous-control tasks, and that recommendation is driven entirely by this one simplification.
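The objective is short enough to write out without a deep-learning framework. The sketch below is plain Python over lists of per-sample values, with hypothetical function and argument names. In a real implementation the same arithmetic runs on tensors so that gradients flow through `new_log_probs`.

```python
import math


def clipped_surrogate(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # r_t = pi_new / pi_old
        clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


up = math.log(1.5)    # new policy makes the action 1.5x more likely
down = math.log(0.5)  # new policy makes the action half as likely
print(clipped_surrogate([up], [0.0], [1.0]))     # capped near 1.2
print(clipped_surrogate([up], [0.0], [-1.0]))    # not capped: near -1.5
print(clipped_surrogate([down], [0.0], [-1.0]))  # capped near -0.8
```

The middle case shows the asymmetry of the `min`: when the policy has moved in the wrong direction (raising the probability of a bad action), the unclipped term is the smaller one, so the penalty is not capped.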
What’s actually running when you train PPO
To close the loop, here’s what each training iteration looks like in practice — every modern PPO implementation is some version of this:
1. Roll out the current policy $\pi_{\theta_{\text{old}}}$ for $N$ steps in parallel environments; store $(s_t, a_t, r_t, \log \pi_{\theta_{\text{old}}}(a_t \mid s_t))$ for every transition.
2. Compute the critic’s value estimates $V_\phi(s_t)$ for every state in the batch.
3. Compute GAE advantages $\hat A_t$ using $\gamma$ and $\lambda$, from the stored rewards and $V_\phi$.
4. Normalize the advantages across the batch to zero mean and unit variance (this is one of the “37 implementation details”).
5. Compute the PPO-clipped surrogate $L^{\text{CLIP}}$ using the current policy $\pi_\theta$ against the stored $\log \pi_{\theta_{\text{old}}}$.
6. Compute the value loss as MSE between the critic’s current prediction $V_\phi(s_t)$ and the return target $\hat R_t = \hat A_t + V_{\phi_{\text{old}}}(s_t)$, where $\hat A_t$ is the unnormalized GAE advantage and $V_{\phi_{\text{old}}}(s_t)$ is the value estimate from step 2.
7. Compute the entropy bonus $\beta H(\pi_\theta)$ to encourage exploration.
8. Backpropagate the combined loss $-L^{\text{CLIP}} + c_1 L^{\text{VF}} - c_2 H$ through actor and critic. Repeat for ~10 epochs over the rollout.
9. Throw away the rollout: it’s off-policy now that $\theta$ has moved. Go back to step 1.
The policy gradient is buried in step 5. Everything else is plumbing that makes the gradient estimate usable: the rollout buffer (data collection), GAE (variance reduction), normalization (numerical stability), clipping (trust-region surrogate), the critic loss (baseline learning), and the entropy bonus (exploration). That ratio — one gradient formula to nine lines of engineering — is what the rest of this series is really about.
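Several of those plumbing pieces are only a few lines each. The sketch below covers advantage normalization, the value targets, the value loss, and the combined loss in plain Python. The function names are hypothetical, and the coefficients `c1=0.5` and `c2=0.01` are commonly seen defaults assumed here, not values taken from this series.

```python
import math


def normalize(advantages, eps=1e-8):
    """Shift and scale a batch of advantages to zero mean, unit variance."""
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    std = math.sqrt(var)
    return [(a - mean) / (std + eps) for a in advantages]


def value_targets(raw_advantages, old_values):
    """Return targets R_t = A_t + V_old(s_t), from unnormalized advantages."""
    return [a + v for a, v in zip(raw_advantages, old_values)]


def value_loss(new_values, targets):
    """Mean squared error between current critic outputs and the targets."""
    return sum((v - r) ** 2 for v, r in zip(new_values, targets)) / len(targets)


def ppo_total_loss(clip_objective, vf_loss, entropy, c1=0.5, c2=0.01):
    """Loss to minimize: -L_CLIP + c1 * L_VF - c2 * H."""
    return -clip_objective + c1 * vf_loss - c2 * entropy


raw_adv = [1.0, 2.0, 3.0]
print(normalize(raw_adv))                         # used in the policy term
print(value_targets(raw_adv, [0.5, 0.5, 0.5]))    # used in the critic term
print(ppo_total_loss(clip_objective=1.0, vf_loss=2.0, entropy=3.0))
```

The value targets are built from the raw advantages, not the normalized ones, because normalization changes their scale and would make the critic regress to the wrong quantity.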
The next post steps up to the algorithms that actually dominate continuous-control papers and competitions: SAC, TD3, DDPG, and how they diverge from PPO by embracing off-policy learning and deterministic policies. That’s where exploration, entropy regularization, and the choice of stochastic vs deterministic policies become the axes that distinguish algorithms.