Temporal Difference Learning, SARSA, and Q-Learning

This is Post 3 in the field guide to reinforcement learning for control and robotics, and it picks up directly from the foundations post. There, we wrote down the MDP and the Bellman equations. Now comes the question that makes RL RL: how do you estimate value functions when you don’t know the model?

The answer is temporal-difference learning, and it’s the single most important idea that distinguishes reinforcement learning from classical dynamic programming. One update rule unlocks SARSA, Q-learning, DQN, the advantage signal in every actor-critic method, and GAE. If you understand the TD error, you understand the spine of modern RL.

Three ways to estimate a value function

Given a policy $\pi$, there are three classical ways to compute $V^\pi(s)$. They differ in what they use as the target for each update.

Method	Target	Needs model?	Bias	Variance
Dynamic programming	$\mathbb{E}_\pi[R + \gamma V(S')]$ exactly	Yes ($P$, $R$)	None	None
Monte Carlo	$G_t = \sum_k \gamma^k R_{t+k+1}$	No	None	High
TD(0)	$R_{t+1} + \gamma V(S_{t+1})$	No	Some (bootstraps)	Low

Dynamic programming assumes you have the model — which you almost never do in the problems RL is meant for. Monte Carlo waits for the full return of each episode before updating, which gives unbiased estimates but enormous variance. TD(0) threads the needle: it bootstraps off its own current estimate of $V(S_{t+1})$ instead of waiting for the actual return, accepting a little bias in exchange for huge variance reductions. You can update every step, without a model, without episode boundaries.

The TD(0) update is the centerpiece of the field:

\[V(S_t) \;\leftarrow\; V(S_t) \;+\; \alpha \underbrace{\bigl[\, R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \,\bigr]}_{\delta_t \text{ (the TD error)}}\]

That quantity $\delta_t$ — the TD error — is everywhere in RL. Actor-critic methods use it as the advantage signal. GAE is just an exponentially-weighted sum of TD errors. Policy gradients effectively scale their updates by it. If one equation had to represent modern RL, it would be this one.
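A minimal tabular sketch of the TD(0) update, to make the bookkeeping concrete. The function name `td0_update`, the dictionary-based value table, and the two-state chain are illustrative assumptions, not something taken from the post.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95, terminal=False):
    """Apply one TD(0) update to the table V in place and return the TD error."""
    # A terminal next state has value 0 by convention, so nothing is bootstrapped.
    target = r if terminal else r + gamma * V[s_next]
    delta = target - V[s]              # the TD error
    V[s] += alpha * delta
    return delta

# Hypothetical chain: A -> B -> end, with reward 1 on the final transition.
V = {"A": 0.0, "B": 0.0}
episode = [("A", 0.0, "B", False), ("B", 1.0, None, True)]

for _ in range(200):                   # replay the same episode many times
    for s, r, s_next, terminal in episode:
        td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95, terminal=terminal)

print(V)
```

Note that `V["A"]` is never shown a reward directly. It learns only by bootstrapping off `V["B"]`, which is the whole point of the method.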

Control engineer’s decoder ring. TD is recursive estimation applied to value functions. The TD error $\delta_t$ is the innovation — exactly the quantity a Kalman filter uses to correct its prediction. Bootstrapping off $V(S_{t+1})$ instead of waiting for the full return is the same move a recursive filter makes instead of batch least-squares: trade a little bias for a lot less variance.

From TD(0) to control — SARSA

Extend TD(0) to $Q$-values instead of $V$-values, and the update becomes a control algorithm. SARSA stands for the five things its update uses: $(S, A, R, S', A')$.

\[Q(S_t, A_t) \;\leftarrow\; Q(S_t, A_t) + \alpha\bigl[\, R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \,\bigr]\]

Here $A_{t+1}$ is the action actually taken under the current behavior policy — typically $\varepsilon$-greedy with respect to $Q$. SARSA therefore learns the value of the exploratory policy. If the policy is risky (high $\varepsilon$), SARSA knows, and it accounts for that — it learns a $Q$ that reflects the reality of occasionally doing something random.
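The SARSA update can be sketched in a few lines. The helper names (`epsilon_greedy`, `sarsa_update`), the list-of-lists table, and the numbers are assumptions chosen for illustration.

```python
import random

def epsilon_greedy(Q, s, n_actions, eps, rng=random):
    """Pick a random action with probability eps, otherwise a greedy one."""
    if rng.random() < eps:
        return rng.randrange(n_actions)
    row = Q[s]
    return max(range(n_actions), key=lambda a: row[a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.95, terminal=False):
    """One SARSA update; a_next is the action the agent will actually take in s_next."""
    bootstrap = 0.0 if terminal else Q[s_next][a_next]
    delta = r + gamma * bootstrap - Q[s][a]
    Q[s][a] += alpha * delta
    return delta

# Hypothetical two-state, two-action table.
Q = [[0.0, 0.0], [1.0, 3.0]]
# The behavior policy happened to explore: it chose a_next = 0, not the greedy action 1.
delta = sarsa_update(Q, s=0, a=1, r=-1.0, s_next=1, a_next=0)
print(delta, Q[0][1])
```

Because the bootstrap term uses the sampled `a_next`, an exploratory choice in the next state lowers the value backed up into `Q[s][a]`.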

Q-learning — off-policy TD control

Swap the next-action value for the maximum over next actions and you get Q-learning:

\[Q(S_t, A_t) \;\leftarrow\; Q(S_t, A_t) + \alpha\bigl[\, R_{t+1} + \gamma\, \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \,\bigr]\]

The update target refers to the greedy policy, even though the data came from $\varepsilon$-greedy. That’s the essence of off-policy learning: the behavior policy generating the data can differ from the target policy being evaluated. In the tabular case, Q-learning converges to $Q^\star$ regardless of how exploratory the behavior policy is, as long as it keeps visiting every state-action pair and the step size $\alpha$ is decayed appropriately.

The one-line difference. SARSA uses $Q(S', A')$. Q-learning uses $\max_{a'} Q(S', a')$. That’s it. From a code perspective, the diff is one line. From a behavior perspective, it’s the difference between learning a policy that respects your exploration and learning the fearless optimum.
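To see the two targets side by side, here is a small sketch. The function names and the table values are hypothetical, and the update rule itself is shared between the two algorithms.

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    return r + gamma * Q[s_next][a_next]     # value of the action actually taken

def q_learning_target(Q, r, s_next, gamma):
    return r + gamma * max(Q[s_next])        # value of the greedy action

def td_control_update(Q, s, a, target, alpha):
    """Shared update: move Q[s][a] a fraction alpha toward the target."""
    Q[s][a] += alpha * (target - Q[s][a])

Q = [[0.0, 0.0], [1.0, 3.0]]
r, gamma, alpha = -1.0, 0.95, 0.5

# Suppose the epsilon-greedy behavior policy explored and picked a_next = 0 in state 1.
y_sarsa = sarsa_target(Q, r, s_next=1, a_next=0, gamma=gamma)
y_q = q_learning_target(Q, r, s_next=1, gamma=gamma)
print(y_sarsa, y_q)

td_control_update(Q, s=0, a=1, target=y_q, alpha=alpha)
```

On the same transition, the SARSA target reflects the exploratory step while the Q-learning target ignores it.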

The cliff — on-policy vs off-policy, visibly

The classic way to see the distinction is the cliff-walking gridworld from Sutton & Barto. Start (S) and goal (G) sit at opposite ends of the bottom row (row 3). The strip of cells between them is a cliff: step onto it and you get reward −100 and reset to start. Every other move costs −1. Both SARSA and Q-learning collect data with the same kind of behavior policy (ε-greedy with respect to their own $Q$), on the same grid, with the same hyperparameters. Watch what happens to their learned greedy policies.

Interactive figure: SARSA vs Q-learning on the cliff (on-policy vs off-policy). Controls: ε-greedy (default 0.10), learning rate α (default 0.50), discount γ (default 0.95). Readouts: episodes run and average return over the last 50 episodes, for each algorithm.

S = start, G = goal, red strip = cliff (−100 reward + reset). Arrows show the learned greedy policy, and the line traces the greedy path from S. SARSA takes the long safe route while Q-learning walks the cliff edge, which is the signature of on-policy vs off-policy learning.

The interesting thing isn’t that Q-learning’s greedy path is shorter. It’s that SARSA’s learned policy reflects the exploration policy it will actually be executing. Under ε-greedy rollout, stepping right next to the cliff is risky — one random action and you fall in, losing 100. SARSA sees those falls happen, and backs up negative value into cells adjacent to the cliff. The safe long route scores higher. Q-learning, by contrast, backs up the value of the greedy policy — which would never step off the cliff deliberately — and ends up believing the cliff-edge path is fine, even though the agent it’s training is still ε-greedy and will fall off periodically.

Crank ε up higher (say, 0.3) and the gap widens dramatically. SARSA’s policy shifts further from the cliff. Q-learning’s greedy path doesn’t change — it still believes the edge is optimal — but the average return of the actual rollouts gets worse because the ε-greedy falls are more frequent.

This is the core trade-off. If your agent will deploy with exploration (or noisy actuators, or disturbances you can’t perfectly model), SARSA gives you a policy that accounts for reality. If you can guarantee greedy execution at deployment, Q-learning gives you the optimum.
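For readers without the interactive widget, the experiment can be approximated with a short script. This is a sketch under stated assumptions: a 4×12 grid as in Sutton & Barto's layout, zero-initialized tables, 500 episodes, and a fixed seed are all choices made here, and the exact numbers printed will vary with them.

```python
import random

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]     # up, right, down, left

def step(state, action):
    """Deterministic cliff-walking dynamics: returns (next_state, reward, done)."""
    r = min(max(state[0] + MOVES[action][0], 0), ROWS - 1)
    c = min(max(state[1] + MOVES[action][1], 0), COLS - 1)
    if r == ROWS - 1 and 0 < c < COLS - 1:     # stepped onto the cliff
        return START, -100.0, False
    return (r, c), -1.0, (r, c) == GOAL

def greedy(Q, s):
    return max(range(4), key=lambda a: Q[s][a])

def act(Q, s, eps, rng):
    return rng.randrange(4) if rng.random() < eps else greedy(Q, s)

def train(method, episodes=500, alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(r, c): [0.0] * 4 for r in range(ROWS) for c in range(COLS)}
    returns = []
    for _ in range(episodes):
        s, done, total = START, False, 0.0
        a = act(Q, s, eps, rng)
        while not done:
            s2, reward, done = step(s, a)
            # Choosing a2 before the update is required for SARSA and harmless
            # for Q-learning, whose target does not depend on a2.
            a2 = act(Q, s2, eps, rng)
            if done:
                bootstrap = 0.0
            elif method == "sarsa":
                bootstrap = Q[s2][a2]          # action actually taken next
            else:
                bootstrap = max(Q[s2])         # greedy action, whatever is taken next
            Q[s][a] += alpha * (reward + gamma * bootstrap - Q[s][a])
            s, a, total = s2, a2, total + reward
        returns.append(total)
    return Q, returns

def greedy_path(Q, max_steps=100):
    """Follow the greedy policy from START; capped in case the policy loops."""
    s, path = START, [START]
    while s != GOAL and len(path) <= max_steps:
        s, _, _ = step(s, greedy(Q, s))
        path.append(s)
    return path

results = {}
for method in ("sarsa", "q_learning"):
    Q, returns = train(method)
    results[method] = (Q, returns, greedy_path(Q))
    print(method, "greedy path length:", len(results[method][2]) - 1,
          "avg return (last 50):", sum(returns[-50:]) / 50)
```

The shortest possible route is 13 steps along the cliff edge. Comparing the two printed path lengths and average returns, and rerunning with `eps=0.3`, is a quick way to check the behavior described above.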

On-policy vs off-policy, precisely defined

The distinction is simple but gets tangled in practice. Two policies are at play:

Behavior policy $b$ — the one that generates the data (rollouts, replay-buffer entries).
Target policy $\pi$ — the one whose value or performance you’re trying to learn or improve.

On-policy methods require $b = \pi$. Each update uses data generated by the current policy and only the current policy. PPO, A2C, TRPO, and SARSA are on-policy. This is why on-policy methods discard rollouts after a single update (or, in PPO’s case, a few clipped epochs over the same batch): the data stops being valid once the policy has changed.

Off-policy methods allow $b \neq \pi$. Data from any past policy (or a human demonstrator, or a scripted controller) can be reused. Q-learning, DQN, SAC, TD3, and DDPG are off-policy. This is what makes replay buffers work.
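A replay buffer is little more than a bounded queue of transitions. The sketch below is a hypothetical minimal version (the class name, capacity, and toy transitions are assumptions), paired with a tabular Q-learning update that consumes sampled batches.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions (s, a, r, s_next, done)."""

    def __init__(self, capacity, seed=0):
        self.data = deque(maxlen=capacity)    # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return self.rng.sample(list(self.data), batch_size)

def q_learning_batch(Q, batch, alpha=0.5, gamma=0.95):
    """Replay stored transitions; the max target never asks which policy chose a."""
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else max(Q[s_next])
        Q[s][a] += alpha * (r + gamma * bootstrap - Q[s][a])

buf = ReplayBuffer(capacity=3)
for t in range(5):                            # the first two transitions get evicted
    buf.add(s=t % 2, a=0, r=-1.0, s_next=(t + 1) % 2, done=False)

Q = [[0.0, 0.0], [0.0, 0.0]]
q_learning_batch(Q, buf.sample(2))
print(len(buf.data), Q)
```

Nothing in `q_learning_batch` needs to know which policy produced the stored action, which is why stale data remains usable.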

Importance sampling — the off-policy correction

If you want to use data generated under $b$ to estimate expectations under $\pi$, the samples must be reweighted:

\[\mathbb{E}_\pi[f(X)] \;=\; \mathbb{E}_b\!\left[\, \frac{\pi(X)}{b(X)} f(X) \,\right]\]

The ratio $\rho = \pi/b$ is the importance sampling weight. For a whole trajectory the weight becomes a product $\prod_t \rho_t$ of per-step ratios $\rho_t = \pi(A_t \mid S_t)/b(A_t \mid S_t)$, whose variance can grow exponentially with the horizon, which is why off-policy Monte Carlo is rarely practical for long episodes.
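A tiny exact calculation illustrates both the identity and the variance problem. The two-action distributions and the function `f` are made-up numbers, and the trajectory calculation assumes the per-step weights are independent and identically distributed, which is a simplification.

```python
pi = {"left": 0.9, "right": 0.1}      # target policy (hypothetical numbers)
b = {"left": 0.5, "right": 0.5}       # behavior policy
f = {"left": 1.0, "right": -3.0}      # any function of the sampled action

rho = {x: pi[x] / b[x] for x in pi}   # importance sampling weights

# Exact expectations, computed by summing over the two outcomes.
e_pi = sum(pi[x] * f[x] for x in pi)
e_b_weighted = sum(b[x] * rho[x] * f[x] for x in b)

# A single weight has mean 1 under b and second moment sum(pi^2 / b).
second_moment = sum(b[x] * rho[x] ** 2 for x in b)

def trajectory_weight_variance(T):
    """Variance of a product of T independent per-step weights."""
    return second_moment ** T - 1.0

print(e_pi, e_b_weighted)
for T in (1, 10, 50):
    print(T, trajectory_weight_variance(T))
```

The reweighted expectation matches exactly, but the variance of the product grows geometrically in the horizon `T`.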

TD-style off-policy methods (Q-learning, DQN) sidestep this almost entirely, because the one-step max-action target doesn’t depend on the behavior distribution. Multi-step off-policy methods have to correct for the mismatch: V-trace in IMPALA and Retrace use truncated IS ratios to keep the variance bounded, while Tree-Backup avoids ratios altogether by weighting each step with target-policy probabilities. We’ll see this again when we get to modern deep RL.

n-step TD and TD(λ) — bridging Monte Carlo and TD

TD(0) bootstraps after one step. Monte Carlo waits until the episode ends. You can sit anywhere between them with the n-step return:

\[G_t^{(n)} \;=\; R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})\]

Larger $n$ means less bias (more real reward, less bootstrapping) but more variance. $n \in \{3, 5, 10\}$ is a common sweet spot in deep RL: it appears explicitly in Rainbow DQN and implicitly everywhere else.
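The n-step return is easy to compute from a stored episode. In this sketch the indexing convention (`rewards[k]` holds $R_{k+1}$, `values[k]` holds $V(S_k)$) and the toy numbers are assumptions made for the example.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n): up to n discounted rewards, then a bootstrap from V(S_{t+n}).

    rewards[k] holds R_{k+1} (the reward for leaving S_k); values[k] holds V(S_k).
    If the episode ends before step t+n, the sum is truncated and nothing is
    bootstrapped, because the terminal state has value 0.
    """
    T = len(rewards)                           # episode length
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:                                # S_end is non-terminal
        g += gamma ** (end - t) * values[end]
    return g

rewards = [-1.0, -1.0, -1.0, 10.0]             # hypothetical 4-step episode
values = [2.0, 3.0, 5.0, 8.0]                  # current estimates V(S_0)..V(S_3)
gamma = 0.9

for n in (1, 2, 4):
    print(n, n_step_return(rewards, values, t=0, n=n, gamma=gamma))
```

With `n=1` this is the TD(0) target, and once `n` reaches the episode length it is the Monte Carlo return.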

TD(λ) is an exponentially-weighted average over every n-step return, with weight $(1-\lambda)\lambda^{n-1}$. It’s elegantly implementable online via eligibility traces — a vector $\mathbf{e}_t$ that remembers recently visited states:

\[\mathbf{e}_t \;=\; \gamma\lambda\, \mathbf{e}_{t-1} + \nabla_{\mathbf{w}} V(S_t), \qquad \mathbf{w} \;\leftarrow\; \mathbf{w} + \alpha\,\delta_t\, \mathbf{e}_t\]
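In the tabular case the gradient of $V(S_t)$ with respect to the weights is a one-hot vector, so the trace update reduces to "decay everything, then add 1 at the visited state". A minimal sketch with accumulating traces follows; the function name and the three-state chain are illustrative assumptions.

```python
def td_lambda_episode(V, episode, alpha=0.1, gamma=0.95, lam=0.8):
    """Online TD(lambda) with accumulating traces on a tabular V (a list of floats).

    episode is a list of (s, r, s_next, done) tuples.
    """
    e = [0.0] * len(V)                          # eligibility trace, reset per episode
    for s, r, s_next, done in episode:
        e = [gamma * lam * e_i for e_i in e]    # decay every trace
        e[s] += 1.0                             # add the gradient (one-hot at s)
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]                   # TD error
        for i in range(len(V)):
            V[i] += alpha * delta * e[i]        # credit all recently visited states
    return V

# Hypothetical chain 0 -> 1 -> 2 -> terminal with a single reward at the end.
episode = [(0, 0.0, 1, False), (1, 0.0, 2, False), (2, 1.0, None, True)]
print(td_lambda_episode([0.0] * 3, episode, lam=0.0))   # only the last state moves
print(td_lambda_episode([0.0] * 3, episode, lam=0.8))   # credit reaches earlier states
```

Setting `lam=0.0` recovers TD(0), while larger values push the final TD error back along the trajectory within a single episode.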

Eligibility traces are beautiful. They also rarely survive into modern deep RL, which prefers fixed-$n$ returns or the closely related GAE(λ) advantage estimator: a TD(λ)-style exponentially weighted sum of TD errors over the rollout, with the resulting advantages commonly standardized across a batch. We’ll derive GAE when we hit policy gradients.

The deadly triad

Sutton and Barto coined the term for three ingredients whose combination breaks classical convergence guarantees:

Function approximation — needed for any real state space (images, continuous physics).
Bootstrapping — using current estimates in the update target (the whole point of TD).
Off-policy learning — reusing data (the whole point of replay buffers).

Each pair is safe. Combining all three, which is exactly what DQN and every off-policy deep RL method does, admits counterexamples where the value estimate diverges even in toy problems. Baird’s star MDP is the canonical example: a seven-state MDP on which off-policy semi-gradient TD with linear function approximation sends the weights to infinity, even with a small learning rate. This is why deep RL stabilization is an ongoing research program, not a solved problem.
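Baird's counterexample is small enough to reproduce in a few lines. The sketch below uses the expected (sweep-style) semi-gradient update rather than sampled transitions so that the run is deterministic. The feature layout, $\gamma = 0.99$, $\alpha = 0.01$, and the initial weights follow Sutton & Barto's presentation of the example as recalled here, so treat them as assumptions rather than as the post's own setup.

```python
GAMMA, ALPHA = 0.99, 0.01

# Linear features: v(s_i) = 2*w[i] + w[7] for the six upper states,
# and v(s_7) = w[6] + 2*w[7] for the seventh state. Eight weights, seven states.
X = [[0.0] * 8 for _ in range(7)]
for i in range(6):
    X[i][i], X[i][7] = 2.0, 1.0
X[6][6], X[6][7] = 1.0, 2.0

def value(w, s):
    return sum(x * wi for x, wi in zip(X[s], w))

def norm(w):
    return sum(wi * wi for wi in w) ** 0.5

def sweep(w):
    """One expected semi-gradient TD update, averaged uniformly over all 7 states.

    The target policy always moves to the seventh state with reward 0, so every
    state's target is GAMMA * v(s_7). The uniform state weighting matches the
    behavior policy's state distribution, not the target policy's: that
    mismatch is the off-policy ingredient.
    """
    v7 = value(w, 6)
    new_w = list(w)
    for s in range(7):
        delta = GAMMA * v7 - value(w, s)        # expected TD error at state s
        for j in range(8):
            new_w[j] += (ALPHA / 7) * delta * X[s][j]
    return new_w

w0 = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 10.0, 1.0]
w = list(w0)
for k in range(2000):
    w = sweep(w)
    if (k + 1) % 500 == 0:
        print(k + 1, round(norm(w), 1))
```

All rewards are zero, so the true value function is zero everywhere and is exactly representable with `w = 0`, yet the weight norm keeps growing.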

The standard mitigations baked into DQN and its descendants:

Target networks — use a slowly updated copy $Q_{\bar\theta}$ in the bootstrapping target, so $\theta$ isn’t chasing its own tail.
Experience replay — break temporal correlation and make the update approximately i.i.d.
Clipped or Huber loss — dampen the effect of large TD errors early in training.
Soft updates — Polyak-averaged target network: $\bar\theta \leftarrow \tau\theta + (1-\tau)\bar\theta$, typically $\tau = 0.005$.
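Three of these mitigations fit in a few lines each. The sketch below uses plain Python lists as stand-ins for network parameters, and the function names and default values (`kappa=1.0`, `gamma=0.99`) are assumptions for illustration rather than any particular library's API.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: move each target parameter a fraction tau toward the online one."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]

def huber(delta, kappa=1.0):
    """Quadratic for |delta| <= kappa, linear beyond, so large TD errors give bounded gradients."""
    a = abs(delta)
    return 0.5 * delta * delta if a <= kappa else kappa * (a - 0.5 * kappa)

def td_target(r, q_next_target, gamma=0.99, done=False):
    """Bootstrapped target computed from the target network's next-state values."""
    return r if done else r + gamma * max(q_next_target)

theta = [1.0, -2.0]           # online parameters (hypothetical)
theta_bar = [0.0, 0.0]        # target parameters
theta_bar = soft_update(theta_bar, theta)

delta = td_target(1.0, [2.0, 3.0]) - 0.5      # TD error against an online estimate of 0.5
print(theta_bar, delta, huber(delta), huber(10.0))
```

With `tau=0.005` the target parameters move only 0.5% of the way per update, which is the lag that keeps the bootstrapping target from chasing the online network.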

Rule of thumb. If you ever see your critic loss climb without bound, or your Q-values grow to $10^4$ on a problem whose returns are $\mathcal{O}(10^2)$, the deadly triad is at work. Check (in order): target-network lag, reward scaling, discount factor, and whether your replay buffer has gone stale relative to the current policy.

What to take away

The TD error is the quantity that flows through every modern RL algorithm. Once you see it in SARSA and Q-learning, you’ll see it again in actor-critic (as the advantage signal), in DQN (as the loss), in PPO (inside GAE), and in SAC (in the soft critic update that feeds the entropy-regularized policy improvement).

The SARSA / Q-learning split previews the on-policy / off-policy split that defines the rest of the field. The next post picks up policy gradients and actor-critic — where the TD error stops being a value-function update and becomes the signal that tells a neural-network policy which way to move.
