Foundations: MDPs, Value Functions, and the Bellman Equations

This is the second post in my field guide to reinforcement learning for control and robotics. The hub post laid out the map — the taxonomy, a five-question algorithm selector, the rules of thumb. This one starts filling in the detail. We begin where every RL algorithm begins: writing down the Markov Decision Process.

What RL actually is, in one paragraph

Reinforcement learning is the study of how to make good decisions in environments where the consequences of those decisions unfold over time and the best decisions have to be learned from experience rather than derived from a known model. An agent observes its environment, picks an action, receives a reward and a new observation, and uses that signal to gradually improve its decision-making. That’s the whole object. Everything in this field — Q-learning, PPO, SAC, model-based RL — is some strategy for doing that improvement well.

Nearly every RL problem is formalized as an MDP. If you come from control, you already know this object; you just called it a discrete-time stochastic state-space model with a cost functional. The machinery is nearly identical. The differences are small but they matter: RL maximizes where control minimizes, the dynamics are unknown rather than given, and the solution comes out as a policy rather than a closed-form control law. Once you see those three translations, the rest of the field falls into place.

The agent-environment loop

Every RL algorithm, no matter how sophisticated, is built on one cycle that repeats forever: the agent observes, acts, and learns. The widget below shows that cycle live. Hit play to watch the loop run; each tick, the agent picks an action based on its current observation, the environment transitions to a new state, and a reward is emitted back. Slow it down with the speed slider to see what’s happening at each step.

Interactive widget: the agent-environment loop

The agent (left) sees an observation, picks an action, and sends it to the environment (right). The environment updates its state and returns a new observation plus a reward. Every RL algorithm is a different way to choose actions that make the cumulative reward grow.
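For readers following along without the widget, the same cycle can be sketched in a few lines of Python. This is a minimal illustration, not the widget's code: `ToyEnv`, `random_agent`, `greedy_agent`, and `run_loop` are hypothetical names, and the 1-D "walk toward zero" task is an assumption chosen only because it is easy to check by hand.

```python
import random


class ToyEnv:
    """Hypothetical 1-D environment: the state is an integer position,
    and the reward is higher (less negative) the closer the state is to zero."""

    def __init__(self, start=5):
        self.start = start
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        # action is -1 or +1; the reward is the negative distance from zero
        self.state += action
        reward = -abs(self.state)
        return self.state, reward


def random_agent(observation, rng):
    """Placeholder policy: ignores the observation and acts at random."""
    return rng.choice([-1, 1])


def greedy_agent(observation, rng):
    """Hand-written policy: always step toward zero."""
    return -1 if observation > 0 else 1


def run_loop(env, agent, n_steps, seed=0):
    rng = random.Random(seed)
    obs = env.reset()
    cumulative_reward = 0.0
    for _ in range(n_steps):
        action = agent(obs, rng)        # agent picks an action from its observation
        obs, reward = env.step(action)  # environment transitions and emits a reward
        cumulative_reward += reward     # the quantity every RL algorithm tries to grow
    return cumulative_reward


print(run_loop(ToyEnv(start=5), greedy_agent, n_steps=10))  # -13.0
print(run_loop(ToyEnv(start=5), random_agent, n_steps=10))
```

Nothing is learned here; the point is only the shape of the loop. A learning algorithm replaces the body of the agent function with something that improves from the rewards it sees.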

With the loop in mind, the formal object is what you’d expect.

The MDP, formally

An MDP is the tuple $\mathcal{M} = \langle \mathcal{S},\, \mathcal{A},\, P,\, R,\, \gamma \rangle$: the state space $\mathcal{S}$, the action space $\mathcal{A}$, the transition kernel $P(s' \mid s, a)$, the reward function $R(s, a)$, and the discount factor $\gamma \in [0, 1)$. That’s the whole object. Every algorithm we’ll cover in this series is some strategy for finding a good policy in this structure.
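When the state and action sets are small and finite, the tuple maps directly onto a data structure. The sketch below is one possible encoding, assuming dictionary-based tables; `TabularMDP`, `validate`, and the two-state `toy` example are hypothetical names and numbers made up for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class TabularMDP(NamedTuple):
    states: List[str]
    actions: List[str]
    # P[(s, a)] maps each next state s' to its probability P(s' | s, a)
    P: Dict[Tuple[str, str], Dict[str, float]]
    # R[(s, a)] is the immediate reward R(s, a)
    R: Dict[Tuple[str, str], float]
    gamma: float


def validate(mdp):
    """Check that the tables define a proper MDP with gamma in [0, 1)."""
    if not 0.0 <= mdp.gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    for s in mdp.states:
        for a in mdp.actions:
            if (s, a) not in mdp.P or (s, a) not in mdp.R:
                raise ValueError(f"missing entry for {(s, a)}")
            probs = mdp.P[(s, a)]
            if any(s2 not in mdp.states for s2 in probs):
                raise ValueError(f"unknown next state in P[{(s, a)}]")
            if abs(sum(probs.values()) - 1.0) > 1e-9:
                raise ValueError(f"P[{(s, a)}] does not sum to 1")
    return True


# Made-up example: a robot is either "far" from or "near" its target.
toy = TabularMDP(
    states=["far", "near"],
    actions=["wait", "move"],
    P={
        ("far", "wait"): {"far": 1.0},
        ("far", "move"): {"near": 0.8, "far": 0.2},
        ("near", "wait"): {"near": 1.0},
        ("near", "move"): {"far": 0.8, "near": 0.2},
    },
    R={
        ("far", "wait"): 0.0,
        ("far", "move"): -0.1,
        ("near", "wait"): 1.0,
        ("near", "move"): -0.1,
    },
    gamma=0.95,
)

print(validate(toy))  # True
```

In robotics the state and action spaces are usually continuous, so these tables become functions, but the five ingredients stay the same.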

The vocabulary, decoded

Here’s the glossary every paper assumes you know. Reading it once unlocks 80% of the RL literature.

State $s_t$ is the information needed to predict the future. For a 7-DOF manipulator, this is joint angles, velocities, end-effector pose, maybe the target. The Markov property says $s_t$ summarizes everything relevant from history. When it’s violated — partial observability, hidden modes — you either stack frames into the observation or use a recurrent policy.
Action $a_t$ is what the agent commands. In robotics this is almost always continuous: torques, velocities, or end-effector twists. This one fact rules out a huge class of classical RL methods, which I’ll get to in the next post.
Reward $r_t = R(s_t, a_t)$ is the scalar the agent wants to maximize. In control terms, it’s negative running cost. RL maximizes; LQR minimizes. Same math, opposite sign.
Policy $\pi$ is the controller. Either deterministic $a = \pi(s)$ or stochastic $a \sim \pi(\cdot \mid s)$. In deep RL it’s a neural network.
Trajectory / rollout / episode is a sequence $(s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ generated by running $\pi$ in the environment. Episodes end on termination — task done, fell over, time limit reached.
Return $G_t$ is the discounted cumulative reward from time $t$: $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$. This is the objective the agent is actually maximizing — not the instantaneous reward.
Discount $\gamma$ controls how much the agent cares about the future. $\gamma = 0$ is myopic; $\gamma \to 1$ is farsighted. Mathematically, $\gamma < 1$ guarantees that the infinite-horizon return stays finite when rewards are bounded; practically it’s also a bias-variance knob.

Terminology trap. Reward and return are not the same thing. Reward is one step. Return is the discounted sum from that step to the end of the trajectory. Confusing them will make every RL paper look wrong.
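The difference between reward and return is easy to see in code. The sketch below computes $G_t$ for every step of one finite episode using the recursion $G_t = r_t + \gamma G_{t+1}$, which is equivalent to the sum in the glossary; `discounted_returns` is a hypothetical helper name and the reward list is made up.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] for one finite episode.

    Works backwards through the episode with G_t = r_t + gamma * G_{t+1},
    treating the return after the final step as zero.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A sparse-reward episode: nothing until the final step.
rewards = [0.0, 0.0, 1.0]
print(discounted_returns(rewards, gamma=0.5))  # [0.25, 0.5, 1.0]
```

The rewards at the first two steps are zero, yet the returns are not: the return already accounts for the reward that is still to come.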

The discount factor, visually

Slide $\gamma$ below and watch how the present-value weighting changes. A reward $r = 1$ received $k$ steps from now is worth $\gamma^k$ today. The cyan dashed line marks the effective planning horizon $\frac{1}{1-\gamma}$ — the time at which the weight has dropped to about 37% of its starting value.
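For readers who want to see where the 37% comes from, here is a short derivation. It relies only on the first-order approximation $\ln \gamma \approx -(1 - \gamma)$, which is accurate when $\gamma$ is close to 1:

\[\gamma^{1/(1-\gamma)} = \exp\!\left(\frac{\ln \gamma}{1 - \gamma}\right) \approx \exp(-1) \approx 0.37\]

As a check, $0.99^{100} \approx 0.366$ and $0.95^{20} \approx 0.358$. The same quantity is also the total weight of the discounted sum, $\sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}$, which is why it reads naturally as a horizon.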

Choosing γ in practice

The effective planning horizon is approximately $\frac{1}{1 - \gamma}$. Write it on a sticky note — you’ll use it every time you start a new task.

$\gamma = 0.99$ → horizon ≈ 100 steps. Default for most robotics tasks at 50–100 Hz control rates.
$\gamma = 0.95$ → horizon ≈ 20 steps. Good for short episodic tasks like reaching or grasping.
$\gamma = 0.999$ → horizon ≈ 1000 steps. Long-horizon locomotion. Rarely needed; hurts sample efficiency.
$\gamma = 1$ → only legal for strictly finite episodes with bounded reward. Otherwise the return diverges and your value estimates explode.

Match γ to your task horizon, not to the paper you’re copying. A locomotion paper that used $\gamma = 0.99$ at 1000 Hz was effectively looking 100 ms ahead. That same $\gamma$ at 50 Hz looks 2 seconds ahead. Different problems. People reproduce published numbers and find their agent can’t learn — nine times out of ten, it’s a control-rate-discount mismatch.
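The conversion between discount, control rate, and wall-clock horizon is simple enough to script. The sketch below just applies the $\frac{1}{1-\gamma}$ rule in both directions; `horizon_seconds` and `gamma_for_horizon` are hypothetical helper names, and the rule itself is the approximation from the text, not an exact planning depth.

```python
def horizon_seconds(gamma, control_rate_hz):
    """Effective horizon in seconds: 1 / (1 - gamma) steps at the given rate."""
    return 1.0 / ((1.0 - gamma) * control_rate_hz)


def gamma_for_horizon(horizon_s, control_rate_hz):
    """Discount whose effective horizon covers horizon_s seconds at the given rate."""
    steps = horizon_s * control_rate_hz
    if steps <= 1.0:
        raise ValueError("horizon must span more than one control step")
    return 1.0 - 1.0 / steps


print(horizon_seconds(0.99, 1000))   # about 0.1 s
print(horizon_seconds(0.99, 50))     # about 2 s
# To keep a 2 s lookahead when moving from 50 Hz to 1000 Hz:
print(gamma_for_horizon(2.0, 1000))  # about 0.9995
```

Reproducing a paper at a different control rate therefore means recomputing $\gamma$ from the horizon in seconds, not copying the number.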

Value functions

The central object in RL is not the policy itself but the value of a policy: how much return you expect to collect if you follow it. There are two flavors, and the difference matters more than it looks.

\[V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^\infty \gamma^k r_{t+k} \,\Big|\, s_t = s\right]\]

\[Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^\infty \gamma^k r_{t+k} \,\Big|\, s_t = s,\, a_t = a\right]\]

$V$ tells you how good a state is under the current policy. $Q$ tells you how good it is to take action $a$ right now and then follow the policy. That extra degree of freedom in $Q$ is what lets you improve the policy — pick the action with the highest $Q$, and you’ve made progress.

The advantage function

The advantage is the single most useful derived quantity in modern RL:

\[A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)\]

It answers “how much better than average is this action?” It is zero-mean by construction under $\pi$, because $V^\pi(s) = \mathbb{E}_{a \sim \pi}[Q^\pi(s, a)]$. Almost every modern policy-gradient method (PPO, A2C, TRPO) uses advantage estimates rather than raw returns because the variance is typically much smaller. We’ll derive that in the policy-gradient post.
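A small numeric check makes the zero-mean property concrete. The sketch below works on a single state with three discrete actions; the Q-values and policy probabilities are made up, and `state_value` and `advantages` are hypothetical helper names.

```python
def state_value(q_row, pi_row):
    """V(s) = sum over a of pi(a|s) * Q(s, a)."""
    return sum(p * q for p, q in zip(pi_row, q_row))


def advantages(q_row, pi_row):
    """A(s, a) = Q(s, a) - V(s) for every action in one state."""
    v = state_value(q_row, pi_row)
    return [q - v for q in q_row]


q_row = [1.0, 2.0, 4.0]      # made-up Q(s, a) for three actions
pi_row = [0.5, 0.25, 0.25]   # made-up policy probabilities pi(a | s)

adv = advantages(q_row, pi_row)
print(state_value(q_row, pi_row))                # 2.0
print(adv)                                       # [-1.0, 0.0, 2.0]
print(sum(p * a for p, a in zip(pi_row, adv)))   # 0.0
```

The advantages are centered on the policy's own average, so a positive value means "better than what the policy would do on average here" regardless of how large the raw returns are.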

The Bellman equations — the recursion that runs RL

The defining property of value functions: the value at a state equals the immediate reward plus the discounted value of where you end up. Writing this recursion out gives the Bellman expectation equation:

\[Q^\pi(s,a) = \mathbb{E}_{s' \sim P}\!\left[ r + \gamma\, \mathbb{E}_{a' \sim \pi}[Q^\pi(s', a')] \right]\]

And for the optimal $Q^\star$, the Bellman optimality equation:

\[Q^\star(s,a) = \mathbb{E}_{s' \sim P}\!\left[ r + \gamma\, \max_{a'} Q^\star(s', a') \right]\]

That max is the only difference between evaluating a fixed policy and finding the best one. It’s also the reason Q-learning works at all — and the reason it struggles with continuous actions, where you can’t enumerate the max without some approximation.
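To make the "only difference is the max" point concrete, here is a sketch that runs both backups on a tiny tabular MDP with known dynamics. Everything in it is an illustrative assumption: the two-state problem, the names `bellman_backup` and `solve`, and the choice of a fixed number of sweeps instead of a convergence test.

```python
def bellman_backup(Q, states, actions, P, R, gamma, policy=None):
    """One synchronous sweep of the Bellman backup over every (s, a).

    policy=None  -> optimality backup (max over a').
    policy given -> expectation backup (average over a' ~ pi).
    """
    new_Q = {}
    for s in states:
        for a in actions:
            total = 0.0
            for s2, prob in P[(s, a)].items():
                if policy is None:
                    next_value = max(Q[(s2, a2)] for a2 in actions)
                else:
                    next_value = sum(policy[s2][a2] * Q[(s2, a2)] for a2 in actions)
                total += prob * (R[(s, a)] + gamma * next_value)
            new_Q[(s, a)] = total
    return new_Q


def solve(states, actions, P, R, gamma, policy=None, n_sweeps=1000):
    """Repeat the backup from Q = 0; with gamma < 1 this converges to the fixed point."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_sweeps):
        Q = bellman_backup(Q, states, actions, P, R, gamma, policy)
    return Q


# Made-up deterministic MDP: waiting while "near" pays 1, everything else pays 0.
states, actions, gamma = ["far", "near"], ["wait", "move"], 0.9
P = {
    ("far", "wait"): {"far": 1.0}, ("far", "move"): {"near": 1.0},
    ("near", "wait"): {"near": 1.0}, ("near", "move"): {"far": 1.0},
}
R = {("far", "wait"): 0.0, ("far", "move"): 0.0,
     ("near", "wait"): 1.0, ("near", "move"): 0.0}

uniform = {s: {a: 0.5 for a in actions} for s in states}
Q_star = solve(states, actions, P, R, gamma)                # optimality equation
Q_pi = solve(states, actions, P, R, gamma, policy=uniform)  # expectation equation

print(round(Q_star[("near", "wait")], 3))  # 10.0
print(round(Q_pi[("near", "wait")], 3))    # 3.475
```

Both calls share every line except the one that aggregates over the next action. With two actions the max is trivial to compute; with a continuous torque vector it becomes an optimization problem in its own right.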

Control engineer’s decoder ring. The Bellman equation is the discrete-time, stochastic, reward-signed analog of the Hamilton–Jacobi–Bellman PDE you’ve seen in optimal control. Value iteration is dynamic programming: backward induction from the terminal state when the horizon is finite, and repeated Bellman backups until a fixed point when the problem is discounted and infinite-horizon. LQR is the closed-form solution when dynamics are linear and rewards are quadratic, in which case $V$ is a quadratic form and you recover the Riccati equation.
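The LQR connection can be checked numerically in the scalar case. The sketch below assumes dynamics $x' = a x + b u$, per-step cost $q x^2 + r u^2$, and no discount, which is the standard control convention (in RL terms the reward is the negative of that cost, so $V(x) = -p x^2$). Minimizing over $u$ in the Bellman backup keeps the cost-to-go quadratic, so value iteration collapses to a recursion on the single number $p$. The function name `scalar_riccati` and the parameter values are illustrative.

```python
def scalar_riccati(a, b, q, r, n_iters=200):
    """Value iteration for scalar LQR, written as the Riccati recursion.

    If the cost-to-go is p * x^2, one Bellman backup with the minimizing u gives
        p <- q + a^2 p - (a b p)^2 / (r + b^2 p).
    Returns the converged p and the feedback gain k of the law u = -k * x.
    """
    p = 0.0
    for _ in range(n_iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    gain = a * b * p / (r + b * b * p)
    return p, gain


p, gain = scalar_riccati(a=1.0, b=1.0, q=1.0, r=1.0)
print(round(p, 6), round(gain, 6))  # 1.618034 0.618034
```

For these unit parameters the fixed point satisfies $p^2 - p - 1 = 0$, so $p$ is the golden ratio. The loop is ordinary value iteration; the only reason it is one line is that a quadratic value function is closed under the backup.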

Q-learning on a gridworld

Nothing makes the recursion concrete like watching it run. Below is Q-learning on a 10×8 gridworld. Start (cyan square) is bottom-left, goal (orange ★) is top-right, obstacles are the dark-grey walls. Colors show $\max_a Q(s, a)$; arrows show the greedy policy. Click Train 100 ep and watch the value propagate backwards from the goal, step by step.
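If you cannot run the widget, the following sketch reproduces the same experiment in plain Python. It is not the widget's source: the wall layout, the reward (+1 on reaching the goal, 0 otherwise), the hyperparameters, and the names `step`, `greedy_action`, `train`, and `greedy_path_length` are all assumptions made for illustration.

```python
import random

WIDTH, HEIGHT = 10, 8
START, GOAL = (0, 0), (9, 7)                   # bottom-left and top-right; y grows upward
WALLS = {(4, y) for y in range(2, 7)}          # hypothetical wall layout
ACTIONS = [(0, 1), (0, -1), (-1, 0), (1, 0)]   # up, down, left, right


def step(state, action):
    """Deterministic move; hitting a wall or the border leaves the state unchanged."""
    nxt = (state[0] + action[0], state[1] + action[1])
    if not (0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT) or nxt in WALLS:
        nxt = state
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def greedy_action(Q, state, rng):
    """Index of a highest-valued action, breaking ties at random."""
    best = max(Q[state])
    return rng.choice([i for i, q in enumerate(Q[state]) if q == best])


def train(episodes=500, alpha=0.5, gamma=0.95, epsilon=0.2, max_steps=1000, seed=0):
    rng = random.Random(seed)
    Q = {(x, y): [0.0] * len(ACTIONS) for x in range(WIDTH) for y in range(HEIGHT)}
    for _ in range(episodes):
        state = START
        for _ in range(max_steps):
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = greedy_action(Q, state, rng)
            nxt, reward, done = step(state, ACTIONS[a])
            # Sampled Bellman optimality target; no bootstrapping past the goal.
            target = reward if done else reward + gamma * max(Q[nxt])
            Q[state][a] += alpha * (target - Q[state][a])
            state = nxt
            if done:
                break
    return Q


def greedy_path_length(Q, max_steps=WIDTH * HEIGHT, seed=0):
    """Follow the greedy policy from START; return the step count, or None on failure."""
    rng = random.Random(seed)
    state, steps = START, 0
    while state != GOAL and steps < max_steps:
        state, _, _ = step(state, ACTIONS[greedy_action(Q, state, rng)])
        steps += 1
    return steps if state == GOAL else None


Q = train()
print("greedy path length:", greedy_path_length(Q))
print("value at start:", max(Q[START]))
```

The update inside `train` is the Bellman optimality equation with the expectation replaced by a single sampled transition. Printing `max(Q[s])` over the grid after a handful of episodes shows the same picture as the widget: nonzero values appear next to the goal first and spread outward.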

What the gridworld is really teaching you

Three phenomena are visible in that widget that show up in every RL algorithm you’ll ever write:

Credit propagation is slow. With one-step updates, reward information flows backwards one step per update. Even on a tiny grid this takes many episodes. On a real robot with a sparse reward, this is often why your agent appears to learn nothing for thousands of episodes: the reward signal hasn’t had time to reach the starting state yet.
Exploration is not free. Drop the exploration rate $\epsilon$ to 0 and the agent gets stuck following whatever wrong idea it had first. Push $\epsilon$ to 1 and it will never commit to a good path. This trade-off never goes away; modern methods (SAC, curiosity-driven exploration, parameter noise) just handle it more cleverly.
The learning rate $\alpha$ is a bias/variance knob. Small $\alpha$ is stable but slow. Large $\alpha$ is fast but oscillates. Same story as every filter you’ve ever tuned.
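The filter analogy in the last point is literal: the tabular update $Q \leftarrow Q + \alpha\,(\text{target} - Q)$ is a first-order low-pass filter on the stream of targets. The sketch below isolates that update on a single scalar estimate; the target sequences are made up, and `track` and `ripple` are hypothetical helper names.

```python
def track(targets, alpha, initial=0.0):
    """Apply estimate <- estimate + alpha * (target - estimate) along a target stream."""
    estimate, history = initial, []
    for target in targets:
        estimate += alpha * (target - estimate)
        history.append(estimate)
    return history


def ripple(history, tail=10):
    """Peak-to-peak swing of the estimate over the last few updates."""
    recent = history[-tail:]
    return max(recent) - min(recent)


# Step response to a constant target of 1.0: the error shrinks by (1 - alpha) per update.
slow = track([1.0] * 10, alpha=0.1)
fast = track([1.0] * 10, alpha=0.9)
print(round(slow[-1], 4), round(fast[-1], 4))

# Noisy targets alternating around 1.0: a large alpha passes the noise straight through.
noisy = [1.5, 0.5] * 50
print(round(ripple(track(noisy, alpha=0.1)), 4))
print(round(ripple(track(noisy, alpha=0.9)), 4))
```

Small $\alpha$ takes many updates to reach the target but barely moves under noise; large $\alpha$ gets there almost immediately and then keeps chasing every noisy sample.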

Rule of thumb. If your state and action spaces are discrete, you can write them down on paper, and the number of states is $\lesssim 10^6$, tabular methods (Q-learning, SARSA) will usually solve your problem faster and more reliably than any deep method. Don’t reach for neural networks until the problem demands it. A lot of robotics “RL failures” are really “we used a neural net when a Q-table would have worked.”

The plan from here

You now have the vocabulary and the central recursion. The next post unpacks how to actually learn those value functions when you don’t know the dynamics — temporal-difference learning, SARSA, and the relationship between on-policy and off-policy updates. That’s where the tabular Q-learning you just watched gets its legs.
