← Back to Autonomy
Interactive Field Guide · By Majid Mazouchi · Rev. 1.0

Optimal Control — the theory, the practice, the rules of thumb.

A working engineer's guide to optimal control: from the calculus of variations to Pontryagin's minimum principle, from LQR and Riccati equations to model predictive control and nonlinear trajectory optimization. Every section is paired with runnable simulations, practical implementation notes for embedded/real-time systems, and field-tested tuning heuristics.

Audience
Control engineers and graduate students who want intuition and numerical rigor.
Depth
Standard derivations, but the focus is on what's true, what breaks, and what to do about it.
Interactivity
Four live simulations: LQR tuning, Riccati convergence, MPC horizon, bang-bang control.
Practical slant
Embedded constraints, discretization, numerical conditioning, real-time solver notes.

Contents

01
Foundations

What problem does "optimal control" actually solve?

A dynamical system has states $x$ and inputs $u$. You choose $u(t)$ over time to minimize a cost — subject to physics. That's it. Everything else is mathematical machinery for doing it well.

The canonical continuous-time problem

Given dynamics $\dot{x} = f(x,u,t)$ with initial state $x(t_0) = x_0$, find the input trajectory $u(\cdot)$ on $[t_0, t_f]$ that minimizes the Bolza cost functional:

Bolza form $$J[u(\cdot)] = \underbrace{\phi(x(t_f), t_f)}_{\text{terminal / Mayer term}} + \int_{t_0}^{t_f} \underbrace{L(x(\tau), u(\tau), \tau)}_{\text{running / Lagrange term}} \, d\tau$$

Subject to path constraints: $g(x,u,t) \le 0$, terminal constraints: $\psi(x(t_f), t_f) = 0$, and possibly state/control bounds: $u(t) \in \mathcal{U}$, $x(t) \in \mathcal{X}$.

Three common degenerate forms you'll see named in the wild:

Lagrange problem
Only the running term. $J = \int L\,d\tau$. Typical for tracking with control effort: $L = \|x - x_{\text{ref}}\|_Q^2 + \|u\|_R^2$.
Mayer problem
Only the terminal term. $J = \phi(x(t_f))$. Example: maximize final battery charge, final projectile range, or miss-distance.
Minimum-time problem
$L \equiv 1$, so $J = t_f - t_0$. The free terminal time makes this structurally different — the optimal control is almost always bang-bang.

What "optimal" actually means

You do not get a closed-form $u^\star(t)$ by magic. You get one of three things, depending on the approach:

Mental model

Pontryagin tells you the necessary conditions that any optimal trajectory must satisfy. Bellman (HJB) tells you the sufficient condition via a value function. LQR is the one case where both collapse into a clean, computable answer. MPC is "solve a finite-horizon Pontryagin problem, apply the first step, repeat."

Core objects you need to be fluent with

State $x \in \mathbb{R}^n$
The minimal information required to predict future behavior given future inputs. Not outputs — measurements live elsewhere.
Control $u \in \mathbb{R}^m$
What you get to choose. Usually constrained (actuator limits, safety bounds).
Costate / Adjoint $\lambda \in \mathbb{R}^n$
Shadow price of the state. $\lambda_i(t)$ = marginal change in $J^\star$ per unit change in $x_i$ at time $t$. Central to Pontryagin; dual variable to the dynamics constraint.
Hamiltonian $H(x,u,\lambda,t)$
$H = L + \lambda^\top f$. Pontryagin's whole principle reduces to "minimize $H$ in $u$, pointwise in time."
Value function $V(x,t)$
The optimal cost-to-go from state $x$ at time $t$. Bellman's object. Satisfies the HJB PDE.
Policy $\pi(x,t)$
The feedback law. For LQR: $\pi(x) = -Kx$ with constant $K$ (infinite horizon).
Notation convention

We use lower-case $x$ for state (not $\mathbf{x}$). Vectors are columns. Gradients $\nabla_x L$ are row or column per context (we'll be explicit when it matters). Time-argument dropped when clear from context: $f(x,u)$ means $f(x(t), u(t), t)$ for the time-varying case.

02
Calculus of Variations

The ancestor: Euler–Lagrange.

Before optimal control was a field, there was the variational calculus. Most of the machinery — costate, Hamiltonian, transversality — is inherited directly from it.

Consider minimizing $J[x(\cdot)] = \int_{t_0}^{t_f} L(x(t), \dot x(t), t)\, dt$ over smooth trajectories $x(\cdot)$ with fixed endpoints. A necessary condition for a minimizer is the Euler–Lagrange equation:

Euler–Lagrange $$\frac{\partial L}{\partial x} - \frac{d}{dt}\!\left(\frac{\partial L}{\partial \dot x}\right) = 0$$

Derivation sketch: perturb $x \to x + \epsilon \eta$ with $\eta$ vanishing at endpoints. Require $\left.\tfrac{d}{d\epsilon}\right|_{\epsilon=0} J = 0$. Integrate by parts. The "arbitrariness" of $\eta$ gives the equation pointwise.

Transversality at free endpoints

If $t_f$ or $x(t_f)$ is free, you get boundary conditions for free:

From variations to control

Optimal control generalizes this by splitting $\dot x$ into "dynamics $f(x,u,t)$", introducing $u$ as a separate variable, and handling constraints explicitly. The costate $\lambda$ plays the role that $\partial L / \partial \dot x$ played in the classical problem.

Why this matters operationally

If you ever want to derive custom optimality conditions (e.g., minimum-energy trajectory on a manifold, optimal path through a vector field), the variational viewpoint is faster than grinding through the full PMP machinery. Just remember: for inequality constraints or non-smooth $L$, classical CoV breaks down — use Pontryagin.

03
Pontryagin's Minimum Principle

Necessary conditions, in seven lines.

If $u^\star(\cdot)$ is optimal, there exist costates $\lambda^\star(\cdot)$ such that the state, costate, and input jointly satisfy a two-point boundary value problem with a pointwise minimization in $u$.

Statement

Define the Hamiltonian

$$H(x, u, \lambda, t) = L(x,u,t) + \lambda^\top f(x,u,t).$$

Then along the optimal trajectory:

Pontryagin's necessary conditions $$\begin{aligned} \dot x^\star &= \phantom{-}\nabla_\lambda H = f(x^\star, u^\star, t) \quad &&\text{(state eqn, forward)} \\ \dot \lambda^\star &= -\nabla_x H = -\nabla_x L - (\nabla_x f)^\top \lambda^\star \quad &&\text{(costate eqn, backward)} \\ u^\star(t) &= \arg\min_{u \in \mathcal{U}} H(x^\star, u, \lambda^\star, t) \quad &&\text{(Hamiltonian minimization)} \\ \lambda^\star(t_f) &= \nabla_x \phi(x^\star(t_f)) \quad &&\text{(transversality)} \\ H|_{t=t_f} &= -\partial \phi / \partial t \quad &&\text{(only if $t_f$ free)} \end{aligned}$$

The first two give you a two-point boundary value problem (TPBVP): $x$ known at $t_0$, $\lambda$ known at $t_f$. The third is the point-wise optimality. The difficulty of the problem lives almost entirely in the TPBVP — it's forward-backward, which is why indirect methods are numerically tricky.

Unconstrained $u$: the stationarity condition

If $u$ is unconstrained and $H$ is differentiable in $u$, then "minimize in $u$" becomes $\nabla_u H = 0$. For $L = \tfrac{1}{2} u^\top R u + \cdots$ with $R \succ 0$ and $f$ control-affine ($f = a(x) + B(x) u$), this gives:

$$u^\star = -R^{-1} B(x)^\top \lambda^\star.$$

This is the control-affine shortcut that powers LQR, iLQR, and most of classical optimal control.

Constrained $u$: bang-bang emerges

If $u \in [u_{\min}, u_{\max}]$ and $H$ is linear in $u$ (which happens for $L$ independent of $u$, e.g. minimum-time), then minimizing $H$ picks $u$ at a corner of the constraint set whenever the "switching function" $\sigma(t) = \lambda^\top B$ is nonzero:

$$u^\star(t) = \begin{cases} u_{\max} & \sigma(t) < 0 \\ u_{\min} & \sigma(t) > 0 \end{cases}$$

When $\sigma(t) \equiv 0$ on an interval, you have a singular arc, and $u$ is determined by differentiating $\sigma$ until $u$ appears — a delicate case.

Lab 01 · Minimum-Time Double Integrator
bang-bang control via PMP
Problem: drive $\ddot x = u$, $|u|\le u_{\max}$, from $(x_0,v_0)$ to origin in minimum time. The switching curve is $\{(x,v): x = -\tfrac{1}{2u_{\max}}|v|v\}$.
min time $t_f^\star$
switch at
first action
position $x(t)$ velocity $v(t)$ control $u(t)$
Rule of thumb — PMP in practice

For trajectory generation offline, direct methods (discretize then optimize via NLP, e.g. collocation) are almost always easier than indirect methods (shoot the TPBVP). The indirect approach gives higher accuracy if your initial guess for $\lambda(t_0)$ is decent — which it usually isn't. Use indirect when you need ultra-high precision (e.g. interplanetary trajectory design) or when the structure (singular arcs, bang-bang switches) is analytically known.

04
Dynamic Programming & HJB

Bellman's principle and the PDE you rarely solve.

The value function $V(x,t)$ satisfies a partial differential equation. Solving it directly gives a feedback policy — no TPBVP, no initial costate guess. The catch is the curse of dimensionality.

The principle of optimality

Bellman's insight, crudely: whatever you did to get to state $x$ at time $t$, the optimal continuation from $(x,t)$ only depends on $(x,t)$. Formally:

Bellman's principle (continuous) $$V(x,t) = \min_{u(\cdot)}\! \left\{ \int_t^{t+\Delta t}\! L\, d\tau + V(x + \Delta x, t + \Delta t) \right\}$$

Taking $\Delta t \to 0$ and expanding $V$:

Hamilton–Jacobi–Bellman PDE $$-\frac{\partial V}{\partial t} = \min_{u \in \mathcal{U}} \left\{ L(x,u,t) + (\nabla_x V)^\top f(x,u,t) \right\} = \min_u H(x,u,\nabla_x V, t)$$ $$V(x, t_f) = \phi(x)$$

Notice the clean correspondence: $\lambda = \nabla_x V$. The costate is the gradient of the value function. This is exactly why PMP and HJB agree where both are applicable — they're two views of the same optimality.

Discrete version (Bellman equation)

$$V_k(x_k) = \min_{u_k} \left\{ \ell_k(x_k, u_k) + V_{k+1}(f_k(x_k, u_k)) \right\}$$

Solvable by backward recursion when the state/action spaces are finite and small. For continuous states, you must parametrize $V$ — linear combinations of basis functions, neural networks, tensor trains, etc. That's where Approximate Dynamic Programming lives.

Curse of dimensionality

A grid of $N$ points per dimension over $n$ state dimensions is $N^n$ cells. For $n=10$ and $N=100$ that's $10^{20}$ — hopeless. This is why HJB is rarely solved directly outside $n \le 4$ or so. The exceptions:

Pitfall — viscosity solutions

$V$ is typically non-smooth (corners appear where bang-bang switches happen). Classical PDE solutions can be ambiguous. The correct notion is a viscosity solution (Crandall–Lions). If you're doing numerical HJB with upwind schemes, you're implicitly selecting the viscosity solution — good. If you're naively differentiating, you can be wrong — bad.

05
Linear Quadratic Regulator

The one problem you can solve exactly.

Linear dynamics, quadratic cost. PMP and HJB agree, the value function is quadratic, the optimal controller is linear state feedback, and you can compute it in microseconds. LQR is the scaffolding on which MPC, LQG, and most practical controllers are built.

Setup

$$\dot x = Ax + Bu, \qquad J = \tfrac{1}{2} x(t_f)^\top Q_f x(t_f) + \tfrac{1}{2}\int_0^{t_f} \left( x^\top Q x + u^\top R u \right) dt$$

With $Q \succeq 0$, $Q_f \succeq 0$, $R \succ 0$. (Strict positive definiteness of $R$ is required — otherwise the minimization is ill-posed.)

The finite-horizon solution

Write $V(x,t) = \tfrac{1}{2} x^\top P(t) x$. Plug into HJB. Match quadratic coefficients. You get:

Differential Riccati equation (DRE) $$-\dot P = A^\top P + PA - PBR^{-1}B^\top P + Q, \qquad P(t_f) = Q_f$$

Integrate backward in time from $t_f$ to $0$. Then the optimal controller is

$$u^\star(t) = -R^{-1} B^\top P(t) \, x(t) = -K(t)\, x(t).$$

Feedback gain $K(t) = R^{-1} B^\top P(t)$ is time-varying.

Infinite horizon ($t_f \to \infty$)

If $(A,B)$ is stabilizable and $(A, Q^{1/2})$ is detectable, $P(t)$ converges to a constant $P_\infty$ satisfying:

Continuous Algebraic Riccati Equation (CARE) $$A^\top P + P A - P B R^{-1} B^\top P + Q = 0$$

with the stabilizing solution $P \succ 0$ (the unique one such that $A - BK$ is Hurwitz, $K = R^{-1}B^\top P$). That's your constant-gain feedback controller.

Why LQR is special

For any other dynamics or cost, $V$ is not quadratic, and you cannot parametrize it with a finite-dimensional matrix. LQR is the one case where the curse of dimensionality doesn't bite: $V$ has $n(n+1)/2$ parameters regardless of state dimension.

Lab 02 · LQR Tuner — Double Integrator
$\ddot x = u$, infinite-horizon CARE
slider: $10^x$, so weight = 1
weight = 1
weight = 1
$K = [k_1, k_2]$
pole 1
pole 2
peak $|u|$
position velocity control
Rule — Bryson's rule for Q and R

When you have no better idea, set Q = diag(1/x_{i,\max}^2) and R = diag(1/u_{j,\max}^2) where $x_{i,\max}$ and $u_{j,\max}$ are the largest acceptable deviations for each state/input. This normalizes so that each term contributes roughly equally when states/inputs are at their tolerance limits. Adjust from there.

Discrete-time LQR

For $x_{k+1} = A_d x_k + B_d u_k$ and cost $\sum_{k=0}^{N-1}(x_k^\top Q x_k + u_k^\top R u_k) + x_N^\top Q_f x_N$:

Discrete Riccati (backward) $$P_k = Q + A_d^\top P_{k+1} A_d - A_d^\top P_{k+1} B_d (R + B_d^\top P_{k+1} B_d)^{-1} B_d^\top P_{k+1} A_d$$ $$K_k = (R + B_d^\top P_{k+1} B_d)^{-1} B_d^\top P_{k+1} A_d, \qquad u_k^\star = -K_k x_k$$

Infinite horizon: iterate until $P_k$ converges — that's the Discrete Algebraic Riccati Equation (DARE) solution.

Gain margin and stability margins

LQR has beautiful classical guarantees for the full state feedback case:

These are for SISO loop-at-a-time. MIMO has multivariable equivalents (e.g., return difference inequality: $|I + L(j\omega)| \ge I$).

Caveat — these margins evaporate with an observer

LQG (next chapter) uses a Kalman filter to estimate state. The classical LQR margins do not survive the addition of an observer — this is the famous Doyle 1978 counter-example. "LQG has no guaranteed stability margins." Hence Loop Transfer Recovery and, in modern practice, the shift toward MPC with direct robustness analysis.

06
Riccati Equations

Solving the CARE/DARE: you have options.

Naive iteration of the Riccati recursion works but converges linearly. The real solvers use Schur/Hamiltonian-matrix decompositions and converge in one shot. For embedded systems, iteration can still be the right choice.

Backward iteration (the textbook method)

Just run the discrete Riccati backward from $P_N = Q_f$ until $\|P_{k+1} - P_k\|$ is below tolerance. Simple, robust, no linear algebra beyond matrix multiply and small inverse. Convergence is linear — doubling the precision takes a fixed number of iterations, typically 20–100 for well-conditioned problems.

Hamiltonian matrix / Schur method

Form the $2n \times 2n$ Hamiltonian matrix:

$$\mathcal{H} = \begin{bmatrix} A & -BR^{-1}B^\top \\ -Q & -A^\top \end{bmatrix}$$

It has $n$ stable and $n$ unstable eigenvalues (in mirror-image symmetry across the imaginary axis for CT, unit circle for DT). Compute the real Schur form, reorder so stable eigenvalues are in the top-left, extract the stable invariant subspace:

$$\mathcal{H} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} S, \quad S \text{ stable} \implies P = X_2 X_1^{-1}$$

This is what MATLAB's care/dare actually does. Numerically stable, one-shot.

Doubling algorithm (structured DARE)

For DARE, the structured doubling algorithm (SDA) doubles the horizon per step: $P_k \to P_{2k}$ at each iteration. Quadratic convergence, very fast, preserves structure. Method of choice for large-scale DAREs.

Lab 03 · Riccati Iteration Convergence
backward recursion of discrete P
Watch P-matrix entries converge backward from $P_N = 0$ to $P_\infty$ (DARE solution). Convergence rate depends on plant eigenvalues and Q/R ratio.
$P_\infty(1,1)$
$P_\infty(2,2)$
iter. to tol.
$P_k(1,1)$ $P_k(1,2)$ $P_k(2,2)$

Numerical tips, the ones that matter

07
LQG & the Kalman Filter

You can't measure the state. Estimate it.

LQR assumes full state access. Real sensors measure outputs $y = Cx + v$. The optimal stochastic controller — under Gaussian noise and quadratic cost — is a Kalman filter cascaded with an LQR gain. The separation principle makes this decomposition exact.

Stochastic problem setup

$$\dot x = Ax + Bu + w, \quad y = Cx + v, \quad w \sim \mathcal{N}(0, W), \; v \sim \mathcal{N}(0, V)$$

Minimize $E\!\left[ \int_0^\infty (x^\top Q x + u^\top R u)\, dt \right]$.

The separation principle

The optimal controller is:

  1. Estimate $\hat x(t)$ via the Kalman filter (the optimal linear unbiased estimator under Gaussian noise).
  2. Apply $u(t) = -K \hat x(t)$ where $K$ is the deterministic LQR gain from the CARE.

The "filter Riccati" dual to the control Riccati gives the Kalman gain $L$:

Filter ARE (dual to CARE) $$A\Sigma + \Sigma A^\top - \Sigma C^\top V^{-1} C \Sigma + W = 0, \quad L = \Sigma C^\top V^{-1}$$

Observer dynamics: $\dot{\hat x} = A\hat x + Bu + L(y - C\hat x)$.

Doyle's result — LQG has no guaranteed margins

Even though LQR has $\ge 60°$ phase margin, cascading it with a Kalman filter can destroy robustness entirely. This famous result motivated the development of Loop Transfer Recovery (LTR) in the 1980s and eventually $\mathcal{H}_\infty$ / robust control in the '90s. Practical takeaway: for real systems with model uncertainty, verify robustness of the LQG loop explicitly — don't trust the LQR guarantees to carry over.

Discrete Kalman filter (the one you actually code)

Predict $$\hat x_{k|k-1} = A_d \hat x_{k-1|k-1} + B_d u_{k-1}$$ $$\Sigma_{k|k-1} = A_d \Sigma_{k-1|k-1} A_d^\top + W_d$$
Update $$L_k = \Sigma_{k|k-1} C^\top (C \Sigma_{k|k-1} C^\top + V)^{-1}$$ $$\hat x_{k|k} = \hat x_{k|k-1} + L_k (y_k - C \hat x_{k|k-1})$$ $$\Sigma_{k|k} = (I - L_k C)\Sigma_{k|k-1}(I - L_k C)^\top + L_k V L_k^\top$$

Use the Joseph form for $\Sigma_{k|k}$ (shown above) — symmetric and positive semi-definite under roundoff. The textbook form $(I - L_k C)\Sigma_{k|k-1}$ is algebraically equivalent but loses symmetry.

Extended and Unscented variants for nonlinear systems

08
Model Predictive Control

Receding horizon: what changed the industry.

At each control step: solve a finite-horizon optimal control problem online, apply only the first input, repeat at the next step. Handles constraints natively — which is why MPC ate the world in process control, automotive, robotics.

The MPC loop

At time $k$, given current state estimate $\hat x_k$, solve:

MPC subproblem at step $k$ $$\min_{u_{0:N-1}} \; \sum_{i=0}^{N-1} \left(x_i^\top Q x_i + u_i^\top R u_i\right) + x_N^\top P x_N$$ $$\text{s.t.} \quad x_{i+1} = A_d x_i + B_d u_i, \; x_0 = \hat x_k, \; u_i \in \mathcal{U}, \; x_i \in \mathcal{X}$$

Apply $u_k \leftarrow u_0^\star$. Discard the rest. Advance $k$. Repeat.

Why receding horizon (not just open-loop)?

Lab 04 · MPC Horizon Length vs. Closed-Loop Behavior
unconstrained finite-horizon LQR with terminal cost
The dashed line is the infinite-horizon LQR trajectory (the limit). Short $N$ with $\alpha=0$ is greedy; $\alpha=1$ recovers near-optimal behavior for any $N \ge 1$ (terminal cost effect). Use $\alpha > 1$ to see how over-weighting breaks things.
rel. cost vs $\infty$-LQR
first $u_0^\star$
terminal $\|x_N\|$
MPC position MPC velocity MPC control $\infty$-LQR position

Constraints and the QP

For linear dynamics with polytopic constraints (box bounds on $u$ and $x$, linear inequalities), the MPC subproblem is a quadratic program:

$$\min_z \tfrac{1}{2} z^\top H z + g^\top z \quad \text{s.t.} \quad A_c z \le b_c, \; A_{eq} z = b_{eq}$$

where $z = (u_0, \ldots, u_{N-1})$ (condensed form) or $z = (x_0, u_0, x_1, u_1, \ldots)$ (sparse form). Sparse form has better asymptotic complexity for long horizons; condensed form is often faster for $N \lesssim 20$.

Stability and recursive feasibility

These are the two things you must verify — not just hope for.

Recursive feasibility
If the MPC QP is feasible at step $k$, will it still be feasible at $k+1$? Not automatic with state constraints. Guaranteed by terminal set + control-invariance of that set.
Closed-loop stability
Standard sufficient conditions: (a) terminal cost $P$ is a Lyapunov function on a terminal set $\mathcal{X}_f$, (b) $\mathcal{X}_f$ is positively invariant under the terminal controller, (c) running cost is positive definite. Take $P$ = DARE solution for the unconstrained terminal controller, $\mathcal{X}_f$ = maximal positively invariant set under the LQR law.
Warm-starting
At step $k+1$, use the shifted solution from step $k$ (drop $u_0^\star$, append a terminal input) as initial guess. Huge speedup for active-set and interior-point QP solvers.

Solver landscape (what you'd actually use)

SolverMethodBest forNotes
OSQPADMM, first-ordermedium problems, warm-startingNo library deps, code-gen, robust. Industry default for embedded MPC.
HPIPMInterior-point, structuredlong horizons, real-timeRiccati-based factorization exploits sparse MPC structure. Fastest for $N\ge 20$.
qpOASESActive-set, onlinesmall problems, smooth warm-startsExcellent when active set changes slowly between steps.
GRAMPCGradient projectionnonlinear MPC, embeddedHandles NLP directly; no QP solve per step. Suboptimal, but fast.
acadosRTI / SQP frameworknonlinear, production useGenerates C code. Real-Time Iteration for NMPC.
Rule — choose $N$ rationally

For a controllable LTI system, $N$ should cover roughly 1–2 dominant time constants of the closed-loop response. Too short and terminal cost matters too much; too long and you pay solve time for no benefit. A common heuristic: pick the smallest $N$ such that doubling $N$ changes closed-loop cost by $< 1\%$. For motor-control-grade sampling at 10 kHz, typical $N = 10$–$20$.

Rule — Real-Time Iteration (Diehl)

For nonlinear MPC, don't converge the NLP at every step. Do one SQP iteration per control step (initialized by shifted previous solution). Under mild conditions, closed-loop stability is preserved and you get bounded suboptimality. This is what makes NMPC tractable at kHz rates in embedded systems — acados and GRAMPC both ride this idea.

09
Nonlinear Optimal Control

iLQR, DDP, and direct collocation.

When the dynamics are nonlinear, you have two camps: linearize-and-iterate (iLQR, SQP) or discretize-and-optimize (collocation). Both work, both are used in production.

iLQR — iterative LQR

Given a nominal trajectory $(\bar x_k, \bar u_k)$, linearize the dynamics and quadratize the cost around it. The resulting time-varying LQR sub-problem has a closed-form solution via the discrete Riccati recursion. Apply the update, relinearize, repeat. Converges superlinearly near a local optimum.

iLQR skeleton
# Given nominal (x̄, ū), cost ℓ, dynamics f
for iter in range(max_iter):
    # backward pass: build quadratic Q, solve Riccati
    V_x, V_xx = ∂ℓ_N/∂x, ∂²ℓ_N/∂x²
    for k in reversed(range(N)):
        # Q-function expansion around (x̄_k, ū_k)
        A_k, B_k = ∂f/∂x, ∂f/∂u
        Q_x  = ℓ_x  + A_k.T @ V_x
        Q_u  = ℓ_u  + B_k.T @ V_x
        Q_xx = ℓ_xx + A_k.T @ V_xx @ A_k
        Q_uu = ℓ_uu + B_k.T @ V_xx @ B_k
        Q_ux = ℓ_ux + B_k.T @ V_xx @ A_k
        # feedforward + feedback gains
        k_ff =  -Q_uu⁻¹ Q_u
        K_fb =  -Q_uu⁻¹ Q_ux
        # value update
        V_x  = Q_x  + K_fb.T @ Q_uu @ k_ff + K_fb.T @ Q_u  + Q_ux.T @ k_ff
        V_xx = Q_xx + K_fb.T @ Q_uu @ K_fb + K_fb.T @ Q_ux + Q_ux.T @ K_fb
    # forward pass with line search on α ∈ (0,1]
    for k in range(N):
        u_k = ū_k + α·k_ff_k + K_fb_k @ (x_k - x̄_k)
        x_{k+1} = f(x_k, u_k)
    if converged: break
    x̄, ū = x, u

DDP vs iLQR

DDP (Differential Dynamic Programming) also includes second-order dynamics terms $f_{xx}, f_{uu}, f_{xu}$ in the backward pass. iLQR drops them (Gauss–Newton style). In practice:

Direct collocation and shooting

Instead of solving the TPBVP, discretize $(x_0, u_0, \ldots, x_N, u_N)$ as NLP variables and impose dynamics as equality constraints between knots:

Feed the NLP to IPOPT, SNOPT, or a custom SQP. This is the workhorse of trajectory optimization for aircraft, spacecraft, and robots.

When to pick what

iLQR/DDP for smooth, unconstrained or box-constrained problems, especially when you need real-time re-planning (trajectory MPC). Collocation + IPOPT when you have complex path constraints, need high accuracy, or the problem is offline. Direct shooting (single or multiple) when dynamics integration is expensive and you want fewer decision variables.

10
Connection to Reinforcement Learning

Approximate dynamic programming, in one page.

Reinforcement learning is what you get when you drop the "known model" assumption and try to solve HJB by sampling. Everything else — policy iteration, Q-learning, actor-critic, DDPG — is a variation on that theme.

Policy iteration (the Rosetta stone)

For a fixed policy $\pi$, the value function $V^\pi$ satisfies the policy evaluation equation (linear):

$$V^\pi(x) = L(x, \pi(x)) + \gamma\, V^\pi(f(x, \pi(x)))$$

Given $V^\pi$, improve the policy greedily:

$$\pi^+(x) = \arg\min_u \{L(x,u) + \gamma V^\pi(f(x,u))\}$$

Alternating these two steps converges to $\pi^\star$. In control: this is Howard's policy iteration. In RL: same thing, but with sampled trajectories replacing exact expectations and function approximation replacing exact $V$.

The LQR–RL bridge

For LQR problems, policy iteration on $V(x) = x^\top P x$ with $\pi(x) = -Kx$ converges to the CARE solution in closed form. Every iteration solves a Lyapunov equation (linear in $P$, given $K$), which is trivial. This is known as Kleinman's algorithm (1968) — the direct ancestor of modern actor-critic methods.

Kleinman iteration (continuous-time, deterministic) $$\text{given } K_i \text{ stabilizing:}$$ $$(A - BK_i)^\top P_i + P_i(A - BK_i) + Q + K_i^\top R K_i = 0 \quad \text{(Lyapunov)}$$ $$K_{i+1} = R^{-1} B^\top P_i$$

Quadratic convergence to CARE solution. Each step is a Lyapunov equation. Foundational.

Q-learning and the discrete-time dual

Define the state-action value $Q^\pi(x,u) = L(x,u) + \gamma V^\pi(f(x,u))$. Then the optimal $Q^\star$ satisfies:

$$Q^\star(x,u) = L(x,u) + \gamma \min_{u'} Q^\star(f(x,u), u')$$

This is what Q-learning approximates. For LQR, $Q^\star(x,u) = [x;u]^\top H [x;u]$ is quadratic — and the greedy policy $u^\star = -H_{uu}^{-1} H_{ux}\,x$ is exactly the LQR gain. You can learn $H$ without knowing $(A,B)$ by solving least-squares on observed $(x,u,x')$ tuples. This is model-free LQR — a staple of adaptive dynamic programming.

Why this matters for the practitioner

If you know your plant, use the model: solve CARE/DARE, deploy. It's faster, more data-efficient, and has provable robustness bounds. Use RL/ADP when the model is unknown, time-varying, or when the state space is so large that parametric $V$/$Q$ approximation is necessary. For adaptive motor control, the Kleinman framework lets you update $K$ online using measured data without re-identifying $(A,B)$ — often the right tool.

11
Rules of Thumb

What twenty years of field deployment teaches you.

Not theorems. Heuristics. The stuff that saves you a week when something isn't working.

Tuning Q and R

Bryson's rule (already mentioned, repeated for emphasis)

Start with $Q = \text{diag}(1/x_{i,\max}^2)$, $R = \text{diag}(1/u_{j,\max}^2)$ using "acceptable peak deviation" for each channel. Then scale $Q$ or $R$ uniformly to trade performance for effort.

One knob is enough for many problems

Fix $Q = I$ and sweep a scalar $\rho$ in $R = \rho I$. As $\rho \to 0$, the controller becomes aggressive; as $\rho \to \infty$, lazy. The bandwidth of the closed loop scales roughly like $\rho^{-1/2}$ for the double integrator. Plot the Bode/root locus as $\rho$ varies. This is the "LQR design curve."

Cross-terms $Q_{ij}$ are rarely needed

If you think you need $Q_{12} \ne 0$ to "penalize the product of error and velocity," you usually don't. Re-examine whether your cost has physical meaning or you're over-specifying. Diagonal $Q$, $R$ covers 90% of practical cases.

Discretization

Sample rate vs. bandwidth

Sample at $f_s \ge 10\text{–}20 \times$ the desired closed-loop bandwidth. $5\times$ is the absolute floor (Shannon minimum is much less, but the phase loss from ZOH kills you). For motor control: 10 kHz sampling is common for kHz-bandwidth current loops.

Use ZOH exact discretization for LTI

$A_d = e^{AT_s}$, $B_d = \int_0^{T_s} e^{A\tau}\, d\tau \cdot B$. Use matrix exponential (expm / c2d). Euler forward ($A_d = I + AT_s$) is fine for fast sampling but can destabilize if $\|A\|T_s \gtrsim 0.2$. Never Euler-discretize a fast mode you care about.

Bilinear (Tustin) for frequency-domain preservation

If you care about frequency response matching — not time-domain — Tustin $s = \tfrac{2}{T_s}\tfrac{z-1}{z+1}$ maps the $j\omega$ axis to the unit circle with warping. Use pre-warping at the crossover frequency.

Sanity checks before deployment

Check controllability / observability

Before tuning: verify $\text{rank}[B, AB, \ldots, A^{n-1}B] = n$ and dually for observability. If rank-deficient, LQR/Kalman won't be well-posed and you need to fix the structure (add a sensor, add an input, remove a state) — not tune harder.

Gramians for weak modes

Small eigenvalues of the controllability/observability Gramian indicate weakly controllable/observable modes. Technically you're controllable/observable, but in practice you need enormous gain or energy. Balanced truncation or re-design the plant.

Check $A - BK$ eigenvalues

For continuous LQR: eigenvalues should be in the open left half-plane. For discrete: inside unit circle. If you computed $K$ but the closed loop is unstable, your CARE solver returned the wrong fixed point (non-stabilizing solution). This happens with bad problem conditioning.

Singular values of $I + L$

For MIMO, plot $\sigma_{\min}(I + L(j\omega))$ where $L = K(sI-A)^{-1}B$. Classical LQR guarantees $\sigma_{\min} \ge 1$. This is stronger than SISO margins and is the MIMO analog of the return-difference condition.

MPC-specific

Terminal cost = DARE solution, always

Use $P$ = solution of the unconstrained DARE as the terminal weight. This makes finite-horizon MPC behave like infinite-horizon LQR deep in the feasible region, which is the desired behavior. It also gives you a Lyapunov function for stability proofs, free.

If it doesn't behave like LQR deep inside, you have a bug

Inside the constraint set, unconstrained MPC with $P$ = DARE should produce exactly the LQR response. If it doesn't, check: wrong $P$, wrong discretization, horizon too short, solver not converged, cost function scaling off.

Soft constraints on states, hard on inputs

Actuator limits are physical and hard. State constraints (velocity limits, position bounds) often want slack — use $\ell_1$-penalized slack variables to preserve feasibility when a disturbance briefly pushes the state out of the safe set. Never soft-constraint a safety-critical bound, though.

Warm-start, always

Shift previous solution by one step as initial guess for the next QP. This is free and can reduce solve time by 3–10×. Every serious MPC solver supports it.

Numerical hygiene

Scale your states

If $x_1$ is in meters (O(1)) and $x_2$ is in rad/s (O(100)), condition numbers blow up. Scale to unit magnitude via a diagonal transformation $\tilde x = D^{-1} x$. Apply the inverse scaling to the resulting gain. This alone can fix half of "LQR weirdness" bug reports.

Symmetrize $P$ after every update

Floating-point roundoff breaks $P^\top = P$. After each Riccati iteration, replace $P \leftarrow (P + P^\top)/2$. Takes one line. Prevents drift toward indefiniteness.

Use Joseph form for the Kalman covariance update

Instead of $\Sigma^+ = (I - LC)\Sigma$, use $(I-LC)\Sigma(I-LC)^\top + LVL^\top$. Preserves symmetry and positive semi-definiteness exactly (in exact arithmetic) and robustly (in finite precision).

12
Implementation Notes

Getting from paper to silicon.

These are the issues that show up when your controller has to run in 100 μs on an automotive-grade MCU, not on a MATLAB desktop.

Offline vs. online computation

For LTI plants, push everything offline that possibly can be:

Fixed-point vs. floating-point

Automotive and aerospace platforms often mandate fixed-point to avoid floating-point exceptions and determinism issues (and because AUTOSAR-generated C from TargetLink frequently targets fixed-point integer types). Implications:

TargetLink / AUTOSAR specifics

The TargetLink data dictionary type resolution matters. If your LQR gain block outputs a float32 but the consumer is a uint16 interface, you'll get implicit casts that wrap negative values — this is the exact class of bug that kills HCI sine signal paths and similar. Always enforce type consistency at module boundaries and use Lib_* blocks with explicit type configuration. See also the discussion of Lib_Switch type promotion rules in the TL reference.

MPC on a microcontroller

Yes, it's feasible. Here's the reality:

State estimator: covariance propagation tricks

Validation workflow

  1. Model-in-the-loop (MIL): Simulink/numerical, bit-exact to design. Full coverage including corner cases.
  2. Software-in-the-loop (SIL): generated C against plant simulation. Verifies code-gen hasn't introduced bugs.
  3. Processor-in-the-loop (PIL): C running on target MCU, plant on host. Exposes timing and fixed-point effects.
  4. Hardware-in-the-loop (HIL): full controller + real sensors/actuators against plant model. This is where you find integration bugs (message ordering, timer drift, CAN bus jitter).
  5. Bench testrig testvehicle test. Measure closed-loop frequency response. Compare to LTI model to catch unmodeled dynamics before they bite.
Where people get burned

The single most common failure mode for deployed optimal controllers: a subtle difference between the design model (continuous-time LTI, nominal parameters) and the as-deployed plant (discrete-time, time-varying, with delays, saturation, dead-zone, sensor latency). Always verify closed-loop stability under realistic deployment conditions — not just nominal. Add a 1–2 sample delay to your design model as a robustness check. If LQR breaks, your margins were too tight.

13
Glossary & Symbol Table

Every term, one definition.

When you read a new optimal-control paper, come back here. We all use the same symbols with subtly different conventions.

Statex ∈ ℝⁿ
Minimal information needed to predict future evolution of the system given future inputs. Not identical to measurements ("outputs $y$").
Control inputu ∈ ℝᵐ
Vector of actuator commands; the degrees of freedom the controller gets to choose. Usually constrained to a set $\mathcal{U}$.
Costate / Adjointλ ∈ ℝⁿ
Lagrange multiplier on the dynamics constraint in PMP; equals the gradient of the value function, $\lambda = \nabla_x V$.
HamiltonianH = L + λᵀf
The scalar function minimized in the control at each instant in PMP. For control-affine systems with quadratic cost: $H = \tfrac{1}{2}u^\top Ru + \lambda^\top B u + \ldots$
Value functionV(x,t)
Optimal cost-to-go from state $x$ at time $t$. Satisfies the HJB PDE. For LQR: $V(x) = x^\top P x$.
Policyπ: x ↦ u
A state-feedback control law. "Optimal policy" $\pi^\star(x) = \arg\min_u \{L + \nabla_x V \cdot f\}$.
Bolza / Lagrange / MayerJ
Cost functional variants: Bolza = running + terminal; Lagrange = running only; Mayer = terminal only. All three are inter-convertible via state augmentation.
Stabilizable(A,B)
Uncontrollable modes (if any) are stable. Weaker than controllable. Required for infinite-horizon LQR existence.
Detectable(A,C)
Dual: unobservable modes are stable. Required for Kalman filter existence / CARE uniqueness.
Controllability GramianW_c
$W_c = \int_0^\infty e^{A\tau} B B^\top e^{A^\top \tau}\, d\tau$. Measures "how much energy" to reach each direction. Small eigenvalues $\Rightarrow$ weakly controllable modes.
CARE
Continuous Algebraic Riccati Equation: $A^\top P + PA - PBR^{-1}B^\top P + Q = 0$. LQR steady-state solution.
DARE
Discrete Algebraic Riccati Equation. Analog of CARE for discrete-time systems.
Transversality
Boundary conditions on the costate/value at free endpoints: $\lambda(t_f) = \nabla_x \phi$, $H(t_f) = -\partial\phi/\partial t$ if $t_f$ free.
Bang-bang
Optimal $u$ takes values only on the boundary $\partial\mathcal{U}$; switches discretely. Characteristic of minimum-time and minimum-fuel problems with linear $H$ in $u$.
Singular arc
Interval where the switching function $\sigma = \partial H/\partial u$ is identically zero. Optimal $u$ is determined only after differentiating $\sigma$ enough times.
Receding horizon
The MPC strategy: solve a finite-horizon problem, apply only the first step, roll the horizon forward. Gives implicit state feedback.
Recursive feasibility
Property that if MPC QP is solvable at step $k$, it remains solvable at step $k+1$. Ensured by terminal constraint on a positively invariant set.
Terminal cost / setP, 𝒳f
Cost added at the horizon end to approximate the "tail" of the infinite-horizon problem. Terminal set ensures stability of the closed loop.
Warm start
Initialize solver at step $k+1$ with the shifted solution from step $k$. Free, major speedup. Standard practice.
Separation principle
For LQG: optimal stochastic controller = Kalman filter + LQR gain. The estimation and control problems decouple.
Joseph form
Kalman covariance update written as $(I-LC)\Sigma(I-LC)^\top + LVL^\top$ — numerically robust variant.
HJB
Hamilton–Jacobi–Bellman PDE: the continuous-time dynamic programming equation for $V$.
ZOH
Zero-order hold: the most common way to discretize LTI systems — control held constant over each sample period.
iLQR
Iterative LQR. Gauss–Newton-style iterative nonlinear trajectory optimization; each iter solves a time-varying LQR via DRE.
DDP
Differential Dynamic Programming. Like iLQR but includes second-order dynamics expansions in the backward pass.
Kleinman iteration
Policy iteration for LQR: alternates Lyapunov equation (value) with gain update. Converges quadratically to CARE solution.
Q-functionQ(x,u)
State-action value. $Q(x,u) = L + \gamma V(f(x,u))$. For LQR, quadratic in $[x;u]$. Basis of Q-learning and model-free ADP.
WCET
Worst-Case Execution Time. Mandatory certification metric for safety-critical real-time control. Bounded by capping solver iterations.
LTR
Loop Transfer Recovery. 1980s technique to recover LQR-like margins in LQG by aggressively tuning the Kalman filter to approximate full-state feedback.
RTI
Real-Time Iteration. NMPC strategy where only one SQP iteration is performed per control step. Enables kHz-rate NMPC.
· · · FIN · · ·