A working engineer's guide to optimal control: from the calculus of variations to Pontryagin's minimum principle, from LQR and Riccati equations to model predictive control and nonlinear trajectory optimization. Every section is paired with runnable simulations, practical implementation notes for embedded/real-time systems, and field-tested tuning heuristics.
A dynamical system has states $x$ and inputs $u$. You choose $u(t)$ over time to minimize a cost — subject to physics. That's it. Everything else is mathematical machinery for doing it well.
Given dynamics $\dot{x} = f(x,u,t)$ with initial state $x(t_0) = x_0$, find the input trajectory $u(\cdot)$ on $[t_0, t_f]$ that minimizes the Bolza cost functional:
Subject to path constraints: $g(x,u,t) \le 0$, terminal constraints: $\psi(x(t_f), t_f) = 0$, and possibly state/control bounds: $u(t) \in \mathcal{U}$, $x(t) \in \mathcal{X}$.
Three common degenerate forms you'll see named in the wild:
You do not get a closed-form $u^\star(t)$ by magic. You get one of three things, depending on the approach:
Pontryagin tells you the necessary conditions that any optimal trajectory must satisfy. Bellman (HJB) tells you the sufficient condition via a value function. LQR is the one case where both collapse into a clean, computable answer. MPC is "solve a finite-horizon Pontryagin problem, apply the first step, repeat."
We use lower-case $x$ for state (not $\mathbf{x}$). Vectors are columns. Gradients $\nabla_x L$ are row or column per context (we'll be explicit when it matters). Time-argument dropped when clear from context: $f(x,u)$ means $f(x(t), u(t), t)$ for the time-varying case.
Before optimal control was a field, there was the variational calculus. Most of the machinery — costate, Hamiltonian, transversality — is inherited directly from it.
Consider minimizing $J[x(\cdot)] = \int_{t_0}^{t_f} L(x(t), \dot x(t), t)\, dt$ over smooth trajectories $x(\cdot)$ with fixed endpoints. A necessary condition for a minimizer is the Euler–Lagrange equation:
Derivation sketch: perturb $x \to x + \epsilon \eta$ with $\eta$ vanishing at endpoints. Require $\left.\tfrac{d}{d\epsilon}\right|_{\epsilon=0} J = 0$. Integrate by parts. The "arbitrariness" of $\eta$ gives the equation pointwise.
If $t_f$ or $x(t_f)$ is free, you get boundary conditions for free:
Optimal control generalizes this by splitting $\dot x$ into "dynamics $f(x,u,t)$", introducing $u$ as a separate variable, and handling constraints explicitly. The costate $\lambda$ plays the role that $\partial L / \partial \dot x$ played in the classical problem.
If you ever want to derive custom optimality conditions (e.g., minimum-energy trajectory on a manifold, optimal path through a vector field), the variational viewpoint is faster than grinding through the full PMP machinery. Just remember: for inequality constraints or non-smooth $L$, classical CoV breaks down — use Pontryagin.
If $u^\star(\cdot)$ is optimal, there exist costates $\lambda^\star(\cdot)$ such that the state, costate, and input jointly satisfy a two-point boundary value problem with a pointwise minimization in $u$.
Define the Hamiltonian
Then along the optimal trajectory:
The first two give you a two-point boundary value problem (TPBVP): $x$ known at $t_0$, $\lambda$ known at $t_f$. The third is the point-wise optimality. The difficulty of the problem lives almost entirely in the TPBVP — it's forward-backward, which is why indirect methods are numerically tricky.
If $u$ is unconstrained and $H$ is differentiable in $u$, then "minimize in $u$" becomes $\nabla_u H = 0$. For $L = \tfrac{1}{2} u^\top R u + \cdots$ with $R \succ 0$ and $f$ control-affine ($f = a(x) + B(x) u$), this gives:
This is the control-affine shortcut that powers LQR, iLQR, and most of classical optimal control.
If $u \in [u_{\min}, u_{\max}]$ and $H$ is linear in $u$ (which happens for $L$ independent of $u$, e.g. minimum-time), then minimizing $H$ picks $u$ at a corner of the constraint set whenever the "switching function" $\sigma(t) = \lambda^\top B$ is nonzero:
When $\sigma(t) \equiv 0$ on an interval, you have a singular arc, and $u$ is determined by differentiating $\sigma$ until $u$ appears — a delicate case.
For trajectory generation offline, direct methods (discretize then optimize via NLP, e.g. collocation) are almost always easier than indirect methods (shoot the TPBVP). The indirect approach gives higher accuracy if your initial guess for $\lambda(t_0)$ is decent — which it usually isn't. Use indirect when you need ultra-high precision (e.g. interplanetary trajectory design) or when the structure (singular arcs, bang-bang switches) is analytically known.
The value function $V(x,t)$ satisfies a partial differential equation. Solving it directly gives a feedback policy — no TPBVP, no initial costate guess. The catch is the curse of dimensionality.
Bellman's insight, crudely: whatever you did to get to state $x$ at time $t$, the optimal continuation from $(x,t)$ only depends on $(x,t)$. Formally:
Taking $\Delta t \to 0$ and expanding $V$:
Notice the clean correspondence: $\lambda = \nabla_x V$. The costate is the gradient of the value function. This is exactly why PMP and HJB agree where both are applicable — they're two views of the same optimality.
Solvable by backward recursion when the state/action spaces are finite and small. For continuous states, you must parametrize $V$ — linear combinations of basis functions, neural networks, tensor trains, etc. That's where Approximate Dynamic Programming lives.
A grid of $N$ points per dimension over $n$ state dimensions is $N^n$ cells. For $n=10$ and $N=100$ that's $10^{20}$ — hopeless. This is why HJB is rarely solved directly outside $n \le 4$ or so. The exceptions:
$V$ is typically non-smooth (corners appear where bang-bang switches happen). Classical PDE solutions can be ambiguous. The correct notion is a viscosity solution (Crandall–Lions). If you're doing numerical HJB with upwind schemes, you're implicitly selecting the viscosity solution — good. If you're naively differentiating, you can be wrong — bad.
Linear dynamics, quadratic cost. PMP and HJB agree, the value function is quadratic, the optimal controller is linear state feedback, and you can compute it in microseconds. LQR is the scaffolding on which MPC, LQG, and most practical controllers are built.
With $Q \succeq 0$, $Q_f \succeq 0$, $R \succ 0$. (Strict positive definiteness of $R$ is required — otherwise the minimization is ill-posed.)
Write $V(x,t) = \tfrac{1}{2} x^\top P(t) x$. Plug into HJB. Match quadratic coefficients. You get:
Integrate backward in time from $t_f$ to $0$. Then the optimal controller is
Feedback gain $K(t) = R^{-1} B^\top P(t)$ is time-varying.
If $(A,B)$ is stabilizable and $(A, Q^{1/2})$ is detectable, $P(t)$ converges to a constant $P_\infty$ satisfying:
with the stabilizing solution $P \succ 0$ (the unique one such that $A - BK$ is Hurwitz, $K = R^{-1}B^\top P$). That's your constant-gain feedback controller.
For any other dynamics or cost, $V$ is not quadratic, and you cannot parametrize it with a finite-dimensional matrix. LQR is the one case where the curse of dimensionality doesn't bite: $V$ has $n(n+1)/2$ parameters regardless of state dimension.
When you have no better idea, set Q = diag(1/x_{i,\max}^2) and R = diag(1/u_{j,\max}^2) where $x_{i,\max}$ and $u_{j,\max}$ are the largest acceptable deviations for each state/input. This normalizes so that each term contributes roughly equally when states/inputs are at their tolerance limits. Adjust from there.
For $x_{k+1} = A_d x_k + B_d u_k$ and cost $\sum_{k=0}^{N-1}(x_k^\top Q x_k + u_k^\top R u_k) + x_N^\top Q_f x_N$:
Infinite horizon: iterate until $P_k$ converges — that's the Discrete Algebraic Riccati Equation (DARE) solution.
LQR has beautiful classical guarantees for the full state feedback case:
These are for SISO loop-at-a-time. MIMO has multivariable equivalents (e.g., return difference inequality: $|I + L(j\omega)| \ge I$).
LQG (next chapter) uses a Kalman filter to estimate state. The classical LQR margins do not survive the addition of an observer — this is the famous Doyle 1978 counter-example. "LQG has no guaranteed stability margins." Hence Loop Transfer Recovery and, in modern practice, the shift toward MPC with direct robustness analysis.
Naive iteration of the Riccati recursion works but converges linearly. The real solvers use Schur/Hamiltonian-matrix decompositions and converge in one shot. For embedded systems, iteration can still be the right choice.
Just run the discrete Riccati backward from $P_N = Q_f$ until $\|P_{k+1} - P_k\|$ is below tolerance. Simple, robust, no linear algebra beyond matrix multiply and small inverse. Convergence is linear — doubling the precision takes a fixed number of iterations, typically 20–100 for well-conditioned problems.
Form the $2n \times 2n$ Hamiltonian matrix:
It has $n$ stable and $n$ unstable eigenvalues (in mirror-image symmetry across the imaginary axis for CT, unit circle for DT). Compute the real Schur form, reorder so stable eigenvalues are in the top-left, extract the stable invariant subspace:
This is what MATLAB's care/dare actually does. Numerically stable, one-shot.
For DARE, the structured doubling algorithm (SDA) doubles the horizon per step: $P_k \to P_{2k}$ at each iteration. Quadratic convergence, very fast, preserves structure. Method of choice for large-scale DAREs.
balreal in MATLAB, or equivalent scaling in SciPy.LQR assumes full state access. Real sensors measure outputs $y = Cx + v$. The optimal stochastic controller — under Gaussian noise and quadratic cost — is a Kalman filter cascaded with an LQR gain. The separation principle makes this decomposition exact.
Minimize $E\!\left[ \int_0^\infty (x^\top Q x + u^\top R u)\, dt \right]$.
The optimal controller is:
The "filter Riccati" dual to the control Riccati gives the Kalman gain $L$:
Observer dynamics: $\dot{\hat x} = A\hat x + Bu + L(y - C\hat x)$.
Even though LQR has $\ge 60°$ phase margin, cascading it with a Kalman filter can destroy robustness entirely. This famous result motivated the development of Loop Transfer Recovery (LTR) in the 1980s and eventually $\mathcal{H}_\infty$ / robust control in the '90s. Practical takeaway: for real systems with model uncertainty, verify robustness of the LQG loop explicitly — don't trust the LQR guarantees to carry over.
Use the Joseph form for $\Sigma_{k|k}$ (shown above) — symmetric and positive semi-definite under roundoff. The textbook form $(I - L_k C)\Sigma_{k|k-1}$ is algebraically equivalent but loses symmetry.
At each control step: solve a finite-horizon optimal control problem online, apply only the first input, repeat at the next step. Handles constraints natively — which is why MPC ate the world in process control, automotive, robotics.
At time $k$, given current state estimate $\hat x_k$, solve:
Apply $u_k \leftarrow u_0^\star$. Discard the rest. Advance $k$. Repeat.
For linear dynamics with polytopic constraints (box bounds on $u$ and $x$, linear inequalities), the MPC subproblem is a quadratic program:
where $z = (u_0, \ldots, u_{N-1})$ (condensed form) or $z = (x_0, u_0, x_1, u_1, \ldots)$ (sparse form). Sparse form has better asymptotic complexity for long horizons; condensed form is often faster for $N \lesssim 20$.
These are the two things you must verify — not just hope for.
| Solver | Method | Best for | Notes |
|---|---|---|---|
| OSQP | ADMM, first-order | medium problems, warm-starting | No library deps, code-gen, robust. Industry default for embedded MPC. |
| HPIPM | Interior-point, structured | long horizons, real-time | Riccati-based factorization exploits sparse MPC structure. Fastest for $N\ge 20$. |
| qpOASES | Active-set, online | small problems, smooth warm-starts | Excellent when active set changes slowly between steps. |
| GRAMPC | Gradient projection | nonlinear MPC, embedded | Handles NLP directly; no QP solve per step. Suboptimal, but fast. |
| acados | RTI / SQP framework | nonlinear, production use | Generates C code. Real-Time Iteration for NMPC. |
For a controllable LTI system, $N$ should cover roughly 1–2 dominant time constants of the closed-loop response. Too short and terminal cost matters too much; too long and you pay solve time for no benefit. A common heuristic: pick the smallest $N$ such that doubling $N$ changes closed-loop cost by $< 1\%$. For motor-control-grade sampling at 10 kHz, typical $N = 10$–$20$.
For nonlinear MPC, don't converge the NLP at every step. Do one SQP iteration per control step (initialized by shifted previous solution). Under mild conditions, closed-loop stability is preserved and you get bounded suboptimality. This is what makes NMPC tractable at kHz rates in embedded systems — acados and GRAMPC both ride this idea.
When the dynamics are nonlinear, you have two camps: linearize-and-iterate (iLQR, SQP) or discretize-and-optimize (collocation). Both work, both are used in production.
Given a nominal trajectory $(\bar x_k, \bar u_k)$, linearize the dynamics and quadratize the cost around it. The resulting time-varying LQR sub-problem has a closed-form solution via the discrete Riccati recursion. Apply the update, relinearize, repeat. Converges superlinearly near a local optimum.
# Given nominal (x̄, ū), cost ℓ, dynamics f for iter in range(max_iter): # backward pass: build quadratic Q, solve Riccati V_x, V_xx = ∂ℓ_N/∂x, ∂²ℓ_N/∂x² for k in reversed(range(N)): # Q-function expansion around (x̄_k, ū_k) A_k, B_k = ∂f/∂x, ∂f/∂u Q_x = ℓ_x + A_k.T @ V_x Q_u = ℓ_u + B_k.T @ V_x Q_xx = ℓ_xx + A_k.T @ V_xx @ A_k Q_uu = ℓ_uu + B_k.T @ V_xx @ B_k Q_ux = ℓ_ux + B_k.T @ V_xx @ A_k # feedforward + feedback gains k_ff = -Q_uu⁻¹ Q_u K_fb = -Q_uu⁻¹ Q_ux # value update V_x = Q_x + K_fb.T @ Q_uu @ k_ff + K_fb.T @ Q_u + Q_ux.T @ k_ff V_xx = Q_xx + K_fb.T @ Q_uu @ K_fb + K_fb.T @ Q_ux + Q_ux.T @ K_fb # forward pass with line search on α ∈ (0,1] for k in range(N): u_k = ū_k + α·k_ff_k + K_fb_k @ (x_k - x̄_k) x_{k+1} = f(x_k, u_k) if converged: break x̄, ū = x, u
DDP (Differential Dynamic Programming) also includes second-order dynamics terms $f_{xx}, f_{uu}, f_{xu}$ in the backward pass. iLQR drops them (Gauss–Newton style). In practice:
Instead of solving the TPBVP, discretize $(x_0, u_0, \ldots, x_N, u_N)$ as NLP variables and impose dynamics as equality constraints between knots:
Feed the NLP to IPOPT, SNOPT, or a custom SQP. This is the workhorse of trajectory optimization for aircraft, spacecraft, and robots.
iLQR/DDP for smooth, unconstrained or box-constrained problems, especially when you need real-time re-planning (trajectory MPC). Collocation + IPOPT when you have complex path constraints, need high accuracy, or the problem is offline. Direct shooting (single or multiple) when dynamics integration is expensive and you want fewer decision variables.
Reinforcement learning is what you get when you drop the "known model" assumption and try to solve HJB by sampling. Everything else — policy iteration, Q-learning, actor-critic, DDPG — is a variation on that theme.
For a fixed policy $\pi$, the value function $V^\pi$ satisfies the policy evaluation equation (linear):
Given $V^\pi$, improve the policy greedily:
Alternating these two steps converges to $\pi^\star$. In control: this is Howard's policy iteration. In RL: same thing, but with sampled trajectories replacing exact expectations and function approximation replacing exact $V$.
For LQR problems, policy iteration on $V(x) = x^\top P x$ with $\pi(x) = -Kx$ converges to the CARE solution in closed form. Every iteration solves a Lyapunov equation (linear in $P$, given $K$), which is trivial. This is known as Kleinman's algorithm (1968) — the direct ancestor of modern actor-critic methods.
Quadratic convergence to CARE solution. Each step is a Lyapunov equation. Foundational.
Define the state-action value $Q^\pi(x,u) = L(x,u) + \gamma V^\pi(f(x,u))$. Then the optimal $Q^\star$ satisfies:
This is what Q-learning approximates. For LQR, $Q^\star(x,u) = [x;u]^\top H [x;u]$ is quadratic — and the greedy policy $u^\star = -H_{uu}^{-1} H_{ux}\,x$ is exactly the LQR gain. You can learn $H$ without knowing $(A,B)$ by solving least-squares on observed $(x,u,x')$ tuples. This is model-free LQR — a staple of adaptive dynamic programming.
If you know your plant, use the model: solve CARE/DARE, deploy. It's faster, more data-efficient, and has provable robustness bounds. Use RL/ADP when the model is unknown, time-varying, or when the state space is so large that parametric $V$/$Q$ approximation is necessary. For adaptive motor control, the Kleinman framework lets you update $K$ online using measured data without re-identifying $(A,B)$ — often the right tool.
Not theorems. Heuristics. The stuff that saves you a week when something isn't working.
Start with $Q = \text{diag}(1/x_{i,\max}^2)$, $R = \text{diag}(1/u_{j,\max}^2)$ using "acceptable peak deviation" for each channel. Then scale $Q$ or $R$ uniformly to trade performance for effort.
Fix $Q = I$ and sweep a scalar $\rho$ in $R = \rho I$. As $\rho \to 0$, the controller becomes aggressive; as $\rho \to \infty$, lazy. The bandwidth of the closed loop scales roughly like $\rho^{-1/2}$ for the double integrator. Plot the Bode/root locus as $\rho$ varies. This is the "LQR design curve."
If you think you need $Q_{12} \ne 0$ to "penalize the product of error and velocity," you usually don't. Re-examine whether your cost has physical meaning or you're over-specifying. Diagonal $Q$, $R$ covers 90% of practical cases.
Sample at $f_s \ge 10\text{–}20 \times$ the desired closed-loop bandwidth. $5\times$ is the absolute floor (Shannon minimum is much less, but the phase loss from ZOH kills you). For motor control: 10 kHz sampling is common for kHz-bandwidth current loops.
$A_d = e^{AT_s}$, $B_d = \int_0^{T_s} e^{A\tau}\, d\tau \cdot B$. Use matrix exponential (expm / c2d). Euler forward ($A_d = I + AT_s$) is fine for fast sampling but can destabilize if $\|A\|T_s \gtrsim 0.2$. Never Euler-discretize a fast mode you care about.
If you care about frequency response matching — not time-domain — Tustin $s = \tfrac{2}{T_s}\tfrac{z-1}{z+1}$ maps the $j\omega$ axis to the unit circle with warping. Use pre-warping at the crossover frequency.
Before tuning: verify $\text{rank}[B, AB, \ldots, A^{n-1}B] = n$ and dually for observability. If rank-deficient, LQR/Kalman won't be well-posed and you need to fix the structure (add a sensor, add an input, remove a state) — not tune harder.
Small eigenvalues of the controllability/observability Gramian indicate weakly controllable/observable modes. Technically you're controllable/observable, but in practice you need enormous gain or energy. Balanced truncation or re-design the plant.
For continuous LQR: eigenvalues should be in the open left half-plane. For discrete: inside unit circle. If you computed $K$ but the closed loop is unstable, your CARE solver returned the wrong fixed point (non-stabilizing solution). This happens with bad problem conditioning.
For MIMO, plot $\sigma_{\min}(I + L(j\omega))$ where $L = K(sI-A)^{-1}B$. Classical LQR guarantees $\sigma_{\min} \ge 1$. This is stronger than SISO margins and is the MIMO analog of the return-difference condition.
Use $P$ = solution of the unconstrained DARE as the terminal weight. This makes finite-horizon MPC behave like infinite-horizon LQR deep in the feasible region, which is the desired behavior. It also gives you a Lyapunov function for stability proofs, free.
Inside the constraint set, unconstrained MPC with $P$ = DARE should produce exactly the LQR response. If it doesn't, check: wrong $P$, wrong discretization, horizon too short, solver not converged, cost function scaling off.
Actuator limits are physical and hard. State constraints (velocity limits, position bounds) often want slack — use $\ell_1$-penalized slack variables to preserve feasibility when a disturbance briefly pushes the state out of the safe set. Never soft-constraint a safety-critical bound, though.
Shift previous solution by one step as initial guess for the next QP. This is free and can reduce solve time by 3–10×. Every serious MPC solver supports it.
If $x_1$ is in meters (O(1)) and $x_2$ is in rad/s (O(100)), condition numbers blow up. Scale to unit magnitude via a diagonal transformation $\tilde x = D^{-1} x$. Apply the inverse scaling to the resulting gain. This alone can fix half of "LQR weirdness" bug reports.
Floating-point roundoff breaks $P^\top = P$. After each Riccati iteration, replace $P \leftarrow (P + P^\top)/2$. Takes one line. Prevents drift toward indefiniteness.
Instead of $\Sigma^+ = (I - LC)\Sigma$, use $(I-LC)\Sigma(I-LC)^\top + LVL^\top$. Preserves symmetry and positive semi-definiteness exactly (in exact arithmetic) and robustly (in finite precision).
These are the issues that show up when your controller has to run in 100 μs on an automotive-grade MCU, not on a MATLAB desktop.
For LTI plants, push everything offline that possibly can be:
Automotive and aerospace platforms often mandate fixed-point to avoid floating-point exceptions and determinism issues (and because AUTOSAR-generated C from TargetLink frequently targets fixed-point integer types). Implications:
The TargetLink data dictionary type resolution matters. If your LQR gain block outputs a float32 but the consumer is a uint16 interface, you'll get implicit casts that wrap negative values — this is the exact class of bug that kills HCI sine signal paths and similar. Always enforce type consistency at module boundaries and use Lib_* blocks with explicit type configuration. See also the discussion of Lib_Switch type promotion rules in the TL reference.
Yes, it's feasible. Here's the reality:
The single most common failure mode for deployed optimal controllers: a subtle difference between the design model (continuous-time LTI, nominal parameters) and the as-deployed plant (discrete-time, time-varying, with delays, saturation, dead-zone, sensor latency). Always verify closed-loop stability under realistic deployment conditions — not just nominal. Add a 1–2 sample delay to your design model as a robustness check. If LQR breaks, your margins were too tight.
When you read a new optimal-control paper, come back here. We all use the same symbols with subtly different conventions.