Reinforcement learning has an identity problem in engineering circles. Half the field treats it like a magic wand — sprinkle RL on a hard problem and watch it solve itself. The other half treats it like a toy — nice for games, useless on a dyno. Both views are wrong, and neither helps someone with a real control or robotics problem decide whether to reach for RL, and if so, which variant.

This is the first post in a series attempting to fix that — a working field guide to RL written from the perspective of someone who wants to actually ship policies onto physical systems. It’s opinionated, because useful guides have to be.

What this series covers

Seven posts. This one — the hub — lays out the map: the three architectural choices every RL algorithm makes, a quick-reference comparison of the algorithms you’ll see in every paper, a five-question selector that tells you what to reach for first, and a compressed cheatsheet of the hyperparameters and traps that take years to learn. It’s publishable on its own as an overview.

The six deep-dive posts below cover each topic with the math, the intuition, and an interactive demo where one makes sense. Each one can be read independently but they form a coherent arc.

  1. Foundations — MDPs, value functions, and the Bellman equations. The formalism and what it really means. The Markov assumption, reward shaping, discount factors, and why writing down the MDP correctly is 80% of the work.
  2. Temporal difference learning, SARSA, and Q-learning. Bootstrapping, the TD error, on-policy vs off-policy — and why Q-learning is the algorithm that gave rise to everything modern.
  3. Policy gradients and actor-critic. REINFORCE → advantage estimation → TRPO → PPO. Where the variance comes from and what GAE does about it.
  4. Exploration and modern deep RL — SAC, PPO, TD3, DDPG. The four algorithms that dominate continuous control, what each one actually does, and when each one is the right call.
  5. Model-based RL and MPC hybrids. Where RL meets your existing MPC infrastructure. Learned dynamics, uncertainty-aware planning, and the hybrid architectures that work on real hardware.
  6. Robotics in practice — sim-to-real, offline RL, and safe RL. The parts no one teaches in school: domain randomization, offline policy learning, safety filters, and the difference between a policy that works in simulation and one that deploys.

The rest of this post is self-contained and delivers real value on its own — even if you never click any of the links above, you’ll leave with a working taxonomy and concrete algorithmic recommendations.

The RL family tree, without the confusion

Every RL algorithm makes three architectural choices. Once you see them, all the acronyms click into place.

Axis 1 — Model-free vs model-based

Does the algorithm learn or use a model of the environment dynamics $P(s' \mid s, a)$?

  • Model-free learns a value function and/or policy directly from reward signals. Simpler, more general, needs more data. SAC, PPO, TD3, DQN. (Control analogy: direct adaptive control — no plant model, tune the controller from closed-loop data.)
Model-free — learn directly from the reward signal
The agent treats the environment as a black box. Observations and rewards flow in; actions flow out. No model of dynamics is ever built.
  • Model-based learns $\hat{P}$ and plans with it (MPC) or generates synthetic rollouts. Sample-efficient, but bias from the learned model can wreck the policy. PILCO, MBPO, Dreamer, learned-MPC. (Control analogy: an internal plant model — observer + MPC, with the observer learned from data instead of derived.)
Model-based — learn dynamics, plan against it
The agent trains a dynamics model from real experience, then rolls out imagined trajectories inside that model to plan without burning real samples.
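
To make the model-based side concrete, here is a minimal sketch of the simplest possible learned model: a count-based estimate of $\hat{P}(s' \mid s, a)$ for a small discrete problem. The class name `TabularModel` and its methods are hypothetical, chosen for illustration only; real model-based methods such as MBPO or Dreamer use neural networks, not tables.

```python
import random
from collections import defaultdict


class TabularModel:
    """Count-based estimate of P(s' | s, a) and of the mean reward (illustrative helper)."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.reward_sum = defaultdict(float)

    def update(self, s, a, r, s_next):
        # Record one real transition.
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r

    def p_hat(self, s, a):
        # Empirical transition probabilities; empty dict if (s, a) was never visited.
        n = sum(self.counts[(s, a)].values())
        return {s2: c / n for s2, c in self.counts[(s, a)].items()}

    def r_hat(self, s, a):
        # Empirical mean reward; assumes (s, a) has been visited at least once.
        n = sum(self.counts[(s, a)].values())
        return self.reward_sum[(s, a)] / n

    def sample(self, s, a, rng):
        # One imagined transition drawn from the learned model instead of the real system.
        probs = self.p_hat(s, a)
        s_next = rng.choices(list(probs), weights=list(probs.values()))[0]
        return self.r_hat(s, a), s_next


model = TabularModel()
for observed_next in ["s1", "s1", "s1", "s2"]:
    model.update("s0", "go", 1.0, observed_next)
imagined = model.sample("s0", "go", random.Random(0))
```

A model-free method would use those four real transitions once and move on. A model-based method can call `sample` as often as it likes, which is where both the sample efficiency and the model-bias risk come from.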

Axis 2 — Value-based vs policy-based vs actor-critic

  • Value-based — learn $Q^\star$ where the policy is implicitly $\arg\max_a Q$. Breaks with continuous actions. Q-learning, DQN. (Control analogy: dynamic-programming cost-to-go table.)
Value-based — learn Q, pick the argmax
The agent learns a Q-value for every discrete action at every state. The policy isn't stored — it's computed on demand as the arg-max action. As Q updates, the highlighted best action shifts.
  • Policy-based — learn $\pi$ directly by gradient ascent on expected return. Works with any action space but suffers high variance. REINFORCE, vanilla policy gradient. (Control analogy: parameterize the control law and tune its parameters directly against closed-loop performance.)
Policy-based — learn π directly
The policy is a probability distribution over actions that the agent adjusts directly via gradient ascent on expected return. Works for continuous actions where an arg-max isn't meaningful.
  • Actor-critic — both. The critic ($Q$ or $V$) is used to reduce the variance of policy updates. A2C, PPO, SAC, TD3, DDPG. This is where everything modern lives. (Control analogy: controller + online cost estimator — one block decides, one block judges.)
Actor-critic — one network decides, one network judges
The actor picks actions; the critic estimates their value. The critic's TD error becomes the signal that trains the actor — a variance-reduced advantage rather than a noisy raw return.
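
The actor-critic idea fits in a few lines if you drop the neural networks. The sketch below is a tabular one-step actor-critic update with a softmax actor and a state-value critic. It is an illustration under simplifying assumptions (discrete states and actions, hypothetical learning rates), not the update used by PPO or SAC.

```python
import math


def softmax(prefs):
    # Numerically stable softmax over a list of action preferences.
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_step(theta, v, s, a, r, s_next, done,
                      gamma=0.99, lr_actor=0.1, lr_critic=0.1):
    """One tabular actor-critic update, applied in place. Returns the TD error.

    theta[s] is a list of action preferences (the actor).
    v[s] is the state-value estimate (the critic).
    """
    target = r + (0.0 if done else gamma * v[s_next])
    delta = target - v[s]            # TD error, used as the advantage estimate
    v[s] += lr_critic * delta        # critic: move V(s) toward the bootstrap target
    probs = softmax(theta[s])
    for b in range(len(theta[s])):
        # Gradient of log pi(a|s) with respect to preference b for a softmax policy.
        grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += lr_actor * delta * grad_log_pi
    return delta
```

A positive TD error raises the probability of the action that was taken; a negative one lowers it. Replace `delta` with the full Monte Carlo return and you are back at REINFORCE, with its higher variance.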

Axis 3 — On-policy vs off-policy

  • On-policy algorithms can only learn from data generated by the current policy. You throw away each batch of rollouts after the policy update that consumes it (a single gradient step for A2C, a few epochs of minibatch steps for PPO). Stable but sample-inefficient. PPO, A2C, TRPO. (Control analogy: recursive least squares with a forgetting factor, where only recent measurements count.)
On-policy — collect, update once, discard
Rollouts from the current policy fill a buffer, the policy updates on them once, and the buffer empties. When the policy changes, old data is invalid.
  • Off-policy algorithms can reuse old data via a replay buffer. Often 10–100× more sample-efficient than on-policy methods, but trickier to stabilize. SAC, TD3, DDPG, DQN. (Control analogy: a persistent data logger feeding system identification, where every past measurement is still useful.)
Off-policy — replay buffer keeps everything
Transitions accumulate in a persistent buffer. The current policy samples random past transitions for every update, reusing experience from many past policy versions.
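
The difference between the two data regimes is easiest to see in the storage code. Below is a minimal sketch of both buffers; the class names and the `drain` method are hypothetical, and the replay buffer uses a fixed capacity (an assumption, since practical buffers evict the oldest data instead of keeping everything forever).

```python
import random
from collections import deque


class ReplayBuffer:
    """Off-policy storage: keeps the most recent `capacity` transitions."""

    def __init__(self, capacity, seed=0):
        self.data = deque(maxlen=capacity)   # oldest transitions fall off the left
        self.rng = random.Random(seed)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size):
        # Uniform random minibatch, mixing data from many past policy versions.
        return self.rng.sample(list(self.data), batch_size)

    def __len__(self):
        return len(self.data)


class RolloutBuffer:
    """On-policy storage: hand the whole batch to one update, then start empty."""

    def __init__(self):
        self.data = []

    def add(self, transition):
        self.data.append(transition)

    def drain(self):
        # Return everything collected under the current policy and reset.
        batch, self.data = self.data, []
        return batch
```

In an off-policy loop you call `sample` on every gradient step. In an on-policy loop you call `drain` once per policy update and collect fresh rollouts before the next one.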

For robotics specifically. Real-robot samples are expensive. You almost always want off-policy (to reuse data) and actor-critic (to handle continuous actions). That narrows the field to SAC, TD3, DDPG, and model-based methods — which is why those four dominate the robotics literature.

RL ≈ Adaptive Optimal Control. If you read those three axes as a control engineer, you’ve seen them before under different names. RL is, in essence, adaptive optimal control — the agent solves an optimal control problem whose dynamics and/or cost are not known analytically, so it has to learn them from measurements. The algorithmic families above are three orthogonal choices in how to run that adaptation: what to model, what to parameterize, and what data to use.

Quick-reference comparison

| Algorithm | Actions | On/off-policy | Sample efficiency | Robustness |
|---|---|---|---|---|
| Q-learning | Discrete | Off | High | OK |
| DQN | Discrete | Off | Medium | OK |
| REINFORCE | Any | On | Low | Poor |
| A2C / A3C | Any | On | Low | OK |
| PPO | Any | On | Medium | Excellent |
| DDPG | Continuous | Off | High | Fragile |
| TD3 | Continuous | Off | High | Good |
| SAC | Continuous | Off | High | Excellent |
| MBPO / Dreamer | Continuous | Off (model-based) | Very high | Depends on model |

Algorithm selector

Answer five questions. Get a recommendation with reasoning. Not a substitute for experimenting — but a solid starting point that will keep you from wasting a week on the wrong family of methods.

  1. What do your actions look like?
  2. How expensive is collecting data?
  3. How smooth / Markovian is the system?
  4. Do you need a deterministic policy?
  5. Safety constraints during training / deployment?

Recommendation: a starting point, not a verdict. Always A/B test against a simple baseline (random search, classic control, a hand-tuned PID) before committing.
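
The five questions above can be turned into a rough rule table. The sketch below is one possible encoding, not the logic of the interactive selector: the function name, argument values, and every rule in it are assumptions, pieced together from the defaults this post states elsewhere (PPO for cheap simulation, SAC for expensive robot data, TD3 when the policy should be deterministic, DQN for discrete off-policy problems). The notes about partial observability are standard practice, not something this post spells out.

```python
def suggest_algorithm(actions, data_cost, markovian=True,
                      deterministic=False, safety_constraints=False):
    """Return (algorithm, notes) from five coarse answers. Hypothetical rules."""
    if actions not in ("discrete", "continuous"):
        raise ValueError("actions must be 'discrete' or 'continuous'")
    if data_cost not in ("cheap", "expensive"):
        raise ValueError("data_cost must be 'cheap' or 'expensive'")

    notes = []
    if data_cost == "cheap":
        algo = "PPO"
        notes.append("cheap data: on-policy stability is worth the extra samples")
    elif actions == "discrete":
        algo = "DQN"
        notes.append("expensive data, discrete actions: reuse data off-policy")
    elif deterministic:
        algo = "TD3"
        notes.append("expensive data, deterministic continuous policy")
    else:
        algo = "SAC"
        notes.append("expensive data, continuous actions: the robotics default")

    if not markovian:
        notes.append("treat as a POMDP: stack observations or use a recurrent policy")
    if safety_constraints:
        notes.append("add a safety layer: a CBF filter at deployment, or CPO in training")
    return algo, notes


algo, notes = suggest_algorithm("continuous", "expensive", safety_constraints=True)
```

Treat the output the same way as the recommendation above: a first guess to test against a simple baseline.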

Hyperparameters, debugging, and silent failures

These deserve their own post. See Practical RL Engineering: Hyperparameters, Debugging, and Silent Failures for the universal deep-RL defaults that almost always work, the order in which to debug a policy that won’t learn, and the traps that silently cost you weekends. It’s the most important page in the series for practical work.

When to stop RL and use classical control

RL is a tool, not a goal. Reach for it only when simpler methods have failed.

  • Linear-ish dynamics, quadratic cost — use LQR. Closed-form, optimal, beats anything you’d train with RL.
  • Known nonlinear model with constraints — use NMPC. It works, it’s understood, it gives guarantees.
  • Tracking with well-modeled disturbances — $H_\infty$ or robust MPC. RL gives you no guarantees here.
  • For tasks with decades of engineering maturity (PID for current loops, cascaded control for drives), don’t use RL.

RL shines where modeling is hard, rewards are complex, or the solution is non-convex, not in places where you already have good analytical tools.
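
To see how little work the LQR case needs, here is a sketch for a scalar plant $x_{t+1} = a x_t + b u_t$ with stage cost $q x_t^2 + r u_t^2$. It iterates the discrete-time Riccati recursion until the cost-to-go coefficient settles. The scalar setting, the function name, and the numbers in the usage lines are assumptions for illustration; a real design would use matrices and a library Riccati solver.

```python
def scalar_lqr(a, b, q, r, iters=500):
    """Infinite-horizon LQR for x[t+1] = a*x[t] + b*u[t], cost sum(q*x^2 + r*u^2).

    Returns (k, p): the feedback gain in u = -k*x and the cost-to-go coefficient
    in V(x) = p*x^2, found by fixed-point iteration of the Riccati recursion.
    """
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    k = a * b * p / (r + b * b * p)
    return k, p


# An open-loop unstable plant (|a| > 1) stabilized by the resulting gain.
k, p = scalar_lqr(a=1.1, b=1.0, q=1.0, r=1.0)
closed_loop_pole = 1.1 - 1.0 * k
```

No rollouts, no reward curves, no seeds. If your problem is close to this shape, an RL policy has to work hard just to match it.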

The smartest RL teams use classical control where it works and RL only where it adds something. A learned residual on top of a stabilizing controller is almost always safer and more sample-efficient than learning the whole thing end-to-end.
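
A residual architecture can be sketched in a few lines. The example below assumes a PD controller as the stabilizing base and treats the learned policy as any callable that returns a correction; `pd_controller`, `residual_action`, the gains, and `residual_limit` are hypothetical names and values, not a reference design.

```python
def pd_controller(error, error_rate, kp=2.0, kd=0.5):
    # Classical stabilizing feedback; gains are placeholders.
    return kp * error + kd * error_rate


def residual_action(error, error_rate, learned_residual, residual_limit=0.2):
    """Base PD command plus a bounded learned correction."""
    base = pd_controller(error, error_rate)
    correction = learned_residual(error, error_rate)
    # Clip the learned part so a bad policy can only nudge the base command.
    correction = max(-residual_limit, min(residual_limit, correction))
    return base + correction


# A wildly wrong "policy" still produces a command close to the PD output.
u = residual_action(0.5, 0.0, learned_residual=lambda e, de: 10.0)
```

The clip is what buys the safety argument: the learned term has bounded authority, so the closed loop stays near the behavior of the base controller while the policy is still learning.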

Further reading, curated

  • Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed., 2018). The canonical textbook. Chapters 1–10 are essential; the rest is reference.
  • Spinning Up in Deep RL — OpenAI’s educational codebase. Read the essay; copy the algorithm implementations; use them as templates.
  • Stable-Baselines3 and CleanRL — production-grade implementations you can actually extend. CleanRL for single-file reference; SB3 for real projects.
  • “The 37 Implementation Details of PPO” (Huang et al.) — mandatory reading before writing PPO from scratch.
  • DreamerV3 and TD-MPC2 — the current state-of-the-art for “one algorithm that just works.”
  • Berkeley CS285 (Levine) — free lectures, rigorous, current.
  • Isaac Lab / Isaac Gym — where to train robotics policies fast.

Glossary of acronyms

Every post in this series uses a few of these. Bookmark this section — it’s the single reference for the whole field guide. Acronyms are grouped by what they are, not alphabetically, because that’s the grouping that actually helps remember them.

Foundational concepts.

  • MDP — Markov Decision Process. The formal object every RL algorithm operates on. Defined as the tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$.
  • POMDP — Partially Observable MDP. An MDP where the agent sees an observation $o_t$ that is a noisy or incomplete function of the true state $s_t$.
  • CMDP — Constrained MDP. An MDP with additional cost functions that must stay below thresholds. The formalism for safe RL.
  • TD — Temporal Difference. The one-step bootstrap update rule that defines everything from SARSA to SAC. $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.
  • GAE — Generalized Advantage Estimation (Schulman et al., 2016). An exponentially-weighted sum of TD errors used as the advantage signal in PPO and most on-policy methods. The standard choice of $\lambda$ is 0.95.
  • HJB: Hamilton-Jacobi-Bellman equation. The continuous-time optimal-control PDE. The Bellman equation is its discrete-time, stochastic, reward-signed analog.
  • MC — Monte Carlo. Methods that estimate values by averaging complete episode returns. High variance, zero bias.
  • KL — Kullback-Leibler divergence. Measure of how different two probability distributions are. Used to constrain how much a policy can change per update (TRPO, PPO).
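
The TD error and GAE entries above translate directly into code. This sketch computes GAE advantages for one rollout by accumulating discounted TD errors backward in time. It assumes a single trajectory with no episode boundary inside it; the function name and argument layout are illustrative.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for a single rollout.

    rewards[t] is r_t and values[t] is V(s_t). last_value is V of the state
    after the final step (use 0.0 if the rollout ended in a terminal state).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # TD error at step t
        running = delta + gamma * lam * running               # exponentially weighted sum
        advantages[t] = running
        next_value = values[t]
    return advantages


adv = gae_advantages([1.0, 1.0], [0.5, 0.5], last_value=0.0, gamma=1.0, lam=1.0)
```

With `lam=0.0` each advantage is just the one-step TD error (low variance, more bias). With `lam=1.0` it becomes the Monte Carlo return minus the value estimate (no bootstrap bias, high variance). The usual 0.95 sits between the two.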

Classical / value-based algorithms.

  • SARSA — State-Action-Reward-State-Action. On-policy TD control algorithm named for the five things its update rule uses.
  • DQN — Deep Q-Network (Mnih et al., 2015). The original deep RL success on Atari. Q-learning with a neural network, replay buffer, and target network.
  • Rainbow DQN — DQN with six additional improvements (Double Q, Dueling, Prioritized Experience Replay, Noisy Nets, Distributional RL, Multi-step returns) combined.
  • UCB — Upper Confidence Bound. Exploration strategy that picks actions by $\arg\max_a[\hat Q_a + c\sqrt{\ln t / N_a}]$. Provably logarithmic regret in bandits.
  • PUCT — Predictor + UCB applied to Trees. The tree-search formula from AlphaZero.
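
The UCB rule from the list above is short enough to write out in full. This is a sketch for a multi-armed bandit; the function name and the default `c` are assumptions, and pulling every untried arm first is a common convention to avoid dividing by a zero count.

```python
import math


def ucb_select(q_hat, counts, t, c=2.0):
    """Pick an arm by argmax of q_hat[a] + c * sqrt(ln(t) / counts[a]).

    q_hat[a] is the running mean reward of arm a, counts[a] the number of
    pulls so far, and t the total number of pulls (t >= 1).
    """
    for a, n in enumerate(counts):
        if n == 0:
            return a                      # try every arm once before using the bonus
    scores = [q + c * math.sqrt(math.log(t) / n) for q, n in zip(q_hat, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```

The bonus term shrinks as an arm is pulled more often, so a rarely tried arm with a mediocre estimate can still win the argmax. That is the whole exploration mechanism.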

Policy-gradient and actor-critic algorithms.

  • REINFORCE — The original Monte-Carlo policy gradient algorithm (Williams, 1992). Named partly as an acronym: REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility.
  • A2C — Advantage Actor-Critic. Synchronous version of A3C.
  • A3C — Asynchronous Advantage Actor-Critic (Mnih et al., 2016). Multi-worker on-policy actor-critic with asynchronous updates.
  • TRPO — Trust Region Policy Optimization (Schulman et al., 2015). Policy gradient method with a hard KL-divergence constraint.
  • PPO — Proximal Policy Optimization (Schulman et al., 2017). TRPO’s first-order replacement. The clipped-ratio surrogate is its defining trick. The simulation default.
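
PPO’s clipped-ratio surrogate is easier to reason about with numbers in front of you. The sketch below evaluates the objective for a batch of log-probabilities and advantages in plain Python. The function name is illustrative and `clip_eps=0.2` is a commonly used value, assumed here; a real implementation would compute this on tensors and differentiate through `logp_new`.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                                   # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)                    # pessimistic of the two
    return total / len(advantages)


# The ratio is 1.5 but the advantage is positive, so the gain is capped at 1.2.
capped = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
```

Note the asymmetry: the clip removes the incentive to push the ratio further in the direction the advantage favors, but it never hides a loss. With ratio 1.5 and advantage -1 the objective is the unclipped -1.5.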

Off-policy deterministic and stochastic methods.

  • DPG — Deterministic Policy Gradient (Silver et al., 2014). Underlying theorem for DDPG.
  • DDPG — Deep Deterministic Policy Gradient (Lillicrap et al., 2015). First widely-used deep RL for continuous control. Fragile; use TD3 or SAC instead.
  • TD3 — Twin Delayed Deep Deterministic policy gradient (Fujimoto et al., 2018). DDPG with three stability fixes: twin critics, delayed actor updates, target policy smoothing.
  • SAC — Soft Actor-Critic (Haarnoja et al., 2018). Actor-critic with maximum-entropy objective. The robotics default.
  • V-trace: The off-policy correction used in IMPALA. Truncated importance sampling that lets a learner train on data from slightly stale actor policies.
  • IMPALA — Importance-weighted Actor-Learner Architecture. Distributed actor-critic with V-trace.
  • MPO — Maximum a Posteriori Policy Optimization. KL-regularized actor-critic from DeepMind.

Model-based methods.

  • MPC — Model Predictive Control. The classical controller template; RL methods inherit or extend it.
  • MPPI — Model Predictive Path Integral (Williams et al., 2017). Sampling-based MPC. Exp-weighted over sampled trajectories.
  • CEM — Cross Entropy Method. Sampling-based optimization; iteratively refits a sampling distribution to top-performing samples.
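  • iLQR: iterative Linear Quadratic Regulator. Gradient-based trajectory optimizer that repeatedly linearizes the dynamics and quadratizes the cost around the current trajectory, then solves the resulting LQR problem. Often run in a receding-horizon loop as MPC.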
  • PILCO — Probabilistic Inference for Learning COntrol (Deisenroth & Rasmussen, 2011). GP dynamics + analytic policy gradients. Absurdly sample-efficient.
  • GP — Gaussian Process. A distribution over functions. Used as a dynamics model in PILCO; see the GP post for details.
  • PETS — Probabilistic Ensembles with Trajectory Sampling (Chua et al., 2018). Ensemble dynamics + MPC. The canonical “MPC with learned dynamics” method.
  • MBPO — Model-Based Policy Optimization (Janner et al., 2019). Dyna-style — generate short imagined rollouts from a learned model, train SAC on real + imagined.
  • STEVE — Stochastic Ensemble Value Expansion. Model-based variant that weighs imagined rollouts by ensemble uncertainty.
  • Dreamer / DreamerV3 (Hafner et al., 2020, 2023) — World-model RL. Latent dynamics + actor-critic trained entirely in imagination. DreamerV3 is the fixed-hyperparameter variant.
  • TD-MPC / TD-MPC2 (Hansen et al., 2022, 2024) — Hybrid of learned world model, online MPPI planning, and an amortized policy. Current SOTA on many robotics benchmarks.
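
Several entries above (CEM, PETS, and sampling-based MPC in general) share the same inner loop: sample candidates, keep the best, refit the sampling distribution. Here is a one-dimensional sketch of CEM minimizing a cost function. The function name, sample counts, and the small floor added to the standard deviation are assumptions; in MPC the decision variable would be a whole action sequence and the cost would come from rolling out a dynamics model.

```python
import random
import statistics


def cem_minimize(cost, mean=0.0, std=2.0, n_samples=50, n_elite=10, iters=20, seed=0):
    """Cross-entropy method over a single scalar decision variable."""
    rng = random.Random(seed)
    for _ in range(iters):
        samples = [rng.gauss(mean, std) for _ in range(n_samples)]
        elites = sorted(samples, key=cost)[:n_elite]      # lowest-cost candidates
        mean = statistics.mean(elites)                    # refit the Gaussian to the elites
        std = statistics.pstdev(elites) + 1e-6            # floor keeps the std positive
    return mean


best = cem_minimize(lambda x: (x - 3.0) ** 2)
```

MPPI follows the same pattern but replaces the hard elite cut with an exponential weighting over all samples.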

Exploration and intrinsic-motivation methods.

  • ε-greedy — With probability $\varepsilon$ pick a random action, else pick greedy. Simplest thing that works.
  • OU noise — Ornstein-Uhlenbeck noise. Temporally-correlated Gaussian noise originally used in DDPG exploration. Mostly deprecated in favor of plain Gaussian.
  • NoisyNets (Fortunato et al., 2017) — Replace linear layers with learned-noise variants. Parameter-space exploration.
  • ICM — Intrinsic Curiosity Module (Pathak et al., 2017). Prediction-error-based exploration bonus learned via inverse-dynamics features.
  • RND — Random Network Distillation (Burda et al., 2018). Prediction error against a fixed random network as the intrinsic reward. Dominated Montezuma’s Revenge.
  • NGU / Agent57 (Badia et al., 2020): DeepMind’s exploration-heavy agents. Agent57, which builds on NGU (Never Give Up), was the first to score above the human baseline on all 57 games of the Atari benchmark.
  • HER — Hindsight Experience Replay (Andrychowicz et al., 2017). Relabel failed rollouts with the state they actually reached as the goal. Crucial for goal-conditioned sparse-reward tasks.
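
The HER relabeling trick is mostly bookkeeping. The sketch below implements the simplest variant, where every step of an episode is relabeled with the final achieved state as its goal. The dictionary keys (`obs`, `action`, `achieved`, `goal`, `reward`) and the sparse reward convention of 0 on success and -1 otherwise are assumptions for illustration.

```python
def sparse_reward(achieved, goal):
    # 0 when the goal is reached, -1 otherwise.
    return 0.0 if achieved == goal else -1.0


def her_relabel_final(episode, reward_fn):
    """Return a relabeled copy of the episode, using the last achieved state as goal.

    Each step is a dict where `achieved` is the goal-space state reached
    after taking `action`. The input episode is left unchanged.
    """
    new_goal = episode[-1]["achieved"]
    return [
        {**step, "goal": new_goal, "reward": reward_fn(step["achieved"], new_goal)}
        for step in episode
    ]


# A rollout that aimed for 10 and only reached 3 becomes a success for goal 3.
episode = [
    {"obs": 0, "action": 1, "achieved": 1, "goal": 10},
    {"obs": 1, "action": 1, "achieved": 2, "goal": 10},
    {"obs": 2, "action": 1, "achieved": 3, "goal": 10},
]
relabeled = her_relabel_final(episode, sparse_reward)
```

Both the original and the relabeled transitions go into the replay buffer, which is why HER only pairs with off-policy learners.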

Imitation and offline RL.

  • BC — Behavioral Cloning. Supervised learning on expert $(s, a)$ pairs. First thing to try when you have demonstrations.
  • DAgger — Dataset Aggregation (Ross et al., 2011). Iterative BC with expert queries on states the learner visits.
  • GAIL — Generative Adversarial Imitation Learning (Ho & Ermon, 2016). Imitation as a two-player game against a discriminator.
  • IRL — Inverse Reinforcement Learning. Infer a reward function from expert behavior, then run forward RL on it.
  • CQL — Conservative Q-Learning (Kumar et al., 2020). Offline RL method that pushes Q-values down on out-of-distribution actions.
  • IQL — Implicit Q-Learning (Kostrikov et al., 2021). Offline RL that avoids evaluating the critic on OOD actions by using expectile regression.
  • TD3+BC (Fujimoto & Gu, 2021) — TD3 with an added BC loss term. Simplest offline RL that works.
  • OOD — Out-of-distribution. Refers to actions or states the policy encounters that weren’t in the training data. The central risk in offline RL.

Safety and robustness.

  • CPO — Constrained Policy Optimization (Achiam et al., 2017). Trust-region CMDP solver.
  • CBF — Control Barrier Function. Differentiable scalar $h(x) \geq 0$ defining a safe set; safety becomes a single-inequality constraint enforceable via a QP.
  • CVaR — Conditional Value at Risk. Expected return conditional on being in the worst $\alpha\%$ of outcomes. Risk-sensitive objective.
  • QP — Quadratic Program. The optimization problem class that CBF safety filters reduce to.
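
For a one-dimensional system the CBF quadratic program has a closed-form solution, which makes the idea easy to check by hand. The sketch below assumes a single integrator $\dot{x} = u$ with a position limit $x \le x_{\max}$, so $h(x) = x_{\max} - x$ and the CBF condition $\dot{h} + \alpha h \ge 0$ reduces to $u \le \alpha (x_{\max} - x)$. The dynamics, function name, and `alpha` are illustrative assumptions; real systems need a proper QP solver.

```python
def cbf_filter(x, u_nom, x_max, alpha=1.0):
    """Minimally modify u_nom so that x_dot = u keeps x <= x_max.

    Solves min (u - u_nom)^2 subject to u <= alpha * (x_max - x), which for
    a single scalar constraint is just a clamp from above.
    """
    u_bound = alpha * (x_max - x)     # largest command the barrier condition allows
    return min(u_nom, u_bound)


far_from_limit = cbf_filter(x=0.0, u_nom=0.5, x_max=1.0)   # nominal command passes through
near_limit = cbf_filter(x=0.9, u_nom=0.5, x_max=1.0)       # command is cut back to about 0.1
```

The filter does nothing until the nominal command would approach the boundary too fast, which is why it can wrap an RL policy without changing its behavior in the interior of the safe set.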

Robotics and practical engineering.

  • DoF / DOF — Degree(s) of Freedom. A 7-DOF arm has seven independent joint axes.
  • PD / PID — Proportional-Derivative / Proportional-Integral-Derivative controllers. The low-level feedback that RL setpoints often sit on top of.
  • LQR — Linear Quadratic Regulator. Closed-form optimal control for linear dynamics and quadratic cost. The classical benchmark.
  • NMPC — Nonlinear Model Predictive Control. MPC with nonlinear dynamics and/or costs.
  • DR — Domain Randomization. Randomize simulator physics parameters during training to produce a policy robust to sim-to-real gap.
  • SysID — System Identification. Measuring a real system’s parameters to calibrate a simulator or model.
  • NVH — Noise, Vibration, Harshness. Automotive acoustic/vibrational quality metric often included as a soft cost in motor-control RL.
  • PMSM — Permanent Magnet Synchronous Motor. The dominant electric-drive topology in automotive traction.
  • AFM — Axial Flux Motor. Alternative motor topology with axial rather than radial flux.
  • FOC — Field Oriented Control. The classical torque-control scheme for PMSMs.
  • BLDC — Brushless DC motor. Related topology; see the FOC post if it’s been written.

Infrastructure.

  • W&B — Weights & Biases. Experiment tracking platform.
  • SB3 — Stable-Baselines3. Popular PyTorch RL library.
  • MuJoCo / MJX — Multi-Joint dynamics with Contact / JAX port. Physics simulators.
  • Isaac Gym / Isaac Lab — NVIDIA’s GPU-accelerated robotics simulators.

The final rule

The best RL practitioners are people who would happily ship a PID controller if it worked. RL is a tool, not a goal. Measure it against the simplest thing that could possibly work, and only keep it if it wins.

This series goes deep into the methods that do win. The foundations post is next — and it starts, as every RL post should, with writing down the MDP.