Model-Based RL and MPC Hybrids

This is Post 6 in the field guide to reinforcement learning for control and robotics, and the one I most wanted to write. Previous posts worked through MDPs and the Bellman equations, TD learning, policy gradients, and the modern model-free toolkit — SAC, PPO, TD3, DDPG. Every algorithm we’ve covered learns directly from reward signals, without any model of the environment. This post is about the other half of the field: the methods that do build a model, and often do it while sitting on top of the control infrastructure you already trust.

If you’re a control engineer, this section is written for you. Model-based RL is the family that most resembles classical control — learn a dynamics model, then plan or compute a policy against it. It’s dramatically more sample-efficient than model-free methods, which matters enormously when every rollout takes a real dyno run or a real robot trajectory. It also composes naturally with MPC, which is probably already in your codebase. And the failure modes that plague it are diagnosable in ways that model-free RL is not.

The core trade-off

The difference between model-free and model-based RL is essentially the difference between “keep trying and remember what worked” and “build a simulator in your head and think ahead.”

	Model-free	Model-based
Samples needed	$10^6$ — $10^9$	$10^3$ — $10^5$
Asymptotic performance	Usually higher	Capped by model quality
Failure mode	Doesn’t learn	Policy exploits model errors
Safety	Hard to constrain	Easy (plan with constraints)
Good at	Complex, hard-to-model systems	Smooth, predictable dynamics

The sample-efficiency gap is usually three orders of magnitude. PILCO (1970s-style Gaussian-process dynamics + analytical gradients) can learn cartpole swing-up in roughly 15 seconds of real interaction data. SAC on the same task needs hundreds of thousands of steps. The trade-off is that model-based methods are capped by model quality — if your dynamics model is wrong in some regime, the policy trained against it will be wrong there too, even after it’s seen data. Model-free methods have no such ceiling; they’ll keep learning from rewards long after a model-based method has plateaued.

For real hardware, the trade-off is almost always worth taking. For massively parallel simulation, it’s usually not.

The four families of model-based RL

Modern model-based RL divides into four approaches that share a structure but differ in how they use the learned model.

Dyna-style — use the model to generate extra experience

Examples: MBPO, STEVE.

The simplest idea: learn a dynamics model $\hat f$, use it to generate imagined rollouts, mix those imagined transitions with real ones in the replay buffer, and train an off-policy model-free algorithm (usually SAC) on the augmented buffer. You get SAC’s asymptotic performance and robustness, but with far less real data — because most of the data the critic and policy see was synthesized from the model.

The trick is keeping imagined rollouts short. MBPO uses horizons as short as 1 to 5 steps. Errors in $\hat f$ compound exponentially, so long imagined rollouts quickly become fiction. Short rollouts stay close to the data distribution and remain useful. Dyna-style methods give you a clean sample-efficiency win without changing the downstream learner.

Planning (MPC) — use the model to think ahead

Examples: CEM, MPPI, iLQR, and the canonical PETS algorithm.

This is the entry point most control engineers recognize. Learn a dynamics model, then at each control step solve a short-horizon optimization online: 10 to 50 steps ahead, sample or roll out candidate action sequences, pick the one with the best predicted return, execute the first action, repeat. There’s no policy network in the usual sense — the “policy” is the re-planning loop itself.

A sketch of what this looks like for a CEM- or MPPI-style planner with a neural-network dynamics model:

def mpc_step(s_now, f_hat, reward_fn, horizon=20):
    # 1. Sample N candidate action sequences
    candidates = sample_actions(N=500, H=horizon)
    returns = []
    for seq in candidates:
        s, R = s_now, 0.0
        for t in range(horizon):
            R += reward_fn(s, seq[t]) * gamma**t
            s = f_hat(s, seq[t])          # learned dynamics
        returns.append(R)
    # 2. Refit sampling distribution
    #    CEM: fit Gaussian to top-k; MPPI: exp-weight by return
    # 3. Execute the first action of the best sequence
    return best_first_action(candidates, returns)

Extremely data-efficient, composable with hard constraints (the optimizer just rejects constraint-violating sequences), and compatible with whatever safety machinery you already have. The cost is compute at deployment — CEM with 500 samples × 20 horizon = 10,000 model evaluations per control step. Whether that’s acceptable depends on your control rate and available hardware. For a dyno running at 100 Hz with a GPU nearby, it’s fine. For an embedded TriCore running a motor-control loop at 20 kHz, it isn’t.

World models — learn a latent-state simulator, train the policy inside it

Examples: Dreamer, DreamerV3.

The idea: learn a recurrent latent-state dynamics model from raw observations (often pixels), then train an actor-critic entirely inside the imagined latent rollouts, never on real data during policy updates. This handles images, long horizons, and partial observability in a way that neither Dyna-style nor planning does cleanly.

DreamerV3 in particular set state-of-the-art on many benchmarks with a single set of fixed hyperparameters that transfer across domains. If you have vision-based observations or genuinely partial state, this is the family to reach for. The implementation complexity is higher than MBPO or PETS, though — you’re training three coupled networks (representation, dynamics, actor-critic) with delicate loss balancing.

Gaussian processes — the classic extreme of sample-efficiency

Example: PILCO (Deisenroth & Rasmussen, 2011).

Worth mentioning because it’s beautiful and because it connects to the Gaussian Processes post on this blog. Use a GP as the dynamics model, which gives you calibrated uncertainty for free. Propagate that uncertainty through the policy using moment matching, yielding analytic policy gradients. The result is absurdly sample-efficient on small problems — cartpole swing-up in roughly 15 seconds of real data is the famous demo.

The limitation: GP inference is $O(n^3)$ in the number of training points, so PILCO doesn’t scale past roughly 20-dimensional states or a few hundred data points. For what it is, nothing beats it. For what it isn’t — high-dimensional, long-horizon, pixel-based — it’s the wrong tool.

The pattern that matters most for control engineers — MPC with learned dynamics

If you already have MPC in production, you probably don’t need to rebuild your infrastructure to use RL. You just need to replace your analytical model with a learned one.

This is the most underappreciated path into RL for control teams. You keep your existing solver, your constraint handling, your safety certification story, your real-time scheduling — everything. You swap out $f_{\text{analytical}}$ for $\hat f_{\text{learned}}$, which you train offline on rollout data you’ve been logging anyway. The learned model captures effects your analytical model misses: friction, backlash, thermal drift, actuator nonlinearities, everything in the error signal between your model and reality.

Two concrete patterns work well in practice:

Residual dynamics. Write $\hat f(s, a) = f_{\text{analytical}}(s, a) + \Delta_\phi(s, a)$ where $\Delta_\phi$ is a neural network trained to predict the residual between your analytical model and reality. The network has an easy job — it only needs to learn what the analytical model got wrong — and the MPC solver still has the inductive bias of your original model structure. This is a recurring motif in recent papers on learned-MPC for autonomous driving, legged robots, and motor control.

Ensemble dynamics with uncertainty gating. Train 5 to 7 independently-initialized dynamics networks on the same data (with bootstrap sampling if you’re being rigorous). The disagreement across the ensemble is an uncertainty estimate. In the MPC objective, penalize states where the ensemble disagrees, or refuse to plan trajectories that pass through high-uncertainty regions. This is how PETS stays honest: the planner literally avoids the parts of state space where the model doesn’t know enough to be trusted.

The model-exploitation problem, and how to handle it

The failure mode that ruins every naive model-based method: the policy finds regions where $\hat f$ is inaccurate but predicts high reward, and exploits them mercilessly. The learned-dynamics network is just a function approximator — query it outside its training distribution and it will confidently return garbage. The planner is an optimizer and has no way of knowing that garbage isn’t real.

Three mitigations, in roughly the order you’d try them:

Limit imagined rollout length. MBPO horizons of 1 to 5 are not a compromise, they’re a defense. Errors compound exponentially; short is safe. If you’re doing Dyna-style, start at horizon 1 and only extend if you see clear gains.

Use ensembles. 5 to 7 independently-trained dynamics networks. When their predictions disagree, you’re outside the training distribution. Penalize the policy for visiting high-disagreement regions, or average their predictions, or reject trajectories whose ensemble variance exceeds a threshold. Ensembles are the single most effective technique I know in model-based RL — they change the failure mode from “policy silently converges to wrong thing” to “policy refuses to plan into unknown regions.”

Model-predictive correction. Use the learned policy as a warm-start for MPC, and re-plan online. The policy gives you a fast approximate answer; MPC refines it under the current dynamics estimate and current constraints. This is exactly the pattern in TD-MPC and TD-MPC2, which are currently state-of-the-art on many robotics benchmarks. It’s also extremely natural to adopt incrementally — you start with pure MPC, add a learned policy to warm-start the optimizer, add a value function as a terminal cost, and you’ve smoothly migrated from classical control to RL without a rewrite.

When to reach for model-based

The honest calculus: how expensive is one real rollout?

If one real rollout costs less than a minute of engineer time — fast sim, easy reset, no risk — model-free PPO or SAC with massive parallelism will usually win end-to-end. You can afford the samples.
If one real rollout costs more than a minute of engineer time — physical hardware, slow reset, risk of damage, expensive sim — model-based almost always wins. Sample efficiency dominates the economics.

The crossover roughly matches your day-to-day reality. A motor-control dyno run, a quadruped on a real floor, a manipulator with parts to scratch — model-based. A MuJoCo or Isaac Gym environment that runs 100k steps per second on a GPU — model-free.

For pragmatic single-algorithm choices in 2026: TD-MPC2 if you want one method that handles contact and complex dynamics with modest data, DreamerV3 if you’re working from pixels, MBPO if you already have SAC working and want a sample-efficiency boost without changing too much infrastructure, and residual-dynamics MPC if you’re a control team with existing MPC in production and want the minimal-risk path in.

What’s worth taking away

Three things.

One — model-based RL is closer to classical control than to model-free RL. If you’re already comfortable with MPC, the mental model is almost unchanged; you’ve just replaced hand-derived dynamics with learned ones. The algorithmic complexity is not in “how do I do RL” but in “how do I learn a dynamics model I can trust.” The latter is a problem control engineers are already equipped to think about — identifiability, generalization, uncertainty — even if the tooling (neural nets, ensembles) is newer than the concepts.

Two — the sample-efficiency gap is real and large. On real hardware where samples cost real time or money, model-based methods can genuinely be the difference between “we can train this” and “we can’t.” Don’t underestimate this. A lot of “RL doesn’t work for control” takes are really “model-free RL doesn’t work for control, and we didn’t try anything else.”

Three — the model-exploitation problem is the one to design around. Every model-based failure I’ve seen in production was some version of “the planner found a region of state space where the model was wrong and confidently exploited it.” Ensemble-based uncertainty and short rollouts defuse this almost entirely. Plan like you don’t trust your model, because you shouldn’t.

The next and final post in this series looks at the engineering realities that sit on top of any RL algorithm — sim-to-real transfer, offline learning from logged data, and safe RL. These are the topics no one teaches in school, and they’re usually the reason a project ships or doesn’t.

← Back to the field guide