Six paradigms define how machines acquire competence: from memorizing labeled examples, to discovering latent structure, to acting under consequence. This primer lays them out side-by-side — their mechanisms, their mathematics, and the problems each was born to solve.
Supervised learning is the best-understood corner of the field. Given a dataset of input-output pairs {(xᵢ, yᵢ)}, the learner searches for a function f : X → Y that minimizes the expected loss ℒ(f(x), y), the discrepancy between predicted and true labels. Every image classifier, spam filter, and neural flux predictor you encounter is, underneath, solving this same problem.
The choice of loss ℒ encodes what we mean by "good." Mean squared error for regression corresponds to maximum likelihood under Gaussian noise around the truth; cross-entropy for classification is the negative log-likelihood of the true label under a categorical distribution whose probabilities the model outputs. Pick the wrong one and you're optimizing the wrong problem.
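To make the two losses concrete, here is a minimal sketch in plain Python. The function names and the toy numbers are illustrative assumptions, not taken from any particular library.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def softmax(logits):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_class):
    """Negative log-probability the model assigns to the correct class."""
    return -math.log(softmax(logits)[true_class])

mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # (0 + 0 + 4) / 3
ce_uniform = cross_entropy([0.0, 0.0, 0.0], 1)              # -log(1/3) = log(3)
ce_confident = cross_entropy([0.0, 5.0, 0.0], 1)            # lower: class 1 dominates
```

The cross-entropy drops as the model puts more probability on the true class, which is exactly the behavior a classification loss should reward.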
Classification predicts discrete labels — is this cell malignant, is this transaction fraud, which of ten digits is drawn. Regression predicts continuous values — tomorrow's temperature, a motor's flux linkage at a given (id, iq) operating point. The machinery is shared; only the output head and loss change.
Supervised learning succeeds when three conditions hold: the hypothesis class is expressive enough to contain a good f, there is enough data to identify it, and the test distribution resembles the training distribution. Violating the third — distribution shift — is how most supervised systems fail silently in production.
Strip away the labels and a harder question remains: what does this data, on its own terms, want to tell us? Unsupervised learning asks the model to discover structure — clusters, manifolds, latent factors, densities — from x alone. The honest label for much of "AI" today is unsupervised or self-supervised: the internet is unlabeled, and we train on it anyway.
Clustering (K-means, GMM, DBSCAN) groups points by similarity. Dimensionality reduction (PCA, t-SNE, UMAP) finds low-dimensional coordinates for high-dimensional points. Density estimation (KDE, normalizing flows) models p(x) directly. Generative modeling (VAE, GAN, diffusion) goes one step further — it learns to sample from p(x), producing new data that looks like the old.
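As one concrete instance of the clustering family, the following is a small sketch of K-means (Lloyd's algorithm) in plain Python. The toy points and the hand-picked initial centroids are assumptions chosen so the result is easy to check by hand; real implementations add smarter initialization and convergence checks.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        centroids = new_centroids
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

No labels appear anywhere: the two groups are recovered from the geometry of x alone.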
Modern language models and vision transformers are trained on contrived "supervised" problems where the labels come for free: predict the next token, predict masked pixels, predict the rotation. This is self-supervised learning, unsupervised data dressed as a supervised problem, and it is what powers foundation models.
Labels are expensive. A domain expert annotating motor current waveforms or CT scans costs real money per hour. Unsupervised methods extract value from the large majority of your data that will never be labeled, and they often produce representations that transfer better than those learned with supervision.
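The "labels come for free" idea is easy to see for next-token prediction. The sketch below builds (context, target) training pairs from raw text; the whitespace tokenizer and the `context_size` parameter are simplifying assumptions for illustration.

```python
def next_token_pairs(tokens, context_size):
    """Slide a window over the sequence; each window's label is the token that follows it."""
    pairs = []
    for i in range(len(tokens) - context_size):
        context = tokens[i:i + context_size]
        target = tokens[i + context_size]
        pairs.append((context, target))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens, context_size=2)
# pairs[0] is (['the', 'cat'], 'sat'); nobody had to annotate anything.
```

Every pair looks like an ordinary supervised example, yet the targets were carved out of the unlabeled sequence itself.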
Reinforcement learning is the one paradigm that matches the structure of an agent. At each timestep the learner observes state s, picks action a according to its policy π(a|s), receives a scalar reward r, and transitions to a new state. The goal is to find the policy that maximizes expected cumulative reward — the return.
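The return can be computed from a reward sequence in a few lines. This sketch assumes a discount factor `gamma` (a common convention, with `gamma = 1` giving the plain sum); the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    g = 0.0
    # Work backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

undiscounted = discounted_return([1.0, 1.0, 1.0], gamma=1.0)  # 1 + 1 + 1
discounted = discounted_return([1.0, 1.0, 1.0], gamma=0.5)    # 1 + 0.5 + 0.25
```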
A single equation, the Bellman optimality equation Q*(s, a) = E[r + γ max_{a′} Q*(s′, a′)], underwrites most of RL. It says: the value of taking action a in state s equals the immediate reward plus the discounted value of acting optimally thereafter. Every algorithm below is, in some sense, a different numerical strategy for solving it.
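On a problem small enough to enumerate, the recursion described above (immediate reward plus discounted value of acting optimally afterwards) can be solved by repeated backups. The three-state chain, its rewards, and the discount factor below are hypothetical, chosen so the answer can be verified by hand.

```python
GAMMA = 0.9
TERMINAL = 2
# transitions[state][action] = (next_state, reward); action 0 = stay, 1 = move right.
# The environment is deterministic, so the expectation reduces to a single term.
transitions = {
    0: {0: (0, 0.0), 1: (1, 0.0)},
    1: {0: (1, 0.0), 1: (2, 1.0)},
}

def q_value_iteration(transitions, gamma, terminal, sweeps=100):
    """Apply the backup Q(s, a) <- r + gamma * max_a' Q(s', a') until it settles."""
    q = {s: {a: 0.0 for a in actions} for s, actions in transitions.items()}
    for _ in range(sweeps):
        for s, actions in transitions.items():
            for a, (s_next, reward) in actions.items():
                future = 0.0 if s_next == terminal else max(q[s_next].values())
                q[s][a] = reward + gamma * future
    return q

q = q_value_iteration(transitions, GAMMA, TERMINAL)
greedy_policy = {s: max(q[s], key=q[s].get) for s in q}
# By hand: Q(1, right) = 1, Q(1, stay) = 0.9, Q(0, right) = 0.9, Q(0, stay) = 0.81.
```

Methods such as Q-learning approximate the same fixed point from sampled transitions instead of a known transition table.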
Value-based methods (Q-learning, DQN, Rainbow) estimate Q* directly and derive the policy as the argmax over actions. They dominate discrete-action problems. Policy-gradient methods (REINFORCE, TRPO, PPO) parameterize π directly and ascend the gradient ∇θJ(πθ) of expected return. Actor-critic methods (A3C, SAC, TD3) learn both a policy (the actor) and a value function (the critic), using the critic to reduce the variance of policy-gradient estimates.
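To show what "ascending the gradient of J" means in the simplest possible case, here is a sketch of a softmax policy on a two-armed bandit. It uses the exact gradient of the expected reward rather than sampled estimates, so it is not REINFORCE itself; the arm rewards, learning rate, and step count are assumptions for illustration.

```python
import math

def softmax(prefs):
    """Action probabilities from preferences; subtracting the max avoids overflow."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(prefs, rewards, lr):
    """One exact gradient-ascent step on J = sum_a pi(a) * r(a)."""
    pi = softmax(prefs)
    expected = sum(p * r for p, r in zip(pi, rewards))
    # For a softmax policy, dJ/dprefs[a] = pi(a) * (r(a) - J).
    grad = [p * (r - expected) for p, r in zip(pi, rewards)]
    return [theta + lr * g for theta, g in zip(prefs, grad)]

rewards = [0.2, 1.0]   # hypothetical expected reward of each arm
prefs = [0.0, 0.0]     # start from a uniform policy
for _ in range(500):
    prefs = policy_gradient_step(prefs, rewards, lr=0.5)
final_policy = softmax(prefs)   # probability mass shifts toward the better arm
```

Sample-based policy-gradient methods replace the exact gradient with a noisy estimate from rollouts, which is where the critic's variance reduction becomes valuable.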
RL is hard because the agent only sees rewards for actions it takes. To discover that a given action is good, it must first try it. Balancing exploration (try new things) against exploitation (do what already works) is a recurring theme — addressed by ε-greedy, Boltzmann sampling, entropy bonuses, intrinsic motivation, and a hundred other tricks.
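Of the tricks listed, ε-greedy is the simplest to write down. The sketch below assumes a list of action-value estimates and a seeded random generator; both are illustrative stand-ins.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore uniformly at random, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
q_values = [0.1, 0.5, 0.2]
greedy_only = [epsilon_greedy(q_values, 0.0, rng) for _ in range(100)]    # always action 1
explore_only = [epsilon_greedy(q_values, 1.0, rng) for _ in range(1000)]  # uniform over actions
```

In practice ε usually starts high and is decayed over training, trading early exploration for later exploitation.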
Unlike supervised learning, there is no i.i.d. dataset sitting on disk. The agent generates its own training data through interaction, and that distribution shifts as the policy improves. Sample efficiency is poor, credit assignment over long horizons is brutal, and reward design is its own dark art.
Rewards are hard. Try to hand-write a reward function for "drive like a human" — safe, comfortable, polite, not too slow, not too fast, yields correctly at a four-way stop. You will fail, and the agent will learn some pathological loophole you didn't anticipate. Imitation learning sidesteps this by replacing the reward with demonstrations from an expert.
Behavioral Cloning (BC) treats the demonstration dataset {(s, a*)} as a supervised learning problem: predict the expert action from the state. It is simple, fast, and fragile. At deployment the policy drifts into states slightly off the expert's trajectory, where it has no training data, and its errors compound (the covariate shift problem).
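In its most stripped-down form, behavioral cloning over discrete states is a lookup table fit to the expert's choices. The state and action names below are invented for illustration, and a real system would use a function approximator rather than a table.

```python
from collections import Counter, defaultdict

def behavioral_cloning(demonstrations):
    """Fit a tabular policy: for each state, copy the expert's most frequent action."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}

demos = [("red_light", "brake"), ("green_light", "go"),
         ("red_light", "brake"), ("green_light", "go"),
         ("red_light", "go")]   # one noisy demonstration
policy = behavioral_cloning(demos)
unseen = policy.get("yellow_light")   # None: no demonstration covers this state
```

The missing entry for a state the expert never visited is the tabular version of the fragility described above: the cloned policy has nothing to say once it leaves the demonstrated distribution.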
Inverse Reinforcement Learning (IRL) is more ambitious: infer the reward function that would make the expert's behavior optimal, then run standard RL against that inferred reward. It recovers the intent behind the demos, not just the action.
Adversarial imitation (GAIL, AIRL) frames imitation as a two-player game: a discriminator tries to tell apart expert and agent trajectories, and the policy is trained to fool it. This inherits the strengths of GANs — and their instability.
When experts exist and are easier to record than to simulate, imitation can cut training time by orders of magnitude. Self-driving fleets harvest human driving data continuously; surgical robots are seeded with recordings of expert surgeons; robot manipulation uses teleoperation to bootstrap policies before fine-tuning with RL.
Transfer learning is the field's answer to a practical observation: training a deep model from scratch on your specific problem is usually wasteful. If someone has already trained a model on a related, data-rich source task, their learned representations are likely useful starting points for yours. Transfer is less a single algorithm and more a doctrine — don't start from zero if you don't have to.
Feature extraction treats the pretrained network as a frozen encoder and trains only a new head on top. Fast, cheap, and the right choice when the target dataset is small. Fine-tuning unfreezes some or all of the pretrained weights and continues training on the target task — giving the model room to specialize. Domain adaptation aligns representations across distributions when you have lots of labeled source data and unlabeled (or scarce-labeled) target data.
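The feature-extraction recipe can be sketched without any deep-learning library. Here `frozen_encoder` is a hand-written stand-in for a pretrained network (an assumption purely for illustration), and only the linear head on top of it is trained.

```python
def frozen_encoder(x):
    """Stand-in for a pretrained network; nothing in here is ever updated."""
    return [x, x * x]

def train_head(inputs, targets, lr=0.1, steps=2000):
    """Fit only a linear head (weights and bias) on frozen features via gradient descent on MSE."""
    features = [frozen_encoder(x) for x in inputs]   # computed once, since the encoder is frozen
    weights, bias = [0.0, 0.0], 0.0
    n = len(inputs)
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for phi, y in zip(features, targets):
            error = sum(w * f for w, f in zip(weights, phi)) + bias - y
            grad_w = [g + 2 * error * f / n for g, f in zip(grad_w, phi)]
            grad_b += 2 * error / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

inputs = [-1.0, -0.5, 0.0, 0.5, 1.0]
targets = [3 * x * x - 1 for x in inputs]   # toy target task
weights, bias = train_head(inputs, targets)
# The head recovers roughly weights = [0, 3] and bias = -1.
```

Fine-tuning would differ in one respect: the gradient would also flow into the encoder's parameters instead of stopping at the head.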
Sim-to-real transfer is the robotics special case: train a policy in simulation where data is free, then deploy on hardware. Parameter-efficient fine-tuning (LoRA, adapters, prompt tuning) is the LLM special case: adapt enormous pretrained models with a fraction of the parameters.
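A quick back-of-the-envelope calculation shows where the "fraction of the parameters" comes from in LoRA, which trains two low-rank factors while the original weight matrix stays frozen. The layer width of 4096 and rank of 8 are illustrative assumptions, not figures for any specific model.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable parameters for low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return d_out * rank + rank * d_in

d = 4096
full = full_finetune_params(d, d)      # 16,777,216
low_rank = lora_params(d, d, rank=8)   # 65,536
fraction = low_rank / full             # 1/256, about 0.4% of the layer
```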
The modern story of AI is transfer learning at planetary scale. A single giant model (GPT, Llama, Claude, CLIP, SAM) is pretrained once on essentially all available data, then every downstream task — code, translation, classification, agentic control — is approached by adapting that pretrained model rather than training fresh. The economics of this approach have reshaped the entire field.
The line that matters most in practical reinforcement learning isn't algorithmic — it's about data access. Can the agent interact with the real environment during training, or must it learn entirely from a fixed, pre-collected dataset? The answer reshapes every design choice that follows.
Online RL is the classical setting. The agent acts in the environment, observes the consequences, and uses fresh experience to update its policy, repeatedly, for millions of steps.
Offline RL, also called batch RL, gives the agent a fixed dataset of logged transitions, collected by some prior policy (possibly a human). The agent must extract the best policy it can without any further interaction.
In offline RL, naively applying online algorithms like DQN or SAC fails catastrophically. The learned Q-function becomes optimistic about actions that weren't in the dataset, because no counter-evidence exists. The policy picks those phantom-optimal actions, and at deployment produces nonsense. Modern offline algorithms (CQL, IQL, BCQ) solve this by penalizing out-of-distribution actions or constraining the policy to stay close to the data-generating distribution.
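The failure can be reproduced in a tabular toy. In the sketch below, an optimistic initial value stands in for the extrapolation error a neural Q-function would make on unseen actions (an assumption for illustration). Restricting the backup to actions that appear in the dataset is in the spirit of policy-constraint methods, but it is not an implementation of CQL, IQL, or BCQ.

```python
def offline_q(dataset, n_actions, gamma, init_value, constrain, sweeps=200):
    """Repeated Bellman backups over a fixed dataset of (s, a, r, s_next, done) tuples."""
    seen = {}          # actions actually observed in each state
    states = set()
    for s, a, r, s_next, done in dataset:
        seen.setdefault(s, set()).add(a)
        states.update((s, s_next))
    q = {s: [init_value] * n_actions for s in states}
    for _ in range(sweeps):
        for s, a, r, s_next, done in dataset:
            if done:
                target = r
            else:
                # Naive: max over every action. Constrained: only actions seen in the data.
                candidates = seen.get(s_next, set()) if constrain else range(n_actions)
                target = r + gamma * max((q[s_next][b] for b in candidates), default=0.0)
            q[s][a] = target
    return q

# Action 1 is never taken in state 1, so its value is never corrected by data.
dataset = [(0, 0, 0.0, 1, False), (1, 0, 1.0, 2, True)]
naive = offline_q(dataset, n_actions=2, gamma=0.9, init_value=10.0, constrain=False)
constrained = offline_q(dataset, n_actions=2, gamma=0.9, init_value=10.0, constrain=True)
# naive[0][0] is 9.0 (built on the phantom action); constrained[0][0] is 0.9.
```

The naive estimate is inflated tenfold by an action nobody ever tried, and a greedy policy in state 1 would pick exactly that action.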
Use online RL when a fast, accurate simulator exists or the environment is a digital system where mistakes are cheap. Use offline RL when interaction is expensive, slow, or dangerous — a car on a public road, a patient on a drug-dosing protocol, a chemical plant in operation. Many real deployments are hybrid: pretrain offline on logs, fine-tune online once a safe baseline is established.
| Paradigm | Input signal | Learning goal | Data needed | Failure mode | Canonical win |
|---|---|---|---|---|---|
| Supervised | (x, y) pairs | Predict y from x | Labeled & large | Distribution shift at deployment | Image classification, forecasting |
| Unsupervised | x only | Discover structure in x | Unlabeled & large | Structure found is not useful | Foundation model pretraining |
| Reinforcement | State, action, reward | Maximize expected return | Environment access | Reward hacking, poor sample efficiency | Game-playing, adaptive control |
| Imitation | Expert demonstrations | Reproduce expert behavior | Demo trajectories | Covariate shift, compounding errors | Autonomous driving, teleop robots |
| Transfer | Source + target task data | Adapt prior knowledge | Pretrained model + small target set | Negative transfer when source differs | LLM fine-tuning, sim-to-real |
| Online RL | Live interaction | Optimize via exploration | Fast, safe environment | Exploration damages hardware | Simulated robotics, games |
| Offline RL | Logged transitions | Policy from fixed data | Large behavioral dataset | Distributional shift, OOD actions | Healthcare, industrial control |