Learning from Data — A Taxonomy

The shapes of learning, from labels to agency.

Six paradigms define how machines acquire competence: from memorizing labeled examples, to discovering latent structure, to acting under consequence. This primer lays them out side-by-side — their mechanisms, their mathematics, and the problems each was born to solve.

By Majid Mazouchi

Paradigms: 06

Core Equations: Included

Application Domains: 30+

Level: Intermediate

I.

Supervised Learning

Learning from labeled examples

Supervised learning is the best-understood corner of the field. Given a dataset of input-output pairs {(xᵢ, yᵢ)} drawn from a data distribution 𝒟, the learner searches for a function f_θ : X → Y, parameterized by θ, that minimizes the expected discrepancy between predicted and true labels. Every image classifier, spam filter, and neural flux predictor you encounter is, underneath, solving this same problem.

Objective

θ* = argmin_θ 𝔼_{(x,y)∼𝒟} [ ℒ( f_θ(x), y ) ]

The choice of loss ℒ encodes what we mean by "good." Mean squared error for regression corresponds to maximum likelihood under Gaussian noise around the truth; cross-entropy for classification is the negative log-likelihood of the labels under a categorical distribution parameterized by the model's outputs. Pick the wrong one and you're optimizing the wrong problem.
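To make the link between loss and noise model concrete, here is a minimal NumPy sketch of the two losses named above. The function names `mse` and `cross_entropy` are illustrative and not taken from any library, and the classifier is assumed to emit raw scores (logits) that a softmax turns into categorical probabilities.

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error.

    Equal to the Gaussian negative log-likelihood (fixed variance)
    up to a scale factor and an additive constant.
    """
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of integer labels under softmax(logits)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out the log-probability assigned to the true class of each row.
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))

reg_loss = mse([1.0, 2.0], [1.0, 4.0])          # (0 + 4) / 2 = 2.0
cls_loss = cross_entropy([[0.0, 0.0]], [0])     # uniform over 2 classes: log(2)
```

With two equal logits the model is maximally uncertain between two classes, so the loss is log 2; a confident correct prediction drives it toward zero.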

Two regimes: classification & regression

Classification predicts discrete labels — is this cell malignant, is this transaction fraud, which of ten digits is drawn. Regression predicts continuous values — tomorrow's temperature, a motor's flux linkage at a given (id, iq) operating point. The machinery is shared; only the output head and loss change.
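The claim that the machinery is shared can be sketched in a few lines, assuming a linear model trained by plain gradient descent. Everything here is an illustrative assumption: the function name `fit`, the learning rate, the step count, and the synthetic data. Only the output head changes between the two tasks; with the matching loss (half mean squared error, or binary cross-entropy) the gradient takes the same form.

```python
import numpy as np

def fit(X, y, task, lr=0.1, steps=2000):
    """Gradient descent on a linear model; `task` selects the output head."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        z = X @ w
        if task == "regression":
            pred = z                           # identity head, squared-error loss
        elif task == "classification":
            pred = 1.0 / (1.0 + np.exp(-z))    # sigmoid head, cross-entropy loss
        else:
            raise ValueError(f"unknown task: {task}")
        # Same gradient expression for both head/loss pairs.
        w -= lr * X.T @ (pred - y) / len(y)
    return w

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])      # bias column + one feature

# Regression target: y = 1 + 2x plus small Gaussian noise.
y_reg = 1.0 + 2.0 * x + 0.1 * rng.normal(size=200)
w_reg = fit(X, y_reg, "regression")

# Classification target: is the feature positive?
y_cls = (x > 0).astype(float)
w_cls = fit(X, y_cls, "classification")
accuracy = float(np.mean((X @ w_cls > 0) == (y_cls == 1)))
```

On this toy data `w_reg` should land close to the generating coefficients (1, 2), and the classifier should separate the two classes almost perfectly.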

The bias-variance bargain

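Supervised learning succeeds when three conditions hold: the hypothesis class is expressive enough to contain a good f, there is enough data to identify it, and the test distribution resembles the training distribution. The first two are the bargain itself: a class that is too narrow underfits (high bias), while one that is too rich for the available data overfits (high variance). Violating the third, distribution shift, is how many supervised systems fail silently in production.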

Linear / Logistic Regression · SVM · Random Forest · Gradient Boosting · MLP · CNN · Transformer

II.

Unsupervised Learning

Structure without supervision

Strip away the labels and a harder question remains: what does this data, on its own terms, want to tell us? Unsupervised learning asks the model to discover structure — clusters, manifolds, latent factors, densities — from x alone. The honest label for much of "AI" today is unsupervised or self-supervised: the internet is unlabeled, and we train on it anyway.

Four families

Clustering (K-means, GMM, DBSCAN) groups points by similarity. Dimensionality reduction (PCA, t-SNE, UMAP) finds low-dimensional coordinates for high-dimensional points. Density estimation (KDE, normalizing flows) models p(x) directly. Generative modeling (VAE, GAN, diffusion) goes one step further — it learns to sample from p(x), producing new data that looks like the old.
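As one concrete instance of the clustering family, here is a minimal sketch of K-means (Lloyd's algorithm) in NumPy. The farthest-point initialization is an assumption chosen to keep the example deterministic; it is a simple heuristic, not the standard random or k-means++ seeding. The two-blob dataset is synthetic.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    # Start from the first point, then repeatedly add the point farthest
    # from all centers chosen so far (illustrative heuristic).
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)

    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),    # blob around (0, 0)
    rng.normal(10.0, 0.5, size=(50, 2)),   # blob around (10, 10)
])
centers, labels = kmeans(X, k=2)
```

No labels are used anywhere: the grouping emerges from distances alone, which is the sense in which the structure comes from x.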

Autoencoder objective: compression as structure

θ*, φ* = argmin_{θ,φ} 𝔼_x [ ‖x − g_φ(f_θ(x))‖² ]
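The simplest case of this objective has a closed form: when the encoder f and decoder g are both linear, the reconstruction error is minimized by projecting onto the top principal components. The sketch below uses that fact in place of gradient training; the helper names `pca_autoencoder` and `recon_error` and the synthetic data (a 2-D latent embedded in 5-D) are assumptions for illustration.

```python
import numpy as np

def pca_autoencoder(X, k):
    """Return (encode, decode) for the optimal rank-k linear autoencoder."""
    mu = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt[:k]                                   # shape (k, d)

    def encode(x):                               # plays the role of f
        return (x - mu) @ W.T

    def decode(z):                               # plays the role of g
        return z @ W + mu

    return encode, decode

def recon_error(X, encode, decode):
    """Empirical estimate of E_x ||x - g(f(x))||^2."""
    return float(np.mean(np.sum((X - decode(encode(X))) ** 2, axis=1)))

rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 2))                    # true 2-D latent factors
A = rng.normal(size=(2, 5))                      # embedding into 5-D
X = Z @ A + 0.01 * rng.normal(size=(300, 5))     # observations with small noise

errors = {k: recon_error(X, *pca_autoencoder(X, k)) for k in (1, 2, 3)}
```

The error should drop sharply once the code size reaches the true latent dimension (k = 2 here) and barely improve after that, which is what "compression as structure" means in practice.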

Self-supervision: labels from the data itself

Modern language models and vision transformers are trained on contrived "supervised" problems where the labels come for free: predict the next token, predict masked pixels, predict the rotation. This is unsupervised learning dressed as supervised — and it is what powers foundation models.
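A minimal sketch of how labels "come for free" in next-token prediction: the targets are sliced out of the raw sequence itself. The function name `next_token_pairs`, the whitespace tokenization, and the fixed context window are simplifying assumptions; real language models use learned tokenizers and much longer contexts.

```python
def next_token_pairs(tokens, context_size):
    """Turn an unlabeled token sequence into (context, target) training pairs."""
    return [
        (tuple(tokens[i:i + context_size]), tokens[i + context_size])
        for i in range(len(tokens) - context_size)
    ]

tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens, context_size=2)
# pairs[0] is (("the", "cat"), "sat"); there are 6 - 2 = 4 pairs in total.
```

Each pair can then be fed to an ordinary supervised learner, with no human annotation involved.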

Why it matters

Labels are expensive. A domain expert annotating motor current waveforms or CT scans costs real money per hour. Unsupervised methods extract value from the large majority of your data that will never be labeled, and can produce representations that transfer as well as, or better than, those learned with supervision.

K-Means · GMM · DBSCAN · PCA / ICA · Autoencoder / VAE · GAN · Diffusion · Normalizing Flows

III.

Reinforcement Learning

Learning by consequence

Reinforcement learning is the one paradigm that matches the structure of an agent. At each timestep the learner observes state s, picks action a according to its policy π(a|s), receives a scalar reward r, and transitions to a new state. The goal is to find the policy that maximizes expected cumulative reward — the return.

Bellman optimality: the core identity

Q*(s, a) = 𝔼[ r + γ · max_{a′} Q*(s′, a′) ]

That single equation underwrites most of RL. It says: the value of taking action a in state s equals the expected immediate reward plus the discounted value of acting optimally thereafter. The expectation is over the reward r and next state s′ that follow (s, a), and γ ∈ [0, 1) is the discount factor. Every algorithm below is, in some sense, a different numerical strategy for solving it.
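When the transition model is known and the state space is tiny, the Bellman optimality equation can be solved by repeatedly applying its right-hand side until Q stops changing. The sketch below does this for a made-up two-state, two-action MDP; the function name `q_value_iteration`, the transition tensor `P`, and the reward table `R` are all illustrative assumptions.

```python
import numpy as np

def q_value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Fixed-point iteration on Q(s,a) = R(s,a) + gamma * E[max_a' Q(s',a')].

    P has shape (S, A, S) with P[s, a, s'] the transition probability;
    R has shape (S, A) with the expected immediate reward.
    """
    Q = np.zeros_like(R, dtype=float)
    for _ in range(max_iters):
        # Bellman backup: value of the best action in every next state,
        # averaged over where each (s, a) pair leads.
        Q_new = R + gamma * (P @ Q.max(axis=1))
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q

# Hypothetical MDP: action 0 always leads to state 0, action 1 to state 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 0] = 1.0
P[0, 1, 1] = P[1, 1, 1] = 1.0
R = np.array([[0.0, 1.0],     # rewards in state 0 for actions 0 and 1
              [0.0, 2.0]])    # rewards in state 1 for actions 0 and 1

Q_star = q_value_iteration(P, R, gamma=0.9)
policy = Q_star.argmax(axis=1)
# Q_star is approximately [[17.1, 19.0], [17.1, 20.0]]; policy is [1, 1].
```

The numbers can be checked by hand: staying in state 1 forever earns 2 / (1 − 0.9) = 20, and moving there from state 0 earns 1 + 0.9 · 20 = 19.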

Three algorithmic lineages

Value-based methods (Q-learning, DQN, Rainbow) estimate Q* directly and derive the policy as the argmax. They dominate discrete-action problems. Policy-gradient methods (REINFORCE, TRPO, PPO) parameterize π directly and ascend ∇_θJ(π_θ). Actor-critic (A3C, SAC, TD3) learns both — a policy (actor) and a value function (critic) — using the critic to reduce the variance of policy-gradient estimates.

The exploration problem

RL is hard because the agent only sees rewards for actions it takes. To discover that a given action is good, it must first try it. Balancing exploration (try new things) against exploitation (do what already works) is a recurring theme — addressed by ε-greedy, Boltzmann sampling, entropy bonuses, intrinsic motivation, and a hundred other tricks.
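The simplest of these tricks, ε-greedy, fits in a few lines. This is a minimal sketch: the function name `epsilon_greedy` and the example Q-values are assumptions, and practical implementations usually decay ε over training.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore uniformly; otherwise exploit the argmax."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: current best estimate

rng = np.random.default_rng(0)
q = [0.1, 0.9, 0.3]

greedy_action = epsilon_greedy(q, epsilon=0.0, rng=rng)   # always the argmax
explored = {epsilon_greedy(q, epsilon=1.0, rng=rng) for _ in range(200)}
```

With ε = 0 the agent never tries actions 0 or 2, so it can never learn that their estimates are wrong; with ε = 1 it tries everything but never uses what it has learned.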

Why it's painful

Unlike supervised learning, there is no i.i.d. dataset sitting on disk. The agent generates its own training data through interaction, and that distribution shifts as the policy improves. Sample efficiency is poor, credit assignment over long horizons is brutal, and reward design is its own dark art.

Q-Learning · DQN · REINFORCE · PPO · SAC · TD3 · A3C / A2C · MCTS · Model-based (Dreamer, MuZero)

IV.

Imitation Learning

When the reward is unknown, copy the expert

Rewards are hard. Try to hand-write a reward function for "drive like a human" — safe, comfortable, polite, not too slow, not too fast, yields correctly at a four-way stop. You will fail, and the agent will learn some pathological loophole you didn't anticipate. Imitation learning sidesteps this by replacing the reward with demonstrations from an expert.

Three flavors, increasing in sophistication

Behavioral Cloning (BC) treats the demonstration dataset {(s, a*)} as a supervised learning problem: predict the expert action from the state. Simple, fast, and fragile — at deployment the policy encounters states slightly off the expert's trajectory and compounds errors (the covariate shift problem).

Inverse Reinforcement Learning (IRL) is more ambitious: infer the reward function that would make the expert's behavior optimal, then run standard RL against that inferred reward. It recovers the intent behind the demos, not just the action.

Adversarial imitation (GAIL, AIRL) frames imitation as a two-player game: a discriminator tries to tell apart expert and agent trajectories, and the policy is trained to fool it. This inherits the strengths of GANs — and their instability.

Behavioral cloning: the simplest version

π* = argmin_π 𝔼_{(s,a*)∼𝒟_expert} [ ℒ( π(s), a* ) ]
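With a squared-error loss and a linear policy, the behavioral cloning objective reduces to ordinary least squares on the demonstration set. The sketch below assumes a hypothetical linear expert (`K_expert`) and synthetic states; neither comes from a real system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expert: a linear feedback law a* = K s, recorded with small noise.
K_expert = np.array([[-1.0, -0.5]])
states = rng.normal(size=(500, 2))
actions = states @ K_expert.T + 0.01 * rng.normal(size=(500, 1))

# Behavioral cloning with squared loss is a least-squares regression
# from states to expert actions.
solution, *_ = np.linalg.lstsq(states, actions, rcond=None)
K_bc = solution.T

def policy(s):
    """Cloned policy: predict the expert's action for state(s) s."""
    return np.asarray(s, dtype=float) @ K_bc.T

action = policy([1.0, 2.0])   # expert would output -1*1 - 0.5*2 = -2
```

The fit is only trustworthy on states resembling those in `states`; nothing in the objective constrains the policy elsewhere, which is where the compounding-error problem comes from.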

Where imitation beats RL

When experts exist and are easier to record than to simulate, imitation can cut training time by orders of magnitude. Self-driving fleets harvest human driving data continuously; surgical robots are seeded with recordings of expert surgeons; robot manipulation uses teleoperation to bootstrap policies before fine-tuning with RL.

V.

Transfer Learning

Reusing what was already learned

Transfer learning is the field's answer to a practical observation: training a deep model from scratch on your specific problem is usually wasteful. If someone has already trained a model on a related, data-rich source task, their learned representations are likely useful starting points for yours. Transfer is less a single algorithm and more a doctrine — don't start from zero if you don't have to.

A spectrum of techniques

Feature extraction treats the pretrained network as a frozen encoder and trains only a new head on top. Fast, cheap, and the right choice when the target dataset is small. Fine-tuning unfreezes some or all of the pretrained weights and continues training on the target task — giving the model room to specialize. Domain adaptation aligns representations across distributions when you have lots of labeled source data and unlabeled (or scarce-labeled) target data.
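Feature extraction can be sketched without a deep learning framework. In the example below a fixed random nonlinear map stands in for the pretrained encoder (a loud assumption: a real encoder would carry useful learned features), and only a linear head is fit on the target data, here by ridge regression. The names `W_frozen`, `encoder`, and `head` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pretrained weights; never updated below.
W_frozen = rng.normal(size=(4, 32))
W_snapshot = W_frozen.copy()

def encoder(x):
    """Frozen feature extractor: inputs (n, 4) -> features (n, 32)."""
    return np.tanh(x @ W_frozen)

# Small target-task dataset (synthetic).
X = rng.normal(size=(200, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

# Train only the new head: ridge regression on the frozen features.
features = encoder(X)
lam = 1e-3
head = np.linalg.solve(
    features.T @ features + lam * np.eye(features.shape[1]),
    features.T @ y,
)

train_mse = float(np.mean((features @ head - y) ** 2))
```

Fine-tuning would differ in exactly one respect: `W_frozen` would also receive gradient updates, usually with a small learning rate.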

Sim-to-real transfer is the robotics special case: train a policy in simulation where data is free, then deploy on hardware. Parameter-efficient fine-tuning (LoRA, adapters, prompt tuning) is the LLM special case: adapt enormous pretrained models with a fraction of the parameters.
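The "fraction of the parameters" claim is easy to see in a sketch of the low-rank idea behind LoRA: the pretrained weight matrix W stays frozen and a trainable low-rank product B·A is added to it. The layer size, the rank, and the function name `lora_forward` are assumptions for illustration, and the scaling factor used in practice is omitted.

```python
import numpy as np

d, r = 512, 8                                  # layer width and adapter rank (illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))                    # pretrained weights, kept frozen
A = rng.normal(scale=0.01, size=(r, d))        # trainable down-projection
B = np.zeros((d, r))                           # trainable up-projection, zero-initialized

def lora_forward(x):
    """Compute x W^T + x (B A)^T without ever forming the d x d update."""
    return x @ W.T + (x @ A.T) @ B.T

full_params = W.size                           # 512 * 512 = 262144
lora_params = A.size + B.size                  # 2 * 512 * 8 = 8192
ratio = lora_params / full_params              # 0.03125, about 3%

x = rng.normal(size=(3, d))
unchanged_at_init = np.allclose(lora_forward(x), x @ W.T)
```

Because B starts at zero, the adapted layer initially reproduces the pretrained one exactly, so training starts from the pretrained behavior rather than from a perturbed copy of it.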

Foundation models, in a sentence

The modern story of AI is transfer learning at planetary scale. A large model (GPT, Llama, Claude, CLIP, SAM) is pretrained once on a very broad corpus, then every downstream task (code, translation, classification, agentic control) is approached by adapting that pretrained model rather than training from scratch. The economics of this approach have reshaped the entire field.

ImageNet Pretraining · Fine-tuning · LoRA / Adapters · Domain Adversarial Training · CLIP / SAM · Sim-to-Real · Meta-learning (MAML)

Chapter VI — A Critical Distinction

Online vs Offline RL: who gets to touch the environment?

The line that matters most in practical reinforcement learning isn't algorithmic — it's about data access. Can the agent interact with the real environment during training, or must it learn entirely from a fixed, pre-collected dataset? The answer reshapes every design choice that follows.

Regime A

Online RL

The classical setting. The agent acts in the environment, observes consequences, and uses fresh experience to update its policy — repeatedly, for millions of steps.

Data: Generated on the fly by current policy

Feedback: Immediate; closed loop

Cost: Low if simulator exists; prohibitive otherwise

Risk: Exploration can damage hardware or people

Algorithms: PPO, SAC, TD3, DQN, A3C

Best for: Games, simulated robotics, digital systems with cheap rollouts

Regime B

Offline RL

Also called batch RL. The agent is given a fixed dataset of logged transitions — collected by some prior policy, possibly a human — and must extract the best policy it can without any further interaction.

Data: Fixed dataset, no new rollouts permitted

Feedback: None during training

Cost: One-time data collection; reuses existing logs

Risk: Distributional shift; policy drifts off-support

Algorithms: CQL, BCQ, IQL, TD3+BC, AWAC

Best for: Healthcare, autonomous driving, industrial control

§ The distributional shift problem

In offline RL, naively applying online algorithms like DQN or SAC tends to fail badly. The learned Q-function becomes optimistic about actions that weren't in the dataset, because no counter-evidence exists. The policy picks those phantom-optimal actions, and at deployment produces nonsense. Modern offline algorithms (CQL, IQL, BCQ) counter this by penalizing the values of out-of-distribution actions, avoiding value queries on them, or constraining the policy to stay close to the data-generating distribution.
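The failure and the "stay on the data" remedy can be shown on a toy table. This is a deliberately simplified illustration of the support-constraint idea, not an implementation of CQL, IQL, or BCQ; the Q-values, the visit counts, and the function name `constrained_greedy` are all made up for the example.

```python
import numpy as np

def constrained_greedy(Q, counts, min_count=1):
    """Greedy policy restricted to actions seen at least `min_count` times.

    Unseen actions are masked to -inf so the argmax cannot select them.
    (A state with no observed actions would fall back to action 0.)
    """
    masked = np.where(counts >= min_count, Q, -np.inf)
    return masked.argmax(axis=1)

# Learned Q-table for 2 states x 3 actions. The 5.0 in state 0 belongs to an
# action that never appears in the dataset, so it is pure extrapolation.
Q = np.array([[1.0, 5.0, 0.5],
              [0.2, 0.1, 0.3]])

# How often each (state, action) pair occurs in the logged data.
counts = np.array([[10, 0, 4],
                   [3, 7, 0]])

naive_policy = Q.argmax(axis=1)             # [1, 2]: both picks are unseen actions
safe_policy = constrained_greedy(Q, counts) # [0, 0]: best among observed actions
```

The naive policy chooses exactly the actions for which the data offers no evidence, which is the phantom-optimal behavior described above.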

§ Choosing between them

Use online RL when a fast, accurate simulator exists or the environment is a digital system where mistakes are cheap. Use offline RL when interaction is expensive, slow, or dangerous — a car on a public road, a patient on a drug-dosing protocol, a chemical plant in operation. Many real deployments are hybrid: pretrain offline on logs, fine-tune online once a safe baseline is established.

A comparison, at a glance.

Fig. 02 — the paradigms on six axes

Paradigm	Input signal	Learning goal	Data needed	Failure mode	Canonical win
Supervised	(x, y) pairs	Predict y from x	Labeled & large	Distribution shift at deployment	Image classification, forecasting
Unsupervised	x only	Discover structure in x	Unlabeled & large	Structure found is not useful	Foundation model pretraining
Reinforcement	State, action, reward	Maximize expected return	Environment access	Reward hacking, poor sample efficiency	Game-playing, adaptive control
Imitation	Expert demonstrations	Reproduce expert behavior	Demo trajectories	Covariate shift, compounding errors	Autonomous driving, teleop robots
Transfer	Source + target task data	Adapt prior knowledge	Pretrained model + small target set	Negative transfer when source differs	LLM fine-tuning, sim-to-real
Online RL	Live interaction	Optimize via exploration	Fast, safe environment	Exploration damages hardware	Simulated robotics, games
Offline RL	Logged transitions	Policy from fixed data	Large behavioral dataset	Distributional shift, OOD actions	Healthcare, industrial control