Field Guide № 01
Paradigms of Machine Learning: A Technical Primer
Learning from Data — A Taxonomy

The shapes of learning,
from labels to agency.

Six paradigms define how machines acquire competence: from memorizing labeled examples, to discovering latent structure, to acting under consequence. This primer lays them out side-by-side — their mechanisms, their mathematics, and the problems each was born to solve.

By Majid Mazouchi

Paradigms: 06
Core Equations: Included
Application Domains: 30+
Level: Intermediate
Ch. 01 Supervised Learning
Learning f : X → Y from labeled pairs.
Ch. 02 Unsupervised Learning
Discovering structure without labels.
Ch. 03 Reinforcement Learning
Optimizing policies through reward.
Ch. 04 Imitation Learning
Behavior shaped by expert demonstration.
Ch. 05 Transfer Learning
Reusing knowledge across tasks and domains.
Ch. 06 Online vs Offline RL
Interaction versus static data regimes.
The Landscape

One discipline, many signals.

[Figure: taxonomy of machine learning. Branches: Supervised (Classification, Regression); Unsupervised (Clustering, Gen. Models); Reinforcement (Value-based, Policy-based); Imitation (BC / IRL / GAIL); Transfer (Fine-tune / Domain Adapt.). RL branches into Online vs Offline (batch): environment access or fixed dataset.]
I.

Supervised Learning

Learning from labeled examples

Supervised learning is the best-understood corner of the field. Given a dataset of input-output pairs {(xᵢ, yᵢ)}, the learner searches for a function f : X → Y that minimizes the expected discrepancy between predicted and true labels. Every image classifier, spam filter, and neural flux predictor you encounter is, underneath, solving this same problem.

Objective: $\theta^* = \arg\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \mathcal{L}\big( f_\theta(x),\, y \big) \right]$

The choice of loss ℒ encodes what we mean by "good." Mean squared error for regression corresponds to maximum likelihood under Gaussian noise around the truth; cross-entropy for classification is the negative log-likelihood of the true label under a categorical distribution whose probabilities the model predicts. Pick the wrong one and you're optimizing the wrong problem.
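As a minimal sketch, the two losses named above can be written in a few lines of NumPy. The function names and the toy inputs are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Mean squared error: the average of squared residuals."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

def cross_entropy_loss(logits, labels):
    """Mean negative log-likelihood of integer labels under softmax(logits)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Subtract the row max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))

# Toy inputs (made-up numbers).
regression_loss = mse_loss([2.0, 0.0], [1.0, 2.0])    # (1 + 4) / 2
uniform_loss = cross_entropy_loss([[0.0, 0.0]], [0])  # log(2) for a 50/50 guess
```

Swapping one function for the other changes what the optimizer is rewarded for, which is the point of the paragraph above.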

Two regimes: classification & regression

Classification predicts discrete labels — is this cell malignant, is this transaction fraud, which of ten digits is drawn. Regression predicts continuous values — tomorrow's temperature, a motor's flux linkage at a given (id, iq) operating point. The machinery is shared; only the output head and loss change.

The bias-variance bargain

Supervised learning succeeds when three conditions hold: the hypothesis class is expressive enough to contain a good f, there is enough data to identify it, and the test distribution resembles the training distribution. The first two trade off against each other: a more expressive hypothesis class lowers bias but raises variance, so it needs more data to pin down f. Violating the third (distribution shift) is how most supervised systems fail silently in production.
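A small, hedged illustration of the third condition: fit a straight line to sin(x) on [0, 1], then evaluate it both inside that range and on [2, 3]. The target function, the ranges, and the variable names are all assumptions chosen for the demonstration.

```python
import numpy as np

def target(x):
    """Ground-truth function (an assumption for this demo)."""
    return np.sin(x)

# Train a degree-1 polynomial on inputs drawn from [0, 1] only.
x_train = np.linspace(0.0, 1.0, 50)
slope, intercept = np.polyfit(x_train, target(x_train), deg=1)

def predict(x):
    return slope * x + intercept

def mse(x):
    return float(np.mean((predict(x) - target(x)) ** 2))

# Same model, two test sets: one like the training data, one shifted.
in_dist_error = mse(np.linspace(0.01, 0.99, 40))
shifted_error = mse(np.linspace(2.0, 3.0, 40))
```

The model is unchanged between the two evaluations; only the input distribution moves, and nothing in the training loss warns you about it.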

Linear / Logistic Regression, SVM, Random Forest, Gradient Boosting, MLP, CNN, Transformer
II.

Unsupervised Learning

Structure without supervision

Strip away the labels and a harder question remains: what does this data, on its own terms, want to tell us? Unsupervised learning asks the model to discover structure — clusters, manifolds, latent factors, densities — from x alone. The honest label for much of "AI" today is unsupervised or self-supervised: the internet is unlabeled, and we train on it anyway.

Four families

Clustering (K-means, GMM, DBSCAN) groups points by similarity. Dimensionality reduction (PCA, t-SNE, UMAP) finds low-dimensional coordinates for high-dimensional points. Density estimation (KDE, normalizing flows) models p(x) directly. Generative modeling (VAE, GAN, diffusion) goes one step further — it learns to sample from p(x), producing new data that looks like the old.
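To make the clustering family concrete, here is a minimal sketch of K-means (Lloyd's algorithm) in NumPy. The function signature, the explicit initial centroids, and the toy points are assumptions for illustration; production code would normally use a library implementation with better initialization.

```python
import numpy as np

def kmeans(points, init_centroids, n_iters=20):
    """Alternate between assigning points and recomputing centroids."""
    points = np.asarray(points, dtype=float)
    centroids = np.array(init_centroids, dtype=float)

    def assign(cents):
        # Squared distance from every point to every centroid.
        dists = ((points[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iters):
        labels = assign(centroids)
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return centroids, assign(centroids)

# Two well-separated toy blobs (made-up data).
toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centroids, labels = kmeans(toy, init_centroids=toy[[0, 3]])
```

No labels appear anywhere in the input; the grouping comes entirely from distances between points.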

Autoencoder objective (compression as structure): $\theta^*, \phi^* = \arg\min_{\theta,\phi} \; \mathbb{E}_{x} \left[ \lVert x - g_\phi(f_\theta(x)) \rVert^2 \right]$
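When the encoder f and decoder g are restricted to linear maps and the data is centered, this objective is minimized by projecting onto the top principal components, so PCA gives a closed-form special case. The sketch below assumes synthetic data lying near a line in 3-D; `fit_linear_autoencoder` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 points in 3-D close to a 1-D line, then centered.
latent = rng.normal(size=(200, 1))
direction = np.array([[1.0, 2.0, -1.0]])
data = latent @ direction + 0.01 * rng.normal(size=(200, 3))
data = data - data.mean(axis=0)

def fit_linear_autoencoder(x, code_dim):
    """Return (encode, decode) built from the top right-singular vectors of x."""
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    components = vt[:code_dim]  # shape: (code_dim, n_features)

    def encode(v):
        return v @ components.T

    def decode(z):
        return z @ components

    return encode, decode

encode, decode = fit_linear_autoencoder(data, code_dim=1)
codes = encode(data)
reconstruction_error = float(np.mean(np.sum((data - decode(codes)) ** 2, axis=1)))
```

A one-number code reconstructs three-number inputs almost perfectly here because the data really is close to one-dimensional; that recovered direction is the "structure."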

Self-supervision: labels from the data itself

Modern language models and vision transformers are trained on contrived "supervised" problems where the labels come for free: predict the next token, predict masked pixels, predict the rotation. This is unsupervised learning dressed as supervised — and it is what powers foundation models.
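The "labels come for free" idea can be sketched for next-token prediction: slide a window over a sequence and use the following token as the target. The helper name and the toy sentence are assumptions for illustration.

```python
def next_token_pairs(tokens, context_size):
    """Turn an unlabeled token sequence into (context, target) training pairs."""
    pairs = []
    for i in range(len(tokens) - context_size):
        context = tokens[i : i + context_size]
        target = tokens[i + context_size]  # the label is just the next token
        pairs.append((context, target))
    return pairs

pairs = next_token_pairs(["the", "cat", "sat", "on", "the", "mat"], context_size=2)
```

No annotator was involved: every target is already present in the raw sequence.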

Why it matters

Labels are expensive. A domain expert annotating motor current waveforms or CT scans costs real money per hour. Unsupervised methods extract value from the large share of your data that will never be labeled, and the representations they produce can transfer better than ones trained with supervision.

K-Means, GMM, DBSCAN, PCA / ICA, Autoencoder / VAE, GAN, Diffusion, Normalizing Flows
§   §   §
III.

Reinforcement Learning

Learning by consequence

Reinforcement learning is the one paradigm that matches the structure of an agent. At each timestep the learner observes state s, picks action a according to its policy π(a|s), receives a scalar reward r, and transitions to a new state. The goal is to find the policy that maximizes expected cumulative reward — the return.

Bellman optimality (the core identity): $Q^*(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s', a') \right]$

That single equation underwrites most of RL. It says: the value of taking action a in state s equals the immediate reward plus the discounted value of acting optimally thereafter. Every algorithm below is, in some sense, a different numerical strategy for solving it.
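One direct numerical strategy, when the transition and reward tables are known, is to apply the right-hand side of the identity repeatedly until Q stops changing. The three-state deterministic chain below is a made-up example (state 2 is an absorbing terminal state), so the expectation reduces to a table lookup.

```python
import numpy as np

gamma = 0.9
# next_state[s, a] and reward[s, a] for actions 0 (back/stay) and 1 (forward).
next_state = np.array([[0, 1], [0, 2], [2, 2]])
reward = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

q = np.zeros((3, 2))
for _ in range(50):
    # Bellman optimality backup: r + gamma * max_a' Q(s', a').
    q = reward + gamma * q[next_state].max(axis=-1)

greedy_policy = q.argmax(axis=1)
```

In state 1, moving forward earns the reward immediately (Q = 1.0); in state 0 the same reward is one step away, so its value is discounted once (Q = 0.9).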

Three algorithmic lineages

Value-based methods (Q-learning, DQN, Rainbow) estimate Q* directly and derive the policy as the argmax. They dominate discrete-action problems. Policy-gradient methods (REINFORCE, TRPO, PPO) parameterize π directly and ascend ∇θJ(πθ). Actor-critic (A3C, SAC, TD3) learns both — a policy (actor) and a value function (critic) — using the critic to reduce the variance of policy-gradient estimates.
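For the policy-gradient lineage, a minimal sketch of REINFORCE on a two-armed bandit shows the mechanics: sample an action, then nudge the policy parameters along reward × ∇ log π. The bandit, the softmax parameterization, and the hyperparameters are assumptions for illustration; there is no baseline or critic here.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_bandit(arm_rewards, steps=500, lr=0.1, seed=0):
    """REINFORCE with a softmax policy over arms with deterministic rewards."""
    rng = np.random.default_rng(seed)
    logits = np.zeros(len(arm_rewards))
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choice(len(arm_rewards), p=probs)
        reward = arm_rewards[action]
        # Gradient of log pi(action) w.r.t. logits: one_hot(action) - probs.
        grad_log_prob = -probs
        grad_log_prob[action] += 1.0
        logits += lr * reward * grad_log_prob
    return softmax(logits)

final_probs = reinforce_bandit(arm_rewards=[0.0, 1.0])
```

The policy never estimates a value table; probability mass simply drifts toward the arm that pays. An actor-critic method would subtract a learned baseline from `reward` to reduce the variance of this update.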

The exploration problem

RL is hard because the agent only sees rewards for actions it takes. To discover that a given action is good, it must first try it. Balancing exploration (try new things) against exploitation (do what already works) is a recurring theme — addressed by ε-greedy, Boltzmann sampling, entropy bonuses, intrinsic motivation, and a hundred other tricks.
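The simplest of these tricks, ε-greedy, fits in a few lines: with probability ε pick a uniformly random action, otherwise pick the action with the highest current estimate. The function name and the toy Q-values are assumptions for illustration.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Explore uniformly with probability epsilon, otherwise exploit the argmax."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2])

greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)  # pure exploitation
explored = {epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(200)}  # pure exploration
```

In practice ε is often decayed over training, so the agent explores heavily early on and exploits more as its estimates improve.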

Why it's painful

Unlike supervised learning, there is no i.i.d. dataset sitting on disk. The agent generates its own training data through interaction, and that distribution shifts as the policy improves. Sample efficiency is poor, credit assignment over long horizons is brutal, and reward design is its own dark art.

Q-Learning, DQN, REINFORCE, PPO, SAC, TD3, A3C / A2C, MCTS, Model-based (Dreamer, MuZero)
IV.

Imitation Learning

When the reward is unknown, copy the expert

Rewards are hard. Try to hand-write a reward function for "drive like a human" — safe, comfortable, polite, not too slow, not too fast, yields correctly at a four-way stop. You will fail, and the agent will learn some pathological loophole you didn't anticipate. Imitation learning sidesteps this by replacing the reward with demonstrations from an expert.

Three flavors, increasing in sophistication

Behavioral Cloning (BC) treats the demonstration dataset {(s, a*)} as a supervised learning problem: predict the expert action from the state. Simple, fast, and fragile — at deployment the policy encounters states slightly off the expert's trajectory and compounds errors (the covariate shift problem).

Inverse Reinforcement Learning (IRL) is more ambitious: infer the reward function that would make the expert's behavior optimal, then run standard RL against that inferred reward. It recovers the intent behind the demos, not just the action.

Adversarial imitation (GAIL, AIRL) frames imitation as a two-player game: a discriminator tries to tell apart expert and agent trajectories, and the policy is trained to fool it. This inherits the strengths of GANs — and their instability.

Behavioral cloning (the simplest version): $\pi^* = \arg\min_{\pi} \; \mathbb{E}_{(s,a^*)\sim\mathcal{D}_{\text{expert}}} \left[ \mathcal{L}\big( \pi(s),\, a^* \big) \right]$
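With a linear policy and squared-error loss, this objective is ordinary least squares on the demonstration pairs. The sketch below assumes a hypothetical linear expert (a = K s) and synthetic states, purely to show that BC is supervised regression from state to action.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expert: a linear feedback law a* = K s (an assumption for the demo).
expert_gain = np.array([[-1.0, -0.5]])
states = rng.normal(size=(500, 2))
expert_actions = states @ expert_gain.T  # shape: (500, 1)

# Behavioral cloning = regress expert actions on states.
learned_gain, *_ = np.linalg.lstsq(states, expert_actions, rcond=None)

def policy(state):
    """Cloned policy: predicts the expert's action for a given state."""
    return np.asarray(state, dtype=float) @ learned_gain

cloned_action = policy([1.0, 2.0])
```

The fit is only checked on states the expert visited. Once the cloned policy drifts into states outside that set, nothing in this objective constrains its behavior, which is the covariate shift problem described above.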

Where imitation beats RL

When experts exist and are easier to record than to simulate, imitation can cut training time by orders of magnitude. Self-driving fleets harvest human driving data continuously; surgical robots are seeded with recordings of expert surgeons; robot manipulation uses teleoperation to bootstrap policies before fine-tuning with RL.

V.

Transfer Learning

Reusing what was already learned

Transfer learning is the field's answer to a practical observation: training a deep model from scratch on your specific problem is usually wasteful. If someone has already trained a model on a related, data-rich source task, their learned representations are likely useful starting points for yours. Transfer is less a single algorithm and more a doctrine — don't start from zero if you don't have to.

A spectrum of techniques

Feature extraction treats the pretrained network as a frozen encoder and trains only a new head on top. Fast, cheap, and the right choice when the target dataset is small. Fine-tuning unfreezes some or all of the pretrained weights and continues training on the target task — giving the model room to specialize. Domain adaptation aligns representations across distributions when you have lots of labeled source data and unlabeled (or scarce-labeled) target data.
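Feature extraction can be sketched without a deep learning framework: treat a fixed nonlinear map as the "pretrained" encoder and fit only a linear head on its outputs. Here the encoder is a random projection with a ReLU, which is a stand-in assumption and not a real pretrained network; the target task is also made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pretrained weights; they are never updated below.
encoder_weights = rng.normal(size=(4, 16))

def frozen_encoder(x):
    """Fixed feature map: linear projection followed by ReLU."""
    return np.maximum(x @ encoder_weights, 0.0)

# Small target dataset (synthetic).
x_target = rng.normal(size=(30, 4))
y_target = np.sin(x_target[:, :1])

# Train only the new head, by least squares on the frozen features.
features = frozen_encoder(x_target)
head_weights, *_ = np.linalg.lstsq(features, y_target, rcond=None)

head_mse = float(np.mean((features @ head_weights - y_target) ** 2))
baseline_mse = float(np.mean(y_target ** 2))  # error of an all-zero head
```

Fine-tuning would differ in exactly one respect: `encoder_weights` would also receive gradient updates on the target task.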

Sim-to-real transfer is the robotics special case: train a policy in simulation where data is free, then deploy on hardware. Parameter-efficient fine-tuning (LoRA, adapters, prompt tuning) is the LLM special case: adapt enormous pretrained models with a fraction of the parameters.
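The parameter savings behind LoRA come from replacing a full weight update with a low-rank product: the frozen weight W is used as-is, and only two thin matrices B and A are trained, so the effective weight is W + BA. The sketch below uses made-up layer sizes and omits the scaling factor that LoRA implementations typically apply to the low-rank term.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4  # illustrative sizes

frozen_weight = rng.normal(size=(d_out, d_in))      # pretrained, not updated
lora_a = rng.normal(scale=0.01, size=(rank, d_in))  # trainable, small random init
lora_b = np.zeros((d_out, rank))                    # trainable, zero init

def lora_forward(x):
    """Compute x @ (W + B A)^T without ever forming the full update matrix."""
    return x @ frozen_weight.T + (x @ lora_a.T) @ lora_b.T

full_params = frozen_weight.size          # parameters touched by full fine-tuning
lora_params = lora_a.size + lora_b.size   # parameters touched by the adapter
```

Because B starts at zero, the adapted layer initially reproduces the pretrained layer exactly; training then moves only the 512 adapter parameters instead of all 4096.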

Foundation models, in a sentence

The modern story of AI is transfer learning at planetary scale. A single giant model (GPT, Llama, Claude, CLIP, SAM) is pretrained once on a very large, broad corpus, then every downstream task (code, translation, classification, agentic control) is approached by adapting that pretrained model rather than training fresh. The economics of this approach have reshaped the entire field.

ImageNet Pretraining, Fine-tuning, LoRA / Adapters, Domain Adversarial Training, CLIP / SAM, Sim-to-Real, Meta-learning (MAML)
Chapter VI — A Critical Distinction

Online vs Offline RL: who gets to touch the environment?

The line that matters most in practical reinforcement learning isn't algorithmic — it's about data access. Can the agent interact with the real environment during training, or must it learn entirely from a fixed, pre-collected dataset? The answer reshapes every design choice that follows.

Regime A

Online RL

The classical setting. The agent acts in the environment, observes consequences, and uses fresh experience to update its policy — repeatedly, for millions of steps.

Data: Generated on the fly by current policy
Feedback: Immediate; closed loop
Cost: Low if simulator exists; prohibitive otherwise
Risk: Exploration can damage hardware or people
Algorithms: PPO, SAC, TD3, DQN, A3C
Best for: Games, simulated robotics, digital systems with cheap rollouts
Regime B

Offline RL

Also called batch RL. The agent is given a fixed dataset of logged transitions — collected by some prior policy, possibly a human — and must extract the best policy it can without any further interaction.

Data: Fixed dataset, no new rollouts permitted
Feedback: None during training
Cost: One-time data collection; reuses existing logs
Risk: Distributional shift (policy drifts off-support)
Algorithms: CQL, BCQ, IQL, TD3+BC, AWAC
Best for: Healthcare, autonomous driving, industrial control

§ The distributional shift problem

In offline RL, naively applying online algorithms like DQN or SAC fails catastrophically. The learned Q-function becomes optimistic about actions that weren't in the dataset, because no counter-evidence exists. The policy picks those phantom-optimal actions, and at deployment it produces nonsense. Modern offline algorithms (CQL, IQL, BCQ) mitigate this by penalizing the values of out-of-distribution actions, by avoiding queries to them during the backup, or by constraining the policy to stay close to the data-generating distribution.
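The failure and one style of fix can be seen in a tiny tabular sketch. Two logged transitions cover action 0 only; action 1 is never observed, and its Q-values are initialized to a spuriously high 5.0 to stand in for function-approximation error (an assumption of this demo). The constrained variant restricts the max in the backup to actions that appear in the data, in the spirit of batch-constrained methods; it is not an implementation of BCQ, CQL, or IQL.

```python
import numpy as np

# Logged transitions: (state, action, reward, next_state, done).
dataset = [
    (0, 0, 0.0, 1, False),
    (1, 0, 1.0, 1, True),
]
n_states, n_actions, gamma = 2, 2, 0.9

# Which (state, action) pairs does the dataset actually cover?
seen = np.zeros((n_states, n_actions), dtype=bool)
for s, a, _, _, _ in dataset:
    seen[s, a] = True

def fitted_q(q_init, constrain, sweeps=50):
    """Repeated Bellman backups over the fixed dataset only."""
    q = q_init.copy()
    for _ in range(sweeps):
        for s, a, r, s_next, done in dataset:
            if done:
                q[s, a] = r
            else:
                # Constrained: max over observed actions; naive: max over all.
                # (Every next state here has at least one observed action.)
                next_values = q[s_next][seen[s_next]] if constrain else q[s_next]
                q[s, a] = r + gamma * next_values.max()
    return q

q_init = np.where(seen, 0.0, 5.0)  # unseen pairs carry a spurious high value
naive_q = fitted_q(q_init, constrain=False)
constrained_q = fitted_q(q_init, constrain=True)
```

The naive backup propagates the phantom value of the unseen action into Q(0, 0), giving 4.5; the constrained backup yields 0.9, which is what the logged data actually supports.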

§ Choosing between them

Use online RL when a fast, accurate simulator exists or the environment is a digital system where mistakes are cheap. Use offline RL when interaction is expensive, slow, or dangerous — a car on a public road, a patient on a drug-dosing protocol, a chemical plant in operation. Many real deployments are hybrid: pretrain offline on logs, fine-tune online once a safe baseline is established.

A comparison, at a glance.

Fig. 02 — the paradigms on six axes
Paradigm | Input signal | Learning goal | Data needed | Failure mode | Canonical win
Supervised | (x, y) pairs | Predict y from x | Labeled & large | Distribution shift at deployment | Image classification, forecasting
Unsupervised | x only | Discover structure in x | Unlabeled & large | Structure found is not useful | Foundation model pretraining
Reinforcement | State, action, reward | Maximize expected return | Environment access | Reward hacking, poor sample efficiency | Game-playing, adaptive control
Imitation | Expert demonstrations | Reproduce expert behavior | Demo trajectories | Covariate shift, compounding errors | Autonomous driving, teleop robots
Transfer | Source + target task data | Adapt prior knowledge | Pretrained model + small target set | Negative transfer when source differs | LLM fine-tuning, sim-to-real
Online RL | Live interaction | Optimize via exploration | Fast, safe environment | Exploration damages hardware | Simulated robotics, games
Offline RL | Logged transitions | Policy from fixed data | Large behavioral dataset | Distributional shift, OOD actions | Healthcare, industrial control