The Paradigms of Machine Learning

The way most introductions to machine learning are structured, you’d think it was one discipline. It isn’t. It’s six or seven disciplines that share a vocabulary and not much else. Learning a classifier from labels, finding clusters with no labels, learning to act in an environment that gives you reward signals, copying a human demonstrator, and adapting a model trained on one task to work on another are structurally different problems. Each is a different theory of what “learning” means.

This primer lays the paradigms out side by side. The goal is to give you a clear mental map: when someone says “we’re using deep learning here,” you should be able to ask which paradigm they mean, and that question should produce a clean answer.

What it covers

Six paradigm chapters plus comparison and context, about twenty minutes to read.

Supervised Learning. The default. Labels in, function out. Classification and regression. Why this paradigm gets all the attention even though it covers less than half of practical ML.

Unsupervised Learning. No labels, only data. Clustering, dimensionality reduction, generative modeling. The paradigm that quietly powers feature learning, anomaly detection, and the embedding-based recommendation systems we all use.

Reinforcement Learning. No labels, only consequences. Why this paradigm took decades to mature, why it works so well in games and so unevenly elsewhere, and the eight-post field guide on this site that goes much deeper.

Imitation Learning. Learning from demonstrations rather than labels. Behavioral cloning (BC), DAgger (Dataset Aggregation), GAIL (Generative Adversarial Imitation Learning). Why this paradigm became prominent again with the rise of robot learning.

Transfer Learning. Reusing what a model learned about one task to do another. The single most economically important paradigm in the LLM era — fine-tuning is transfer learning at scale.

Online vs Offline RL. A short detour into the subtlety that’s reshaping RL in the 2020s: does the agent get to interact with the environment, or is it stuck learning from a fixed dataset?

Plus a comparison section that lines up all six paradigms by data type, signal, application domain, and characteristic failure mode.

Read it

Open the primer →
