A Field Guide to Regression
Interactive Reference · Rev. 1.0 · 2026 · Methods · Data · Practice

Fitting lines to the world, honestly.

A practical tour of regression — what it is, what it isn't, which algorithm to reach for, how much data you really need, and what to do when your labels are wildly imbalanced. With live demos you can poke at.

12 Algorithms compared
2 Live demos
98:2 Imbalance, fully treated
7 Sections
01 / PRIMER

First, a necessary distinction.

A scenario worth opening with: you have data that's 98% one class and 2% the other, and you want a model to detect the rare class. That is almost certainly a classification problem, not a regression one — but the confusion is extremely common and entirely reasonable, because a workhorse algorithm for exactly that problem is called logistic regression. Let's untangle it.

Regression

Predicts a continuous number

  • Output is a real number on a continuum
  • "How many kWh will the motor draw?"
  • "What temperature will the stator reach?"
  • "How long until bearing failure?"
  • Loss functions: MSE, MAE, Huber
  • Evaluated with R², RMSE, MAE
Classification

Predicts a discrete category

  • Output is a class label (or probability of one)
  • "Is this fault a short circuit? Yes / No"
  • "Which of 5 failure modes is this?"
  • "Will this transaction be fraudulent?"
  • Loss functions: cross-entropy, hinge
  • Evaluated with accuracy, precision, recall, F1, AUC
The naming trap

Logistic regression is a classification algorithm. The word "regression" in the name refers to the fact that it regresses (fits) a linear model in log-odds space — but its output is a probability between 0 and 1, which you threshold into a class. If someone says "our regression model predicts churn / fraud / failure as yes/no," they mean logistic regression, and they're doing classification.
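
A minimal sketch of the point above: the linear part lives in log-odds space, a sigmoid maps it to a probability, and a threshold turns that probability into a class. The weights, bias, and input are made-up illustrative values, not a fitted model.

```python
import math

def sigmoid(z: float) -> float:
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    """Linear model in log-odds space, squashed to a probability."""
    log_odds = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(log_odds)

def predict_label(x, weights, bias, threshold=0.5):
    """Thresholding the probability is the step that makes this classification."""
    return int(predict_proba(x, weights, bias) >= threshold)

# Hypothetical weights and bias, chosen only for illustration.
weights, bias = [1.5, -2.0], -0.5
x = [2.0, 0.5]  # log-odds = -0.5 + 3.0 - 1.0 = 1.5

p = predict_proba(x, weights, bias)
label = predict_label(x, weights, bias)
print(f"P(class 1) = {p:.3f}, label = {label}")
```

Raising the threshold (say to 0.9) flips this example to class 0 without changing the fitted model, which is why threshold choice matters so much in §05.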

Cases like 98% negative / 2% positive are definitionally classification. True regression has no "positive" or "negative" — only values. Everything about the imbalance question — SMOTE, class weights, threshold tuning, precision-recall curves — lives on the classification side. We'll treat it fully in §05.

What regression actually is, formally

Regression is the problem of learning a function f: ℝⁿ → ℝ from observed pairs (xᵢ, yᵢ), so that ŷ = f(x) approximates the true, unknown relationship between inputs and a continuous response. The fit is governed by a loss function — most often squared error — and we want the learned f to generalize to inputs it hasn't seen.

Every regression method you'll meet is an answer to three questions, in some combination:

  1. What family of functions can f live in? (lines, polynomials, trees, splines, neural nets, Gaussian processes…)
  2. What does it mean to be close? (squared error, absolute error, quantile loss, robust losses…)
  3. How do we keep it from memorizing the training set? (L1/L2 penalties, early stopping, tree depth limits, dropout, Bayesian priors…)

Those three levers — function family, loss, regularization — are the entire game.

02 / TYPES

A taxonomy of regression problems.

Regression problems come in shapes. Knowing which shape you have before you pick an algorithm saves you a month of debugging.

| Problem shape | What it means | Typical example | What to reach for |
|---|---|---|---|
| Simple linear | One feature, approximately linear relationship | Efficiency vs. speed at fixed load | OLS; establishes baseline |
| Multiple linear | Many features, additive linear structure | Power consumption vs. speed, load, temp, humidity | OLS, Ridge if collinearity |
| Polynomial / basis | Smooth nonlinear in known form | Iron-loss vs. frequency (has known f and f² terms) | Polynomial regression, splines |
| General nonlinear | Unknown nonlinear shape, many features | Flux map ψ(id, iq) from measurements | Random Forest, Gradient Boosting, NN |
| Time series | Output depends on time / previous values | Stator temp over drive cycle | ARIMA, state-space, LSTM, TCN |
| Multi-output | Predict several continuous targets jointly | Predict id*, iq* for MTPA | Multi-task NN, Chained MOR, joint GP |
| Quantile | Predict a specific quantile, not the mean | P90 worst-case current for sizing | Quantile regression, quantile GBM |
| Count / rate | Response is a non-negative integer count | Faults per 1000 hours | Poisson / negative-binomial GLM |
| Survival | Time-to-event with censoring | Remaining useful life | Cox PH, AFT, DeepSurv |

Rule of thumb

Always start with the simplest model that could plausibly work. A linear model with well-engineered features beats a transformer on small tabular data about 90% of the time, and it tells you something about the world on the way there.

03 / ALGORITHMS

Twelve methods, honestly compared.

No algorithm dominates. Each is good at something, bad at something else. Here's what each one actually does, when to reach for it, and what breaks it.

Demo 01: Regression fit explorer (interactive)
Readouts: Train R², Test R², Train MSE
Controls: Model, Noise σ = 0.25, Regularization λ = 0.10, Training samples = 25

Watch: with Poly deg 15 and low noise, the curve snakes through every training point but fails wildly on test. That's overfitting — the classic failure mode that regularization is designed to fix. Turn up λ with Ridge or Lasso selected and watch the fit pull back toward sanity.

Linear · Parametric · Interpretable

Ordinary Least Squares

min ‖y − Xβ‖²

The granddaddy. A closed-form solution exists: β̂ = (XᵀX)⁻¹Xᵀy. Every prediction is a weighted sum of features. Coefficients have units and meaning.

Use when
Relationship is roughly linear, features roughly independent, you want interpretability.
Breaks when
Collinearity, nonlinearity, heavy outliers, or n < p.
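
To make the closed form concrete, here is a small NumPy sketch on synthetic data. The coefficients, noise level, and sample size are arbitrary assumptions for illustration. It solves the normal equations directly and compares against NumPy's least-squares solver, which is usually the safer choice numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X_raw = rng.normal(size=(n, 2))
true_beta = np.array([1.0, 2.0, -3.0])  # intercept, then two slopes (made up)

# Prepend a column of ones so the intercept is just another coefficient.
X = np.column_stack([np.ones(n), X_raw])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X^T X) beta = X^T y instead of forming the inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver: same answer here, better behaved when X^T X is ill-conditioned.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", beta_normal)
print("lstsq:           ", beta_lstsq)
```

Both estimates should land close to the made-up coefficients; they start to disagree only when features become nearly collinear.
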
Linear · L2 Reg.

Ridge Regression

min ‖y − Xβ‖² + λ‖β‖²

OLS with an L2 penalty on the coefficients. Stabilizes the fit under collinearity and when XᵀX is ill-conditioned. Shrinks coefficients toward zero but rarely makes them exactly zero.

Use when
Many correlated features, you want smooth shrinkage, n not much bigger than p.
Tuning
λ via cross-validation. Standardize features first.
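
A rough sketch of what the L2 penalty buys under collinearity, using the closed form β̂ = (XᵀX + λI)⁻¹Xᵀy on standardized features. The data are synthetic, and `ridge_fit` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge on standardized features; the intercept is not penalized."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    Xs = (X - mu) / sigma
    y_mean = y.mean()
    p = Xs.shape[1]
    beta = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ (y - y_mean))
    return beta, y_mean, mu, sigma

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a copy of x1: strong collinearity
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

beta_ols, *_ = ridge_fit(X, y, lam=0.0)     # lam = 0 reduces to OLS
beta_ridge, *_ = ridge_fit(X, y, lam=10.0)

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

With λ = 0 the two near-duplicate features can receive large, offsetting coefficients; the penalty pulls them toward a shared, smaller value.
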
Linear · L1 Reg. · Sparse

Lasso Regression

min ‖y − Xβ‖² + λ‖β‖₁

L1 penalty produces sparse solutions — it drives some coefficients to exactly zero. Doubles as a feature-selection method. Less stable than Ridge when features are highly correlated.

Use when
You suspect only a few features matter and want the model to tell you which.
Careful
Among correlated features, Lasso picks one and zeros the rest — which one is somewhat arbitrary.
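
One way to see where the exact zeros come from is cyclic coordinate descent with soft-thresholding. The sketch below is illustrative only and is not how any particular library implements Lasso. It assumes standardized columns and a centered target, and it scales the squared-error term by 1/(2n), so its λ is not directly comparable to the λ in the formula above.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; anything within [-t, t] becomes exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: add back feature j's current contribution.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta

rng = np.random.default_rng(5)
n = 200
X = rng.normal(size=(n, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize columns
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)
y = y - y.mean()                           # center the target

beta = lasso_cd(X, y, lam=0.3)
print(beta)  # only the first two features carry signal in this synthetic setup
```
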
Linear · L1+L2

Elastic Net

min ‖y − Xβ‖² + λ₁‖β‖₁ + λ₂‖β‖²

Convex combination of Lasso and Ridge. Gets sparsity from L1 and grouping / stability from L2. In practice, often the default choice for regularized linear regression on real data.

Use when
Correlated feature groups you want to select together, n < p scenarios.
Tuning
Two knobs: λ₁ and λ₂ in the formula above, usually re-expressed as a total strength λ and a mix ratio α between the L1 and L2 terms.
Linear in β · Basis expansion

Polynomial Regression

y = β₀ + β₁x + β₂x² + ... + βₖxᵏ

Still linear regression — just with x², x³, … as additional features. Higher degree = more flexibility, but degree > 3 is where overfitting lives. Almost always better to use splines instead.

Use when
You have strong prior that the shape is polynomial (e.g. physics).
Careful
Center and scale x before expanding, or you'll get brutal collinearity.
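
A quick way to see the collinearity problem is to compare the condition number of the polynomial design matrix before and after centering and scaling. The feature range below is a hypothetical example (think of a speed in rpm), and degree 5 is an arbitrary choice.

```python
import numpy as np

def design(x, degree):
    """Polynomial design matrix with columns x^0, x^1, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

x = np.linspace(1000.0, 1100.0, 50)  # hypothetical raw feature values
degree = 5

cond_raw = np.linalg.cond(design(x, degree))

# Center and scale first, then expand.
x_scaled = (x - x.mean()) / x.std()
cond_scaled = np.linalg.cond(design(x_scaled, degree))

print(f"condition number, raw x:    {cond_raw:.3e}")
print(f"condition number, scaled x: {cond_scaled:.3e}")
```

The raw powers are almost perfectly correlated and differ by many orders of magnitude, so the raw condition number is astronomically larger than the scaled one.
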
Nonlinear · Ensemble · Tabular king

Random Forest Regressor

ŷ = (1/T) Σ treeₜ(x)

Average of many decorrelated decision trees. Handles nonlinearity and interactions automatically. No scaling needed. Robust to outliers in x. Cannot extrapolate beyond training range.

Use when
Tabular data, nonlinear, you want a strong baseline with minimal tuning.
Hyperparams
n_estimators, max_depth, min_samples_leaf, max_features.
Nonlinear · Ensemble · SOTA tabular

Gradient Boosting (XGBoost / LightGBM)

Fₘ(x) = Fₘ₋₁(x) + η · hₘ(x)

Trees built sequentially, each correcting the residuals of the previous. Typically the winner on tabular Kaggle. Very sensitive to hyperparameters but rewards careful tuning.

Use when
Tabular data & you need the best accuracy you can get.
Hyperparams
Learning rate, depth, # rounds, row/col subsampling, L1/L2 on leaves.
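
The update Fₘ(x) = Fₘ₋₁(x) + η · hₘ(x) can be sketched in a few lines with one-split stumps fitted to the current residuals. This is a toy under squared-error loss on a single feature, not what XGBoost or LightGBM do internally; `fit_stump`, the learning rate, and the round count are assumptions for illustration.

```python
import numpy as np

def fit_stump(x, r):
    """Best single-split regression stump on 1-D x for residuals r (squared error)."""
    best = None
    for s in np.unique(x)[:-1]:          # skip the max so the right side is never empty
        left, right = r[x <= s], r[x > s]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    _, s, left_value, right_value = best
    return lambda q: np.where(q <= s, left_value, right_value)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0.0, 6.0, 200))
y = np.sin(x) + rng.normal(scale=0.1, size=200)

eta, n_rounds = 0.1, 100
F = np.full_like(y, y.mean())            # F_0: constant prediction
mse_history = [np.mean((y - F) ** 2)]

for m in range(n_rounds):
    h = fit_stump(x, y - F)              # h_m is fitted to the current residuals
    F = F + eta * h(x)                   # F_m = F_{m-1} + eta * h_m
    mse_history.append(np.mean((y - F) ** 2))

print(f"train MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Training error falls every round, which is exactly why boosting needs a validation set and early stopping: the training curve alone never tells you when to stop.
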
Nonlinear · Kernel

Support Vector Regression

min ½‖w‖² + C · Σ max(0, |yᵢ−ŷᵢ|−ε)

Kernel method that fits a tube of width ε around the function — predictions inside the tube incur no loss. With RBF kernel, handles smooth nonlinearity. Scales poorly past ~10k samples.

Use when
Smooth function, moderate dataset size, you want a principled nonlinear fit.
Careful
Always standardize inputs. Three knobs: C, ε, kernel params.
Nonlinear · Bayesian · Uncertainty

Gaussian Process Regression

f ~ GP(μ(x), k(x,x'))

A distribution over functions defined by a kernel. Gives you not just a prediction but a principled uncertainty estimate — essential for active learning, Bayesian optimization, and safety-critical work. O(n³) cost limits it to small datasets.

Use when
Small n (< ~10k), you need calibrated uncertainty, smooth functions.
Careful
Kernel choice matters. Learn hyperparams by maximizing marginal likelihood.
Nonlinear · Universal approximator

Neural Network Regression

ŷ = Wₗσ(Wₗ₋₁σ(...σ(W₁x + b₁)))

Stacked affine + nonlinearity layers. Universal approximator given enough width/depth. Shines on high-dimensional, structured data (images, sequences, sensor streams). Hungry for data and compute; needs careful regularization.

Use when
Big data (>10⁴ samples), structured inputs, nonlinear with unknown shape.
Careful
Normalize inputs & outputs. Use early stopping, dropout, weight decay. Huber loss for noisy targets.
Nonlinear · Smooth

Splines / GAM

y = β₀ + Σⱼ fⱼ(xⱼ) + ε

Generalized Additive Models fit a smooth function of each feature and sum them. Interpretable like linear regression, flexible like trees, with per-feature partial effect plots. An underused middle ground.

Use when
You need nonlinearity and per-feature interpretability.
Tooling
R's mgcv, Python's pygam.
Instance-based · Lazy

k-Nearest Neighbors

ŷ(x) = avg(y of k nearest training points)

Almost embarrassingly simple. No training phase. Surprisingly strong baseline when you have dense, well-scaled features and enough data. Collapses in high dimensions (curse of dimensionality).

Use when
Quick baseline, low-dimensional problems, <10⁵ samples.
Careful
Always standardize. Distance metric & k both matter.
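
The whole method fits in one function. A minimal sketch assuming Euclidean distance and a plain (unweighted) average; `knn_predict` and the tiny dataset are made up for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Average the targets of the k nearest training points (Euclidean distance)."""
    # Standardize with training statistics so no feature dominates the distance.
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    Xt = (X_train - mu) / sigma
    Xq = (X_query - mu) / sigma
    # Pairwise distances, shape (n_query, n_train).
    dists = np.linalg.norm(Xq[:, None, :] - Xt[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0, 16.0])   # y = x^2

# The three nearest neighbors of 2.1 are x = 2, 3, 1, so the prediction is (4 + 9 + 1) / 3.
pred = knn_predict(X_train, y_train, np.array([[2.1]]), k=3)
print(pred)
```
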
04 / DATA

The data demands more than the algorithm does.

Most "model problems" are actually data problems. Before tuning anything, interrogate your data against the checklist below.

Sample size

A rough rule for linear models: you want at least 10–20 observations per predictor. For a 30-feature model, that's 300–600 rows minimum before inference is trustworthy. Tree ensembles and neural nets need more — typically thousands — because they have effectively many more parameters.

Feature scaling

Required for: Ridge, Lasso, SVR, k-NN, Neural Nets, PCA, GP. Not required for: OLS (mathematically), Trees, Random Forest, Gradient Boosting. When in doubt, standardize (zero mean, unit variance) — it never hurts for distance-based or gradient-based methods.

Missing values

Never silently drop rows — you might be dropping a pattern. Three defensible options: (1) imputation with median / mean / model-based (k-NN, MICE); (2) add a missingness indicator feature; (3) use an algorithm that handles NaN natively (LightGBM, XGBoost).

Outliers

OLS is brutally sensitive — a single extreme point can tilt the whole fit. If outliers are real and meaningful, use Huber loss or quantile regression. If they're errors, fix them in preprocessing. Don't just Winsorize without understanding why.
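
To see why Huber loss is gentler on outliers, compare it with squared error on a handful of residuals. This sketch uses the standard Huber definition with a threshold δ = 1; the residual values are invented, with the last one playing the outlier.

```python
import numpy as np

def huber(residual, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it."""
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

residuals = np.array([0.1, 0.5, 1.0, 10.0])  # made-up residuals; 10.0 is the outlier
squared = 0.5 * residuals ** 2
robust = huber(residuals, delta=1.0)

for r, s, h in zip(residuals, squared, robust):
    print(f"residual {r:5.1f}   squared {s:7.3f}   huber {h:6.3f}")
```

The two losses agree for small residuals; at the outlier, squared error contributes 50 while Huber contributes 9.5, so one bad point has far less pull on the fit.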

Multicollinearity

When features are highly correlated, OLS coefficients become unstable and their interpretation breaks down. Diagnose with VIF (variance inflation factor) — VIF > 10 is a red flag. Remedies: drop one of the pair, use Ridge, or do PCA.
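
VIF is simple enough to compute by hand: regress each feature on all the others and take 1 / (1 − R²). A minimal NumPy sketch on synthetic data, where `vif` is a hypothetical helper written for this example:

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(2)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # nearly duplicates a
c = rng.normal(size=500)                 # unrelated to a and b

vifs = vif(np.column_stack([a, b, c]))
print(vifs)  # a and b should be far above 10; c should sit near 1
```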

The four OLS assumptions

OLS inference (standard errors, p-values, CIs) relies on:

  1. Linearity — the relationship between features and target is truly linear. Check: residual vs. fitted plot should look like random noise, no curvature.
  2. Independence of errors — usually violated by time series or clustered data. Fix with GLS, mixed models, or explicit time-series methods.
  3. Homoscedasticity — residuals have constant variance. Check: no funnel shape in residual plot. Fix with log-transform or weighted least squares.
  4. Normality of residuals — needed for small-sample inference only. Check: Q-Q plot.

Train / validation / test

Split once at the start. A classic split: 60/20/20, or 80/20 with k-fold CV on the 80. For time series, always do chronological splits — random splits leak the future into the past.

Leakage — the silent killer

Any information at training time that wouldn't be available at prediction time is leakage. Common culprits: scaling on full data before split (test stats leak into train), target-derived features, future-looking features in time series, ID columns that encode the target. If your R² looks too good, you have leakage until proven otherwise.
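
The scaling culprit is the easiest one to fix: split first, compute the scaling statistics on the training rows only, then apply those same statistics everywhere. A minimal sketch with synthetic data and an 80/20 split by row order (as you would do for a chronological split):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))  # synthetic features

# 1. Split first. Row order stands in for time here.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]

# 2. Fit the scaler on the training rows only.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)

# 3. Apply the training statistics to both sets.
X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma

print("train mean after scaling:", X_train_s.mean(axis=0))
print("test mean after scaling: ", X_test_s.mean(axis=0))
```

The test set will not come out with exactly zero mean and unit variance, and that is the point: it is being treated as unseen data.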

Feature engineering beats model tuning

A thoughtful log transform, a well-placed interaction term, or a domain-informed basis (e.g. Park transform for motor control) will almost always yield more improvement than another round of hyperparameter sweeping. Spend 80% of your time understanding features; 20% on models.

A quick checklist

Before training: ✓ Plot every feature's distribution. ✓ Plot y vs each feature. ✓ Compute correlation / VIF. ✓ Identify missingness. ✓ Flag outliers. ✓ Decide your split before touching the model. ✓ Define success metric before training, not after.

05 / IMBALANCE

The 98/2 problem, properly treated.

Time to come back to the scenario from §01: 98% class 0, 2% class 1, and a model that needs to detect the minority. It's a classification problem with severe class imbalance — and that combination breaks more naive pipelines than almost any other data shape. Here's the full toolkit.

Trap #1 — the accuracy lie

A model that always predicts "class 0" is 98% accurate on your data and completely useless. Accuracy is the wrong metric for imbalanced problems — delete it from your vocabulary. Use precision, recall, F1, AUC-PR, and the confusion matrix instead.
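
The arithmetic behind the trap, spelled out on a made-up set of 1000 samples at a 98:2 ratio with a "model" that always predicts class 0:

```python
# 980 negatives, 20 positives; the model always predicts class 0.
y_true = [0] * 980 + [1] * 20
y_pred = [0] * 1000

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
# Precision is undefined with no positive predictions; report 0.0 by convention.
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.2f}  recall={recall:.2f}  precision={precision:.2f}")
```

Accuracy comes out at 0.98 while recall is 0: every one of the 20 positives is missed.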

Demo 02: Imbalance, thresholds, and what the model actually learns (interactive)
Readouts: Precision, Recall, F1, Accuracy

| | Pred: Neg | Pred: Pos |
|---|---|---|
| Actual: Neg | True Negative | False Positive |
| Actual: Pos | False Negative | True Positive |

Read the matrix: rows = truth, columns = prediction. The diagonal is what you want; the off-diagonal is your pain. In imbalance, False Negatives (missed positives) usually cost much more than False Positives.

Controls: Strategy, Decision threshold = 0.50, Class ratio (majority : minority) = 98:2

Try this: Leave on Baseline, set threshold to 0.5 — almost no positives detected. Now drop threshold to 0.2 — recall rockets up, precision collapses. Now try SMOTE at threshold 0.5 — the decision boundary shifts toward the minority. There's no "right answer" — the right threshold depends on how much you care about FP vs FN in your business.

The full toolkit, ranked by what I'd try first

| Technique | What it does | When to use | Caveats |
|---|---|---|---|
| Use the right metrics | Switch from accuracy to precision, recall, F1, AUC-PR, Matthews Correlation Coefficient. | Always. This isn't optional; it's step zero. | AUC-ROC can be misleading in extreme imbalance; AUC-PR is more honest. |
| Threshold tuning | Don't use the default 0.5 cutoff. Pick the threshold that optimizes your business metric on a validation set. | Always try this first; it's free. | You have to commit to what you care about (e.g. recall ≥ 0.8 at best precision). |
| Class weights | Tell the loss function that each minority example counts more. class_weight='balanced' in sklearn. | Simple, no data manipulation, works for most linear / tree models. | Can make training unstable. Combine with early stopping. |
| Cost-sensitive learning | Explicitly set the cost matrix: cost(FN), cost(FP). Closely related to class weights but driven by business cost. | When you can actually quantify the cost of a miss vs. a false alarm. | Getting the costs wrong is worse than not using them. |
| Random oversampling | Duplicate minority examples until classes are balanced. | Small data, quick baseline. | Can overfit to duplicated samples. Never apply before the train/test split. |
| SMOTE / ADASYN | Create synthetic minority samples by interpolating between minority neighbors. | Tabular data, decently separable classes. | Can produce noisy synthetic points in overlapping regions. Apply only to training fold. |
| Random undersampling | Throw away majority examples until balanced. | Huge majority class, compute-limited training. | You're discarding real data. Consider ensemble methods that undersample many times. |
| Focal loss | A modified cross-entropy that down-weights easy examples and focuses on hard ones. FL = -(1−p)^γ log(p). | Deep learning, very high imbalance (object detection's home). | Additional hyperparameter γ. Requires deep-learning framework. |
| Anomaly detection reframe | If minority is very rare (<1%), treat it as anomaly detection: Isolation Forest, One-Class SVM, autoencoder reconstruction. | Fraud detection, manufacturing defects, rare fault modes. | You train only on majority ("normal"). Evaluation still needs minority examples. |
| Get more minority data | Targeted data collection, label more positives, active learning. | When it's actually feasible. Often beats any algorithmic trick. | Expensive, slow. But usually the highest-leverage move. |
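
The focal loss formula in the table is compact enough to try by hand. A NumPy sketch of the binary case, reading p as the predicted probability of the true class; the class-balancing weight that often accompanies focal loss is left out here, and the labels and probabilities are invented.

```python
import numpy as np

def focal_loss(y_true, p_pred, gamma=2.0, eps=1e-12):
    """Mean binary focal loss, FL = -(1 - p_t)^gamma * log(p_t),
    where p_t is the predicted probability assigned to the true class."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    p_t = np.where(y_true == 1, p_pred, 1.0 - p_pred)
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

y = [1, 0, 1]
p = [0.9, 0.1, 0.2]   # two easy examples, one hard positive (made-up values)

ce = focal_loss(y, p, gamma=0.0)  # gamma = 0 recovers plain cross-entropy
fl = focal_loss(y, p, gamma=2.0)
print(f"cross-entropy={ce:.4f}  focal(gamma=2)={fl:.4f}")
```

With γ = 2 the two easy examples contribute almost nothing, so the hard positive accounts for nearly all of the loss.
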
AUC-PR: the honest metric under imbalance.
Train-only: resample after the split, never before.
Threshold ≠ 0.5: always tune on validation.

Trap #2 — resampling leakage

Apply SMOTE, oversampling, or undersampling only to the training fold, inside your cross-validation loop. If you resample on the full dataset before splitting, synthetic or duplicated samples leak into your test set and your evaluation becomes meaningless. Use imblearn.pipeline.Pipeline to do this safely.

Trap #3 — calibration

After resampling, your model's output probabilities no longer reflect the true population rate. If you need calibrated probabilities (for risk scoring, expected-cost calculations), recalibrate on a held-out set drawn from the original, unbalanced distribution, for example with Platt scaling or isotonic regression via sklearn's CalibratedClassifierCV.

A practical recipe for 98/2

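# A defensible baseline for 98/2 imbalance in sklearn.
# Assumes X (feature matrix) and y (0/1 labels) are already loaded.
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Hold out a stratified validation set for threshold tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# SMOTE lives inside the pipeline, so it only ever sees training folds.
pipeline = Pipeline([
    ('smote', SMOTE(sampling_strategy=0.3, random_state=42)),
    ('clf', GradientBoostingClassifier(n_estimators=300, max_depth=3))
])

# Stratified k-fold preserves class ratios in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Score with AUC-PR, not accuracy.
scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='average_precision')

# Fit on the training split, then tune the threshold on validation, not at 0.5.
pipeline.fit(X_train, y_train)
probs = pipeline.predict_proba(X_val)[:, 1]
p, r, th = precision_recall_curve(y_val, probs)
f1 = 2*p*r / (p + r + 1e-9)
# precision_recall_curve returns one more (p, r) point than thresholds; drop the last.
best_threshold = th[f1[:-1].argmax()]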
06 / METRICS

Measuring honestly.

The metric you choose decides what your model optimizes for. Pick it before you train, not after.

For regression

MSE — Mean Squared Error

Average of squared residuals. Penalizes large errors quadratically. Units are squared (awkward). The loss most algorithms actually optimize.

RMSE — Root Mean Squared Error

Square root of MSE. Same units as y. The go-to for reporting when error magnitude matters.

MAE — Mean Absolute Error

Average of absolute residuals. More robust to outliers than MSE, because errors are not squared. Reports "typical" error size, not "worst-case-ish" error size.

R² — Coefficient of determination

Proportion of variance explained: R² = 1 − SS_res / SS_tot. Dimensionless, 1.0 is perfect, 0 is as bad as predicting the mean, negative is possible (you did worse than the mean) and should make you reconsider your life choices.

MAPE — Mean Abs. Percentage Error

Scale-free, interpretable as a percent. Explodes when actual values are near zero. Use SMAPE or MASE instead when that's a concern.
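
All five regression metrics on one tiny, made-up example, so the definitions can be checked by hand:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])
resid = y_true - y_pred                      # [1, 0, -1, -3]

mse = np.mean(resid ** 2)                    # (1 + 0 + 1 + 9) / 4
rmse = np.sqrt(mse)
mae = np.mean(np.abs(resid))                 # (1 + 0 + 1 + 3) / 4

ss_res = np.sum(resid ** 2)                  # 11
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 20, since the mean of y_true is 6
r2 = 1.0 - ss_res / ss_tot

mape = np.mean(np.abs(resid / y_true)) * 100  # breaks down if any y_true is near zero

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.3f} MAPE={mape:.2f}%")
```

The single residual of −3 accounts for 9 of the 11 units of squared error but only 3 of the 5 units of absolute error, which is the RMSE-versus-MAE trade-off in miniature.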

For classification (esp. imbalanced)

Precision = TP / (TP + FP)

Of predicted positives, what fraction are actually positive? High precision = few false alarms.

Recall = TP / (TP + FN)

Of actual positives, what fraction did we catch? High recall = few misses. Also called sensitivity or true positive rate.

F1 = 2 · P · R / (P + R)

Harmonic mean of precision and recall. A single number that rewards models doing well on both. Default choice for imbalanced problems.

AUC-ROC

Area under the ROC curve (TPR vs. FPR). Probability the model ranks a random positive higher than a random negative. Overly optimistic under heavy imbalance — use AUC-PR instead.

AUC-PR (Average Precision)

Area under the Precision-Recall curve. The honest metric for imbalanced problems because it doesn't get inflated by the huge TN count.

Matthews Correlation Coefficient (MCC)

A single number in [−1, 1] that summarizes the whole confusion matrix. Arguably the most balanced single metric for imbalance. Underused.

Log Loss / Cross-Entropy

Penalizes confident wrong predictions heavily. Use when you need calibrated probabilities.
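
The same exercise for the classification metrics, starting from a hypothetical confusion matrix for 1000 samples at 98:2 (the counts are invented for illustration). MCC uses the standard formula (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

```python
import math

# Hypothetical confusion matrix: 20 actual positives, 980 actual negatives.
tp, fn, fp, tn = 15, 5, 30, 950

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

score = mcc(tp, tn, fp, fn)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F1={f1:.3f} MCC={score:.3f}")
```

Accuracy still looks excellent (0.965), while precision, F1, and MCC all show that two out of three alarms are false.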

Pick by asking

Regression: Should outliers not dominate the score? → MAE. Do large errors matter most? → RMSE. Relative errors matter? → MAPE / MASE. Need a dimensionless summary? → R².

Classification: Balanced classes? → Accuracy + F1. Imbalanced? → AUC-PR + F1 + threshold-specific precision/recall. Need probabilities? → Log loss + calibration plot.

07 / DECIDE

Which algorithm, actually?

A flowchart is never complete, but it's usually a faster starting point than staring at a textbook. Start at the top, answer honestly, end up somewhere reasonable.

The honest meta-advice

Pick two models from the flowchart that seem reasonable. Fit both with default hyperparameters. Compare on a held-out test set with the right metric. If a linear model is within 5% of your fanciest option, use the linear model — it's faster, interpretable, and has fewer ways to fail silently. Complexity should have to earn its keep.