A practical tour of regression — what it is, what it isn't, which algorithm to reach for, how much data you really need, and what to do when your labels are wildly imbalanced. With live demos you can poke at.
A scenario worth opening with: you have data that's 98% one class and 2% the other, and you want a model to detect the rare class. That is almost certainly a classification problem, not a regression one — but the confusion is extremely common and entirely reasonable, because a workhorse algorithm for exactly that problem is called logistic regression. Let's untangle it.
Logistic regression is a classification algorithm. The word "regression" in the name refers to the fact that it regresses (fits) a linear model in log-odds space — but its output is a probability between 0 and 1, which you threshold into a class. If someone says "our regression model predicts churn / fraud / failure as yes/no," they mean logistic regression, and they're doing classification.
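As a minimal sketch of what "linear in log-odds space" means: the model computes a weighted sum, squashes it with a sigmoid, and a threshold turns the probability into a class. The weights `w` and bias `b` below are hand-picked for illustration, not fitted to any data.

```python
import math

def sigmoid(z: float) -> float:
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Linear model in log-odds space, squashed to a probability."""
    log_odds = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(log_odds)

# Hypothetical parameters, chosen by hand for the example.
w, b = [1.5, -2.0], 0.25
p = predict_proba([2.0, 0.5], w, b)   # log-odds = 3.0 - 1.0 + 0.25 = 2.25
label = int(p >= 0.5)                 # thresholding is what makes it a classifier
```

The regression part is the linear fit of the log-odds; the classification part is only the final threshold.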
Cases like 98% negative / 2% positive are definitionally classification. True regression has no "positive" or "negative" — only values. Everything about the imbalance question — SMOTE, class weights, threshold tuning, precision-recall curves — lives on the classification side. We'll treat it fully in §05.
Regression is the problem of learning a function f: ℝⁿ → ℝ from observed pairs (xᵢ, yᵢ), so that ŷ = f(x) approximates the true, unknown relationship between inputs and a continuous response. The fit is governed by a loss function — most often squared error — and we want the learned f to generalize to inputs it hasn't seen.
Every regression method you'll meet is an answer to three questions, in some combination:
- What family of functions does f live in? (lines, polynomials, trees, splines, neural nets, Gaussian processes…)
- What loss measures how well f fits the data?
- How is f regularized so that it generalizes?

Those three levers (function family, loss, regularization) are the entire game.
Regression problems come in shapes. Knowing which shape you have before you pick an algorithm saves you a month of debugging.
| Problem shape | What it means | Typical example | What to reach for |
|---|---|---|---|
| Simple linear | One feature, approximately linear relationship | Efficiency vs. speed at fixed load | OLS; establishes baseline |
| Multiple linear | Many features, additive linear structure | Power consumption vs. speed, load, temp, humidity | OLS, Ridge if collinearity |
| Polynomial / basis | Smooth nonlinear in known form | Iron-loss vs. frequency (has known f and f² terms) | Polynomial regression, splines |
| General nonlinear | Unknown nonlinear shape, many features | Flux map ψ(id, iq) from measurements | Random Forest, Gradient Boosting, NN |
| Time series | Output depends on time / previous values | Stator temp over drive cycle | ARIMA, state-space, LSTM, TCN |
| Multi-output | Predict several continuous targets jointly | Predict id*, iq* for MTPA | Multi-task NN, Chained MOR, joint GP |
| Quantile | Predict a specific quantile, not the mean | P90 worst-case current for sizing | Quantile regression, quantile GBM |
| Count / rate | Response is a non-negative integer count | Faults per 1000 hours | Poisson / negative-binomial GLM |
| Survival | Time-to-event with censoring | Remaining useful life | Cox PH, AFT, DeepSurv |
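The quantile row in the table swaps squared error for an asymmetric loss. A minimal sketch of that pinball (quantile) loss follows; the data and the choice of q = 0.9 are made up for illustration.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss: under-predictions cost q, over-predictions cost (1 - q)."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        diff = y - y_hat
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)

y = [10.0, 12.0, 11.0, 30.0]
under = pinball_loss(y, [v - 2.0 for v in y], q=0.9)  # predictions 2 too low
over = pinball_loss(y, [v + 2.0 for v in y], q=0.9)   # predictions 2 too high
```

With q = 0.9, being too low costs nine times more than being too high, which is what pushes the fit toward the P90 level.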
Always start with the simplest model that could plausibly work. A linear model with well-engineered features beats a transformer on small tabular data about 90% of the time, and it tells you something about the world on the way there.
No algorithm dominates. Each is good at something, bad at something else. Here's what each one actually does, when to reach for it, and what breaks it.
Watch: with Poly deg 15 and low noise, the curve snakes through every training point but fails wildly on test. That's overfitting — the classic failure mode that regularization is designed to fix. Turn up λ with Ridge or Lasso selected and watch the fit pull back toward sanity.
Ordinary least squares (OLS): The granddaddy. A closed-form solution exists whenever XᵀX is invertible: β̂ = (XᵀX)⁻¹Xᵀy. Every prediction is a weighted sum of features. Coefficients have units and meaning.
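For a single feature plus an intercept, the closed form reduces to two familiar formulas: slope = Sxy / Sxx and intercept = ȳ − slope · x̄. A minimal sketch on toy data (the numbers are invented so the answer is easy to check):

```python
def fit_simple_ols(x, y):
    """Closed-form least squares for y ≈ intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]   # exactly y = 1 + 2x
intercept, slope = fit_simple_ols(x, y)
```

The slope carries the units of y per unit of x, which is why OLS coefficients are directly interpretable.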
Ridge regression: OLS with an L2 penalty on the coefficients. Stabilizes the fit under collinearity and when XᵀX is ill-conditioned. Shrinks coefficients toward zero but rarely makes them exactly zero.
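A sketch of the shrinkage effect in the simplest case: one feature, an unpenalized intercept, and the objective Σ(y − a − b·x)² + λ·b². Under those assumptions the slope has the closed form Sxy / (Sxx + λ). The toy data and the λ values are illustrative.

```python
def ridge_slope(x, y, lam):
    """Ridge slope for one feature with an unpenalized intercept."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / (sxx + lam)

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]   # OLS slope is 2
slopes = {lam: ridge_slope(x, y, lam) for lam in (0.0, 10.0, 100.0)}
```

The slope shrinks as λ grows but only reaches zero in the limit, which matches the "rarely exactly zero" behavior described above.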
Lasso: An L1 penalty produces sparse solutions by driving some coefficients to exactly zero. Doubles as a feature-selection method. Less stable than Ridge when features are highly correlated.
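Why the L1 penalty yields exact zeros can be seen in the one-feature case. Assuming a centered feature and the objective 0.5·Σ(y − b·x)² + λ·|b| (note that libraries often scale the objective differently, so this λ is not a drop-in for their penalty parameter), the minimizer is a soft-thresholded version of the OLS numerator:

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; return exactly 0 inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_slope(sxy, sxx, lam):
    """Minimizer of 0.5 * sum((y - b*x)^2) + lam * |b| for one centered feature."""
    return soft_threshold(sxy, lam) / sxx

strong = lasso_slope(sxy=20.0, sxx=10.0, lam=5.0)   # shrunk from 2.0 to 1.5
weak = lasso_slope(sxy=3.0, sxx=10.0, lam=5.0)      # weak signal is zeroed out
```

A feature whose correlation with the target falls below the penalty is removed entirely, which is the feature-selection behavior.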
Elastic Net: Convex combination of the Lasso (L1) and Ridge (L2) penalties. Gets sparsity from L1 and grouping / stability from L2. In practice, often the default choice for regularized linear regression on real data.
Polynomial regression: Still linear regression, just with x², x³, … as additional features. Higher degree means more flexibility, but the risk of overfitting climbs quickly past degree 3. Splines are almost always the better choice.
Random forest: Average of many decorrelated decision trees. Handles nonlinearity and interactions automatically. No scaling needed. Robust to outliers in x. Cannot extrapolate beyond the training range.
Gradient boosting: Trees built sequentially, each correcting the residuals of the previous ones. Typically the winner in tabular Kaggle competitions. Very sensitive to hyperparameters but rewards careful tuning.
Support vector regression (SVR): Kernel method that fits a tube of half-width ε around the function; training points inside the tube incur no loss. With an RBF kernel, handles smooth nonlinearity. Scales poorly past ~10k samples.
Gaussian process (GP) regression: A distribution over functions defined by a kernel. Gives you not just a prediction but a principled uncertainty estimate, which is essential for active learning, Bayesian optimization, and safety-critical work. The O(n³) training cost limits exact GPs to small datasets.
Neural networks: Stacked affine + nonlinearity layers. Universal approximator given enough width/depth. Shines on high-dimensional, structured data (images, sequences, sensor streams). Hungry for data and compute; needs careful regularization.
Generalized Additive Models fit a smooth function of each feature and sum them. Interpretable like linear regression, flexible like trees, with per-feature partial effect plots. An underused middle ground.
k-nearest neighbors (k-NN): Almost embarrassingly simple. No training phase. Surprisingly strong baseline when you have dense, well-scaled features and enough data. Collapses in high dimensions (curse of dimensionality).
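The "no training phase" point is easiest to see in code: prediction is just a lookup and an average. A minimal sketch using Euclidean distance and an unweighted mean (both are choices made for this example), on invented points:

```python
import math

def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k training points closest to the query."""
    by_distance = sorted(
        (math.dist(row, query), target) for row, target in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)

train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_y = [1.0, 2.0, 3.0, 10.0, 12.0]
near_origin = knn_predict(train_x, train_y, (0.2, 0.2), k=3)
```

Because everything hinges on distances, unscaled features or many irrelevant dimensions distort which neighbors count as "near".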
Most "model problems" are actually data problems. Before tuning anything, interrogate your data against the checklist below.
A rough rule for linear models: you want at least 10–20 observations per predictor. For a 30-feature model, that's 300–600 rows minimum before inference is trustworthy. Tree ensembles and neural nets need more — typically thousands — because they have effectively many more parameters.
Feature scaling: Required for Ridge, Lasso, SVR, k-NN, neural nets, PCA, and GPs. Not required for OLS (mathematically), trees, random forests, or gradient boosting. When in doubt, standardize (zero mean, unit variance); it never hurts distance-based or gradient-based methods.
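A minimal sketch of standardization done safely: the mean and standard deviation come from the training column only and are then reused on the test column. The helper names `fit_scaler` and `transform` are hypothetical, and the numbers are toy values.

```python
import statistics

def fit_scaler(column):
    """Learn mean and standard deviation from the training column only."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        std = 1.0   # constant column: avoid dividing by zero
    return mean, std

def transform(column, mean, std):
    """Apply previously learned statistics to any column."""
    return [(v - mean) / std for v in column]

train = [10.0, 20.0, 30.0, 40.0]
test = [50.0, 0.0]
mean, std = fit_scaler(train)          # statistics come from train only
train_z = transform(train, mean, std)
test_z = transform(test, mean, std)    # test reuses the train statistics
```

Test values can land outside the training range after scaling; that is expected and is not a reason to refit the scaler on test data.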
Never silently drop rows — you might be dropping a pattern. Three defensible options: (1) imputation with median / mean / model-based (k-NN, MICE); (2) add a missingness indicator feature; (3) use an algorithm that handles NaN natively (LightGBM, XGBoost).
OLS is brutally sensitive — a single extreme point can tilt the whole fit. If outliers are real and meaningful, use Huber loss or quantile regression. If they're errors, fix them in preprocessing. Don't just Winsorize without understanding why.
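To see why Huber loss is gentler on outliers, compare it with squared error on a small and a large residual. This is a sketch of the standard Huber formula; the threshold `delta=1.0` is an arbitrary choice for the example.

```python
def squared_loss(residual):
    return 0.5 * residual * residual

def huber_loss(residual, delta=1.0):
    """Quadratic near zero, linear beyond delta."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

small, large = 0.5, 10.0
small_pair = (squared_loss(small), huber_loss(small))   # identical inside delta
large_pair = (squared_loss(large), huber_loss(large))   # 50.0 vs 9.5
```

The outlier's pull on the fit grows linearly instead of quadratically, so one extreme point can no longer tilt the whole line.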
When features are highly correlated, OLS coefficients become unstable and their interpretation breaks down. Diagnose with VIF (variance inflation factor) — VIF > 10 is a red flag. Remedies: drop one of the pair, use Ridge, or do PCA.
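In general VIF for feature j is 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing feature j on the others. With only two predictors that R² is simply the squared Pearson correlation, which makes for a compact sketch. The feature names and values below are invented.

```python
def pearson_r(a, b):
    """Pearson correlation of two equal-length sequences."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    sab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    saa = sum((x - a_bar) ** 2 for x in a)
    sbb = sum((y - b_bar) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

def vif_two_features(a, b):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r * r)

speed = [1.0, 2.0, 3.0, 4.0, 5.0]
load = [1.1, 1.9, 3.2, 3.9, 5.1]   # nearly a copy of speed
temp = [3.0, 1.0, 4.0, 1.0, 5.0]   # only weakly related to speed

vif_high = vif_two_features(speed, load)
vif_low = vif_two_features(speed, temp)
```

Here `vif_high` lands far above the red-flag level of 10, while `vif_low` stays close to 1.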
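OLS inference (standard errors, p-values, CIs) relies on the classical assumptions: a relationship that is linear in the parameters, independent errors, constant error variance (homoscedasticity), and approximately normal residuals.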
Split once at the start. A classic split: 60/20/20, or 80/20 with k-fold CV on the 80. For time series, always do chronological splits — random splits leak the future into the past.
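A minimal sketch of a chronological 60/20/20 split, assuming the rows are already sorted by time. The helper name `chronological_split` and the toy rows are illustrative.

```python
def chronological_split(rows, train_frac=0.6, val_frac=0.2):
    """Split time-ordered rows into train/val/test without shuffling."""
    n = len(rows)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]

# (timestamp, value) pairs, already in time order
rows = [(t, 0.5 * t) for t in range(10)]
train, val, test = chronological_split(rows)
```

Every validation timestamp is later than every training timestamp, and every test timestamp is later still, so nothing from the future reaches the model during fitting.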
Any information at training time that wouldn't be available at prediction time is leakage. Common culprits: scaling on full data before split (test stats leak into train), target-derived features, future-looking features in time series, ID columns that encode the target. If your R² looks too good, you have leakage until proven otherwise.
A thoughtful log transform, a well-placed interaction term, or a domain-informed basis (e.g. Park transform for motor control) will almost always yield more improvement than another round of hyperparameter sweeping. Spend 80% of your time understanding features; 20% on models.
Before training: ✓ Plot every feature's distribution. ✓ Plot y vs each feature. ✓ Compute correlation / VIF. ✓ Identify missingness. ✓ Flag outliers. ✓ Decide your split before touching the model. ✓ Define success metric before training, not after.
Time to come back to the scenario from §01: 98% class 0, 2% class 1, and a model that needs to detect the minority. It's a classification problem with severe class imbalance — and that combination breaks more naive pipelines than almost any other data shape. Here's the full toolkit.
A model that always predicts "class 0" is 98% accurate on your data and completely useless. Accuracy is the wrong metric for imbalanced problems — delete it from your vocabulary. Use precision, recall, F1, AUC-PR, and the confusion matrix instead.
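The accuracy trap takes a few lines to reproduce. The sketch below builds a synthetic 98/2 label vector and scores a "model" that always answers 0.

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1] * 20 + [0] * 980      # 2% positives
always_zero = [0] * 1000           # the useless "model"
tp, fp, fn, tn = confusion(y_true, always_zero)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
```

Accuracy comes out at 0.98 while recall is 0.0: every positive is missed, and accuracy alone gives no hint of it.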
Read the matrix: rows = truth, columns = prediction. The diagonal is what you want; the off-diagonal is your pain. In imbalance, False Negatives (missed positives) usually cost much more than False Positives.
Try this: Leave on Baseline, set threshold to 0.5 — almost no positives detected. Now drop threshold to 0.2 — recall rockets up, precision collapses. Now try SMOTE at threshold 0.5 — the decision boundary shifts toward the minority. There's no "right answer" — the right threshold depends on how much you care about FP vs FN in your business.
| Technique | What it does | When to use | Caveats |
|---|---|---|---|
| Use the right metrics | Switch from accuracy to precision, recall, F1, AUC-PR, Matthews Correlation Coefficient. | Always. This isn't optional — it's step zero. | AUC-ROC can be misleading in extreme imbalance; AUC-PR is more honest. |
| Threshold tuning | Don't use the default 0.5 cutoff. Pick the threshold that optimizes your business metric on a validation set. | Always try this first — it's free. | You have to commit to what you care about (e.g. recall ≥ 0.8 at best precision). |
| Class weights | Tell the loss function that each minority example counts more. class_weight='balanced' in sklearn. | Simple, no data manipulation, works for most linear / tree models. | Can make training unstable. Combine with early stopping. |
| Cost-sensitive learning | Explicitly set the cost matrix: cost(FN), cost(FP). Closely related to class weights but driven by business cost. | When you can actually quantify the cost of a miss vs. a false alarm. | Getting the costs wrong is worse than not using them. |
| Random oversampling | Duplicate minority examples until classes are balanced. | Small data, quick baseline. | Can overfit to duplicated samples. Never apply before the train/test split. |
| SMOTE / ADASYN | Create synthetic minority samples by interpolating between minority neighbors. | Tabular data, decently separable classes. | Can produce noisy synthetic points in overlapping regions. Apply only to training fold. |
| Random undersampling | Throw away majority examples until balanced. | Huge majority class, compute-limited training. | You're discarding real data. Consider ensemble methods that undersample many times. |
| Focal loss | A modified cross-entropy that down-weights easy examples and focuses on hard ones. FL = -(1−p)^γ log(p), where p is the predicted probability of the true class. | Deep learning, very high imbalance (object detection's home). | Additional hyperparameter γ. Requires deep-learning framework. |
| Anomaly detection reframe | If minority is very rare (<1%), treat it as anomaly detection: Isolation Forest, One-Class SVM, autoencoder reconstruction. | Fraud detection, manufacturing defects, rare fault modes. | You train only on majority ("normal"). Evaluation still needs minority examples. |
| Get more minority data | Targeted data collection, label more positives, active learning. | When it's actually feasible. Often beats any algorithmic trick. | Expensive, slow. But usually the highest-leverage move. |
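The focal loss row is easier to judge with numbers. A per-example sketch of FL = -(1−p)^γ log(p), where p is the predicted probability of the true class; γ = 2 is used here only as an illustrative setting.

```python
import math

def focal_loss(p_true_class, gamma=2.0):
    """Focal loss for one example, given the probability assigned to its true class."""
    return -((1.0 - p_true_class) ** gamma) * math.log(p_true_class)

easy = focal_loss(0.95)                 # confident and correct
hard = focal_loss(0.10)                 # badly misclassified
ce_easy = focal_loss(0.95, gamma=0.0)   # gamma = 0 recovers plain cross-entropy
```

The easy example's loss is scaled down by (0.05)² while the hard one keeps most of its weight, so the many easy majority examples stop dominating the gradient.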
Apply SMOTE, oversampling, or undersampling only to the training fold, inside your cross-validation loop. If you resample on the full dataset before splitting, synthetic or duplicated samples leak into your test set and your evaluation becomes meaningless. Use imblearn.pipeline.Pipeline to do this safely.
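To make the interpolation idea concrete, here is a simplified SMOTE-like sketch in plain Python. It is not imblearn's implementation: the function name `smote_like`, the neighbor count, and the toy points are assumptions for illustration. Note that it is only ever handed minority points from the training fold.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create synthetic points on segments between a minority point and a near neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbors)
        gap = rng.random()   # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Minority rows taken from the training fold only, never from validation or test.
train_minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.8)]
new_points = smote_like(train_minority, n_new=5)
```

Every synthetic point lies between two real minority points, which is also why resampling before the split would plant near-copies of test rows in the training set.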
After resampling, your model's output probabilities no longer reflect the true population rate. If you need calibrated probabilities (for risk scoring, expected-cost calculations), recalibrate on a held-out set drawn from the original, unbalanced distribution, for example with CalibratedClassifierCV (method='sigmoid' is Platt scaling; method='isotonic' is the non-parametric alternative).
# A defensible baseline for 98/2 imbalance in sklearn.
# Assumes X, y (all labelled data) and a stratified split into
# X_train, y_train, X_val, y_val already exist (X_train / y_train are assumed names).
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# imblearn's Pipeline runs SMOTE only during fit, so each CV fold
# is resampled on its training part and scored on untouched data.
pipeline = Pipeline([
    ('smote', SMOTE(sampling_strategy=0.3, random_state=42)),
    ('clf', GradientBoostingClassifier(n_estimators=300, max_depth=3)),
])

# Stratified k-fold preserves class ratios in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Score with AUC-PR, not accuracy.
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='average_precision')

# Fit on the training split, then tune the threshold on validation, not at 0.5.
pipeline.fit(X_train, y_train)
probs = pipeline.predict_proba(X_val)[:, 1]
p, r, th = precision_recall_curve(y_val, probs)
f1 = 2*p*r / (p + r + 1e-9)
# p and r have one more entry than th, so drop the last F1 value before argmax.
best_threshold = th[f1[:-1].argmax()]
The metric you choose decides what your model optimizes for. Pick it before you train, not after.
MSE (mean squared error): Average of squared residuals. Penalizes large errors quadratically. Units are squared (awkward). The loss most algorithms actually optimize.
RMSE (root mean squared error): Square root of MSE. Same units as y. The go-to for reporting when error magnitude matters.
MAE (mean absolute error): Average of absolute residuals. Robust to outliers. Reports "typical" error size, not "worst-case-ish" error size.
R² (coefficient of determination): Proportion of variance explained: R² = 1 − SS_res / SS_tot. Dimensionless; 1.0 is perfect, 0 is no better than predicting the mean, and negative is possible (you did worse than the mean) and should make you reconsider your life choices.
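A short sketch of the R² formula, including the negative case, on made-up numbers:

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
r2_mean = r_squared(y_true, [3.0] * 5)                  # always predict the mean
r2_bad = r_squared(y_true, [5.0, 4.0, 3.0, 2.0, 1.0])   # trend predicted backwards
```

Predicting the mean gives exactly 0, and the reversed-trend predictions give −3, far worse than the mean baseline.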
MAPE (mean absolute percentage error): Scale-free, interpretable as a percent. Explodes when actual values are near zero. Use SMAPE or MASE instead when that's a concern.
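The blow-up near zero is easy to demonstrate. The sketch below uses one common SMAPE definition (conventions vary between sources) and two invented observations, one of which is close to zero.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def smape(y_true, y_pred):
    """Symmetric MAPE, in percent; bounded above by 200 under this definition."""
    return 100.0 * sum(
        2.0 * abs(p - t) / (abs(t) + abs(p)) for t, p in zip(y_true, y_pred)
    ) / len(y_true)

y_true = [100.0, 0.01]
y_pred = [110.0, 1.01]        # absolute errors of 10 and 1
mape_value = mape(y_true, y_pred)
smape_value = smape(y_true, y_pred)
```

The near-zero actual value alone pushes MAPE into the thousands of percent even though its absolute error is the smaller of the two, while SMAPE stays bounded.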
Precision: Of predicted positives, what fraction are actually positive? High precision = few false alarms.
Recall: Of actual positives, what fraction did we catch? High recall = few misses. Also called sensitivity or true positive rate.
F1 score: Harmonic mean of precision and recall. A single number that rewards models doing well on both. Default choice for imbalanced problems.
AUC-ROC: Area under the ROC curve (TPR vs. FPR). Equals the probability that the model ranks a random positive higher than a random negative. Overly optimistic under heavy imbalance; use AUC-PR instead.
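The ranking interpretation can be computed directly by comparing every positive with every negative. This brute-force sketch is quadratic in the data size, so it is meant for intuition rather than production use; the labels and scores are toy values.

```python
def auc_by_pairs(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_by_pairs(y_true, scores)   # 3 of the 4 pairs are ordered correctly
```

With a huge negative class, most of those pairs are easy wins, which is how the number stays high even when precision is poor.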
AUC-PR: Area under the Precision-Recall curve. The honest metric for imbalanced problems because, unlike AUC-ROC, it isn't inflated by the huge TN count.
MCC (Matthews correlation coefficient): A single number in [−1, 1] that summarizes the whole confusion matrix. Arguably the most balanced single metric for imbalance. Underused.
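A sketch of the Matthews correlation coefficient computed from confusion-matrix counts. Returning 0 when the denominator vanishes is a common convention, adopted here as an assumption; the counts are invented for a 98/2 dataset.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0   # convention when a whole row or column is empty
    return (tp * tn - fp * fn) / denom

always_zero = mcc(tp=0, fp=0, fn=20, tn=980)   # 98% accurate, MCC of 0
decent = mcc(tp=15, fp=30, fn=5, tn=950)       # catches 15 of 20 positives
```

The always-negative model scores 0 despite its 98% accuracy, while the second model, which is less accurate overall, scores close to 0.49.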
Log loss (cross-entropy): Penalizes confident wrong predictions heavily. Use when you need calibrated probabilities.
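How heavily "confident and wrong" is punished shows up in a three-line comparison. A minimal sketch of binary log loss; the clipping constant `eps` is a common numerical safeguard, chosen arbitrarily here.

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Mean binary cross-entropy; probs are predicted P(y = 1)."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)   # keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

confident_wrong = log_loss([1], [0.01])   # says 1% for a true positive
hedged_wrong = log_loss([1], [0.40])      # wrong side of 0.5, but hedged
confident_right = log_loss([1], [0.99])
```

The confident miss costs about five times as much as the hedged one, so a model is rewarded for reporting probabilities it can back up.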
Regression: Outliers shouldn't dominate the score? → MAE. Large errors matter most? → RMSE. Relative errors matter? → MAPE / MASE. Need a dimensionless summary? → R².
Classification: Balanced classes? → Accuracy + F1. Imbalanced? → AUC-PR + F1 + threshold-specific precision/recall. Need probabilities? → Log loss + calibration plot.
A flowchart is never complete, but it's usually a faster starting point than staring at a textbook. Start at the top, answer honestly, end up somewhere reasonable.
Pick two models from the flowchart that seem reasonable. Fit both with default hyperparameters. Compare on a held-out test set with the right metric. If a linear model is within 5% of your fanciest option, use the linear model — it's faster, interpretable, and has fewer ways to fail silently. Complexity should have to earn its keep.