An interactive tour of the algorithms that fit models and minimize loss — from closed-form Least Squares to adaptive gradient methods. Every visualization runs real math in your browser: no mocked trajectories, no pre-baked curves.
Nearly every fitting, learning, and control problem can be written as:
where θ is the parameter vector we are free to choose and J(θ) is the cost (or loss, or objective). The entire zoo of methods differs only in how they search for θ★.
Solve ∇J(θ) = 0 directly. Works when the problem is linear in parameters and the cost is quadratic. Examples: Least Squares, Wiener filter.
Move in the direction of steepest descent. Scales to millions of parameters. Examples: GD, SGD, Adam, AdaGrad.
Use the Hessian (or an approximation) for better step direction. Examples: Newton, Gauss–Newton, Levenberg–Marquardt, BFGS.
Three properties of the problem determine which algorithm you should reach for:
| Property | What it means | Affects |
|---|---|---|
| Convexity | Does the cost have one global minimum or many local ones? | Whether first-order methods converge to the global optimum. |
| Conditioning | Ratio of largest to smallest curvature (the condition number κ). | Convergence speed. Ill-conditioned problems crawl. |
| Data access | All samples at once (batch) or one-at-a-time (streaming)? | Choice between LS / RLS, GD / SGD. |
| Dimensionality | How large is n? | Second-order methods cost O(n³); first-order methods O(n). |
| Noise | Is ∇J exact or stochastic? | Need variance-reducing tricks (momentum, mini-batches). |
Optimization = descending a mountain blindfolded. You can feel the slope under your feet (gradient), sometimes you can feel the curvature (Hessian), and in special cases (quadratic bowls) you can solve for the bottom analytically.
The oldest trick in the book (Gauss, 1795). Given data pairs (x_i, y_i), find parameters θ of a model ŷ = φ(x)ᵀ θ that minimize the sum of squared residuals:
Stacking φ(x_i)ᵀ as rows of the regressor matrix X ∈ ℝ^{N×n}, setting ∇J = 0 gives the normal equations:
Click on the plot to place noisy data points. Change the polynomial degree and watch the fit — and the condition number of XᵀX — change. Overfit with degree 10 to see what happens.
When samples have unequal reliability (e.g. varying sensor noise), weight them with W = diag(w_i):
Rule of thumb: w_i = 1/σ_i² gives the maximum-likelihood estimate under Gaussian noise.
Add λ‖θ‖² to the cost. Shrinks parameters, stabilizes ill-conditioned XᵀX:
Try it in the widget above — large λ makes the polynomial smoother.
Penalty λ‖θ‖₁ drives coefficients to exactly zero → sparse solutions for feature selection. No closed form; solved via coordinate descent or ISTA.
Accounts for noise in both X and y (errors-in-variables). Solved via SVD — pick the right-singular vector of [X y] corresponding to the smallest singular value.
Forming XᵀX squares the condition number. For ill-conditioned regressors, use QR decomposition (X = QR, solve Rθ = Qᵀy) or SVD. In Python: np.linalg.lstsq. In MATLAB: the backslash operator X\y.
✓ Always center and scale your features before fitting.
✓ Check cond(X) — if > 10⁸, you need ridge or fewer features.
✓ Quadratic cost + linear model → use LS. Don't iterate.
✓ For real-time estimation use RLS (next section), not re-fitting.
Batch LS requires all N samples upfront. In control, adaptive filtering, and online identification we instead want to update θ̂ as each new sample arrives — and optionally forget old data so we track time-varying parameters.
Given a new sample (φ_k, y_k), forgetting factor λ ∈ (0,1], and previous estimate θ̂_{k-1} with covariance P_{k-1}:
Start with P_0 = α I for large α (≈ 10³–10⁶) to express initial uncertainty, and θ̂_0 = 0 or a best guess.
The true parameter a(t) drifts over time. Batch LS is trapped by stale data; RLS with a forgetting factor can track the change. Play with λ: smaller = faster tracking but noisier.
| λ | Effective memory (≈ 1/(1-λ)) | Behavior | Typical use |
|---|---|---|---|
| 1.000 | ∞ | Pure integrator — converges to batch LS | Stationary parameters |
| 0.999 | 1000 samples | Very slow forgetting, low variance | Slowly drifting physical constants |
| 0.99 | 100 samples | Default for most systems | General-purpose |
| 0.95 | 20 samples | Fast tracking, noisier | Rapid disturbances |
| < 0.9 | < 10 | Covariance can explode (windup) | Use with caution + UD factorization |
If the regressor φ_k stops exciting all directions (e.g., constant input), P_k grows unboundedly along unexcited directions. When an excitation finally arrives, the estimate can jump wildly. Fixes: directional forgetting, bounded-gain RLS, or the exponentially-weighted covariance reset.
For RLS to identify all parameters, the regressor must be persistently exciting: roughly, it must span ℝ^n over any finite window. In practice — inject a small dither signal or use naturally rich inputs (PRBS, swept sinusoids).
When the cost is not quadratic in θ — or when XᵀX is too large to invert — solve iteratively by moving in the direction of steepest descent:
The single hyperparameter is the learning rate (or step size) α. Too small → crawl. Too large → diverge. Just right → convergence, but at what rate?
| Problem class | Best α | Iterations to ε | Rate |
|---|---|---|---|
| Strongly convex, smooth | 2/(L+μ) | O(κ·log(1/ε)) | Linear |
| Convex, smooth | 1/L | O(1/ε) | Sublinear |
| Non-convex, smooth | 1/L | O(1/ε²) to ∇J≈0 | Sublinear to stationarity |
Here L is the Lipschitz constant of ∇J, μ is the strong-convexity constant, and κ = L/μ is the condition number. Ill-conditioned problems converge slowly because κ is large.
Watch how step size and curvature interact. The quadratic bowl with adjustable eccentricity directly shows the condition-number effect; Rosenbrock shows how narrow valleys trap vanilla GD.
For a quadratic with Hessian eigenvalues λ_max = L, λ_min = μ:
α < 2/L → convergence guaranteedα = 2/(L+μ) → optimal rateα > 2/L → divergenceIn practice you estimate L via line search or backtracking (Armijo rule).
Real-world costs are almost always a sum over data:
Computing the full gradient ∇J costs O(N) per step. With N in the millions, that is prohibitive. The remedy:
Exact gradient. Slow per step, smooth convergence. Used when N is tiny.
Unbiased but noisy gradient. N× faster per step — at the price of variance.
The standard for deep learning. |B| = 32, 64, …, 512 balances noise vs throughput.
Same problem, same learning rate, three trajectories. Note how SGD converges in expectation but oscillates around the optimum; full-batch marches straight there but one step takes many times more FLOPs.
Constant α causes SGD to oscillate forever (variance of gradient estimate never vanishes). Decay the step size to reach the true optimum:
| Schedule | Formula | Use case |
|---|---|---|
| Inverse time | α_k = α₀ / (1 + kτ) | Theoretical guarantees (Robbins–Monro) |
| Step decay | α_k = α₀ · γ^⌊k/s⌋ | Classic "÷ 10 every 30 epochs" |
| Cosine | α_k = α_min + ½(α₀ - α_min)(1+cos(πk/K)) | Modern deep learning default |
| Warmup + decay | Linear ramp → cosine | Transformers, large batches |
| OneCycle | Triangular up then down | Fast-AI, super-convergence |
For SGD to converge almost-surely to a stationary point, the step sizes must satisfy:
Example: α_k = 1/k qualifies. Constant α does not — it gives only a noise ball of radius O(√α).
Vanilla GD ignores past gradients and has no sense of per-parameter scale. Three ideas — momentum, adaptive learning rates, and their combination — dominate deep learning.
Average recent gradients to build velocity through narrow valleys:
Typical β = 0.9. This turns first-order descent into a damped harmonic oscillator — the step in each eigendirection is amplified by a factor 1/(1-β) when gradients are consistent, and damped when they oscillate.
Lookahead trick: evaluate the gradient at where you'd be after a momentum step.
Gives a provably better O(1/k²) rate on smooth convex problems (vs O(1/k) for plain GD).
Parameters with historically large gradients get smaller steps; rare-feature parameters get larger steps. Excellent for sparse problems. Weakness: G_k only grows, so the effective learning rate decays to zero — deep nets stall.
Use an EMA (γ ≈ 0.9) instead of a cumulative sum. Now adaptive-LR works even on non-stationary objectives.
Defaults (Kingma & Ba, 2014): α = 10⁻³, β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸. These rarely need tuning.
All five run simultaneously on the same cost surface with reasonable defaults. Try Rosenbrock — momentum methods escape the banana-shaped valley; vanilla GD does not.
Standard Adam with L2 regularization is subtly broken: the regularizer gets rescaled by √v̂, which makes it inconsistent. AdamW applies weight decay directly to the parameters:
AdamW is the modern default for Transformers and most deep-learning tasks.
First-order methods use only ∇J. Second-order methods additionally exploit curvature H = ∇²J, giving a dramatically better step direction.
Near the optimum, converges quadratically: the error squares each iteration. For a quadratic cost, Newton reaches the optimum in one step. The price is O(n³) per iteration to factor the Hessian — unaffordable for large n.
When J(θ) = ½‖r(θ)‖² (nonlinear least squares), drop the second-derivative term of H:
where J_r is the Jacobian of the residuals. Cheaper and often more stable than full Newton.
Adaptive blend of Gauss–Newton and gradient descent:
μ small → Gauss–Newton (fast local convergence). μ large → gradient descent (robust). Trust-region algorithms adjust μ based on step quality. The go-to method for nonlinear LS in practice (used by every curve-fitting toolbox).
Build an approximation B_k ≈ H_k from gradient history using rank-1 or rank-2 updates — avoiding Hessian computation entirely.
L-BFGS (Limited-memory BFGS) stores only the last m vector pairs (m ≈ 5–20). Memory O(mn) instead of O(n²). For moderate-size convex problems, L-BFGS is often the best choice.
lsqcurvefit and scipy.optimize.least_squares call.| Method | Default α | Sweet-spot range | Failure signal |
|---|---|---|---|
| SGD (no momentum) | 0.01 | 10⁻³ – 10⁻¹ | Loss oscillates → decay |
| SGD + momentum | 0.1 | 10⁻² – 1.0 | NaNs → cut by 10× |
| Adam / AdamW | 10⁻³ | 10⁻⁴ – 5·10⁻³ | Validation diverges → wd or smaller α |
| RMSProp | 10⁻³ | 10⁻⁴ – 10⁻² | Same as Adam |
| L-BFGS | α=1.0 | Line search picks | Rare; usually a bad gradient |
| Batch size | Noise | Throughput | Best LR (linear scaling) | Notes |
|---|---|---|---|---|
| 1 (pure SGD) | Highest | Lowest | Smallest | Acts as implicit regularizer |
| 32 – 256 | Moderate | Good GPU utilization | 1× – 4× base | Deep-learning default |
| 512 – 8192 | Low | Fastest wall-clock | Needs warmup | Large-scale training |
| Full batch | None | Per-sample fastest | Same as GD | Losses generalization benefits |
Linear scaling rule (Goyal et al., 2017): when multiplying batch size by k, multiply LR by k — and add a warmup period.
Cause: LR too high, gradient explosion, log(0), division by zero.
Fix: Gradient clipping (clip-by-norm, typically 1.0), reduce α by 10×, add small ε inside logs/sqrts, check for mixed precision overflow.
Cause: LR too high (noise ball), saddle point, or model capacity too low.
Fix: LR decay, momentum ≥ 0.9, or bigger model. Check training vs validation gap.
Cause: LR too low, vanishing gradients, or ill-conditioning.
Fix: LR warmup, batch norm / layer norm, residual connections, better init.
Cause: Overfitting — not an optimizer problem.
Fix: Weight decay (AdamW), dropout, data augmentation, early stopping.
Cause: λ too small, loss of excitation, ill-conditioned regressor.
Fix: Directional forgetting, bound tr(P), UD factorization, raise λ toward 1.
Cause: Multicollinearity, cond(X) huge.
Fix: Ridge (λI), drop correlated features, center + scale, use SVD pseudo-inverse.
log Σ exp(x_i) = m + log Σ exp(x_i − m) with m = max(x_i). Avoids overflow.solve() over inv(). Prefer Cholesky for SPD matrices, QR/SVD otherwise.g ← g · min(1, τ/‖g‖). Standard τ = 1.0 for RNNs, 5.0 for Transformers.‖∇J(x) - ∇J(y)‖ ≤ L‖x - y‖. Bounds how fast the gradient can change; sets the largest safe step size.J(y) ≥ J(x) + ∇J(x)ᵀ(y-x) + μ/2‖y-x‖².∇J = 0 but the Hessian is indefinite. Problematic for second-order methods; escaped naturally by SGD's noise.Σ r_i².∇J = 0. Minima, maxima, and saddles all qualify.J(θ + αd) ≤ J(θ) + c·α·∇Jᵀd holds (c ≈ 10⁻⁴).g ← g·τ/‖g‖. Prevents a single bad batch from blowing up training.θ ← (1 − α·λ)θ − α·update. Equivalent to L2 penalty only for plain SGD.Use: linear-in-parameters + quadratic cost. Solve with: QR or SVD — never inv().
Defaults: λ = 0.97, P₀ = 10³·I. Watch for windup.
Max α: 2/L. Best α: 2/(L+μ). Rate: linear if strongly convex.
|B|: 32–256 typical. Need decay to actually converge.
β: 0.9 default, 0.99 for extreme valleys.
Defaults: α=10⁻³, β₁=0.9, β₂=0.999, ε=10⁻⁸. Use AdamW if you want weight decay.
Cost: O(n³)/iter. Rate: quadratic. Only for small n.
Hessian-free quasi-Newton with O(mn) memory. Best for: smooth, moderate-n, convex or mildly non-convex. Often 10–100× faster than SGD when applicable.
If you remember five things: