← Back to Autonomy
Interactive Reference · v1.0

Optimization Methods
for Engineers and Practitioners

An interactive tour of the algorithms that fit models and minimize loss — from closed-form Least Squares to adaptive gradient methods. Every visualization runs real math in your browser: no mocked trajectories, no pre-baked curves.

By Majid Mazouchi

The Optimization Problem

Nearly every fitting, learning, and control problem can be written as:

The canonical form
$$\theta^\star = \arg\min_{\theta \in \mathbb{R}^n}\; J(\theta)$$

where θ is the parameter vector we are free to choose and J(θ) is the cost (or loss, or objective). The entire zoo of methods differs only in how they search for θ★.

Taxonomy

Closed-form

Analytical

Solve ∇J(θ) = 0 directly. Works when the problem is linear in parameters and the cost is quadratic. Examples: Least Squares, Wiener filter.

First-order

Gradient-based

Move in the direction of steepest descent. Scales to millions of parameters. Examples: GD, SGD, Adam, AdaGrad.

Second-order

Curvature-aware

Use the Hessian (or an approximation) for better step direction. Examples: Newton, Gauss–Newton, Levenberg–Marquardt, BFGS.

Why so many methods?

Three properties of the problem determine which algorithm you should reach for:

PropertyWhat it meansAffects
ConvexityDoes the cost have one global minimum or many local ones?Whether first-order methods converge to the global optimum.
ConditioningRatio of largest to smallest curvature (the condition number κ).Convergence speed. Ill-conditioned problems crawl.
Data accessAll samples at once (batch) or one-at-a-time (streaming)?Choice between LS / RLS, GD / SGD.
DimensionalityHow large is n?Second-order methods cost O(n³); first-order methods O(n).
NoiseIs ∇J exact or stochastic?Need variance-reducing tricks (momentum, mini-batches).

Mental model

Optimization = descending a mountain blindfolded. You can feel the slope under your feet (gradient), sometimes you can feel the curvature (Hessian), and in special cases (quadratic bowls) you can solve for the bottom analytically.

Least Squares

The oldest trick in the book (Gauss, 1795). Given data pairs (x_i, y_i), find parameters θ of a model ŷ = φ(x)ᵀ θ that minimize the sum of squared residuals:

Cost function
$$J_{\text{LS}}(\theta) = \tfrac{1}{2}\sum_{i=1}^N \bigl(y_i - \phi(x_i)^\top \theta\bigr)^2 = \tfrac{1}{2}\|y - X\theta\|_2^2$$

Stacking φ(x_i)ᵀ as rows of the regressor matrix X ∈ ℝ^{N×n}, setting ∇J = 0 gives the normal equations:

Normal equations · closed-form solution
$$X^\top X\, \theta = X^\top y \quad\Longrightarrow\quad \boxed{\hat\theta = (X^\top X)^{-1}X^\top y}$$

Interactive: Polynomial Fitting

Click on the plot to place noisy data points. Change the polynomial degree and watch the fit — and the condition number of XᵀX — change. Overfit with degree 10 to see what happens.

LS · Polynomial RegressionN = 0 points
fitted curve data points residuals
Polynomial degree3
Ridge λ (regularization)0.00
Noise σ0.15
RSS

cond(XᵀX)
‖θ‖₂

Weighted & Regularized Variants

Weighted LS (WLS)

When samples have unequal reliability (e.g. varying sensor noise), weight them with W = diag(w_i):

$$\hat\theta = (X^\top W X)^{-1} X^\top W y$$

Rule of thumb: w_i = 1/σ_i² gives the maximum-likelihood estimate under Gaussian noise.

Ridge Regression (L2)

Add λ‖θ‖² to the cost. Shrinks parameters, stabilizes ill-conditioned XᵀX:

$$\hat\theta = (X^\top X + \lambda I)^{-1} X^\top y$$

Try it in the widget above — large λ makes the polynomial smoother.

Lasso (L1)

Penalty λ‖θ‖₁ drives coefficients to exactly zero → sparse solutions for feature selection. No closed form; solved via coordinate descent or ISTA.

Total LS (TLS)

Accounts for noise in both X and y (errors-in-variables). Solved via SVD — pick the right-singular vector of [X y] corresponding to the smallest singular value.

Numerical pitfall · do NOT invert XᵀX

Forming XᵀX squares the condition number. For ill-conditioned regressors, use QR decomposition (X = QR, solve Rθ = Qᵀy) or SVD. In Python: np.linalg.lstsq. In MATLAB: the backslash operator X\y.

Practitioner's checklist

Always center and scale your features before fitting.
✓ Check cond(X) — if > 10⁸, you need ridge or fewer features.
✓ Quadratic cost + linear model → use LS. Don't iterate.
✓ For real-time estimation use RLS (next section), not re-fitting.

Recursive Least Squares

Batch LS requires all N samples upfront. In control, adaptive filtering, and online identification we instead want to update θ̂ as each new sample arrives — and optionally forget old data so we track time-varying parameters.

The Update Equations

Given a new sample (φ_k, y_k), forgetting factor λ ∈ (0,1], and previous estimate θ̂_{k-1} with covariance P_{k-1}:

RLS · three-line update
$$\begin{aligned} K_k &= \frac{P_{k-1}\,\phi_k}{\lambda + \phi_k^\top P_{k-1}\,\phi_k} &&\text{(gain vector)}\\ \hat\theta_k &= \hat\theta_{k-1} + K_k\bigl(y_k - \phi_k^\top \hat\theta_{k-1}\bigr) &&\text{(parameter update)}\\ P_k &= \tfrac{1}{\lambda}\bigl(P_{k-1} - K_k\,\phi_k^\top P_{k-1}\bigr) &&\text{(covariance update)} \end{aligned}$$

Start with P_0 = α I for large α (≈ 10³–10⁶) to express initial uncertainty, and θ̂_0 = 0 or a best guess.

Interactive: Tracking a Time-Varying System

The true parameter a(t) drifts over time. Batch LS is trapped by stale data; RLS with a forgetting factor can track the change. Play with λ: smaller = faster tracking but noisier.

RLS · Online Parameter Trackingt = 0
true θ(t) RLS estimate batch LS P_k (trace, lower panel)
Forgetting factor λ0.97
Initial P₀ = α·I   (α)1000
Measurement noise σ0.1
Drift profile
θ̂
err
tr(P)

Tuning the forgetting factor

λEffective memory (≈ 1/(1-λ))BehaviorTypical use
1.000Pure integrator — converges to batch LSStationary parameters
0.9991000 samplesVery slow forgetting, low varianceSlowly drifting physical constants
0.99100 samplesDefault for most systemsGeneral-purpose
0.9520 samplesFast tracking, noisierRapid disturbances
< 0.9< 10Covariance can explode (windup)Use with caution + UD factorization

Covariance windup

If the regressor φ_k stops exciting all directions (e.g., constant input), P_k grows unboundedly along unexcited directions. When an excitation finally arrives, the estimate can jump wildly. Fixes: directional forgetting, bounded-gain RLS, or the exponentially-weighted covariance reset.

Persistent Excitation

For RLS to identify all parameters, the regressor must be persistently exciting: roughly, it must span ℝ^n over any finite window. In practice — inject a small dither signal or use naturally rich inputs (PRBS, swept sinusoids).

Gradient Descent

When the cost is not quadratic in θ — or when XᵀX is too large to invert — solve iteratively by moving in the direction of steepest descent:

The gradient descent update
$$\theta_{k+1} = \theta_k - \alpha\,\nabla J(\theta_k)$$

The single hyperparameter is the learning rate (or step size) α. Too small → crawl. Too large → diverge. Just right → convergence, but at what rate?

Convergence Rates (essentials)

Problem classBest αIterations to εRate
Strongly convex, smooth2/(L+μ)O(κ·log(1/ε))Linear
Convex, smooth1/LO(1/ε)Sublinear
Non-convex, smooth1/LO(1/ε²) to ∇J≈0Sublinear to stationarity

Here L is the Lipschitz constant of ∇J, μ is the strong-convexity constant, and κ = L/μ is the condition number. Ill-conditioned problems converge slowly because κ is large.

Interactive: GD on 2D Cost Surfaces

Watch how step size and curvature interact. The quadratic bowl with adjustable eccentricity directly shows the condition-number effect; Rosenbrock shows how narrow valleys trap vanilla GD.

GD · 2D Trajectory Exploreriter = 0
Click the contour plot to set a new starting point
Cost surface
Condition κ10
Learning rate α0.050
Iterations80
final θ
final J
‖∇J‖

The learning-rate sweet spot

For a quadratic with Hessian eigenvalues λ_max = L, λ_min = μ:

  • α < 2/L → convergence guaranteed
  • α = 2/(L+μ) → optimal rate
  • α > 2/L → divergence

In practice you estimate L via line search or backtracking (Armijo rule).

Stochastic & Mini-batch Gradient Descent

Real-world costs are almost always a sum over data:

$$J(\theta) = \frac{1}{N}\sum_{i=1}^N J_i(\theta)$$

Computing the full gradient ∇J costs O(N) per step. With N in the millions, that is prohibitive. The remedy:

Batch

Full-batch GD

$$\nabla J = \tfrac{1}{N}\sum_{i}\nabla J_i$$

Exact gradient. Slow per step, smooth convergence. Used when N is tiny.

Stochastic

SGD

$$\hat\nabla J = \nabla J_{i_k},\; i_k \sim \mathcal U(1,N)$$

Unbiased but noisy gradient. N× faster per step — at the price of variance.

Mini-batch

Mini-batch SGD

$$\hat\nabla J = \tfrac{1}{|B|}\sum_{i\in B}\nabla J_i$$

The standard for deep learning. |B| = 32, 64, …, 512 balances noise vs throughput.

Interactive: Linear Regression — Full vs Mini-batch vs SGD

Same problem, same learning rate, three trajectories. Note how SGD converges in expectation but oscillates around the optimum; full-batch marches straight there but one step takes many times more FLOPs.

SGD · Batch-size Comparisonepoch = 0
full batch mini-batch SGD (|B|=1) optimum
Data size N200
Mini-batch size16
Learning rate α0.010
Epochs30

Learning-rate schedules

Constant α causes SGD to oscillate forever (variance of gradient estimate never vanishes). Decay the step size to reach the true optimum:

ScheduleFormulaUse case
Inverse timeα_k = α₀ / (1 + kτ)Theoretical guarantees (Robbins–Monro)
Step decayα_k = α₀ · γ^⌊k/s⌋Classic "÷ 10 every 30 epochs"
Cosineα_k = α_min + ½(α₀ - α_min)(1+cos(πk/K))Modern deep learning default
Warmup + decayLinear ramp → cosineTransformers, large batches
OneCycleTriangular up then downFast-AI, super-convergence

Robbins–Monro conditions

For SGD to converge almost-surely to a stationary point, the step sizes must satisfy:

$$\sum_k \alpha_k = \infty,\qquad \sum_k \alpha_k^2 < \infty$$

Example: α_k = 1/k qualifies. Constant α does not — it gives only a noise ball of radius O(√α).

Momentum, RMSProp, Adam

Vanilla GD ignores past gradients and has no sense of per-parameter scale. Three ideas — momentum, adaptive learning rates, and their combination — dominate deep learning.

Momentum (Polyak heavy ball)

Average recent gradients to build velocity through narrow valleys:

$$\begin{aligned} v_{k+1} &= \beta\, v_k + \nabla J(\theta_k)\\ \theta_{k+1} &= \theta_k - \alpha\, v_{k+1} \end{aligned}$$

Typical β = 0.9. This turns first-order descent into a damped harmonic oscillator — the step in each eigendirection is amplified by a factor 1/(1-β) when gradients are consistent, and damped when they oscillate.

Nesterov Accelerated Gradient (NAG)

Lookahead trick: evaluate the gradient at where you'd be after a momentum step.

$$\theta_{k+1} = \theta_k - \alpha \nabla J(\theta_k + \beta v_k) + \beta v_{k+1}$$

Gives a provably better O(1/k²) rate on smooth convex problems (vs O(1/k) for plain GD).

AdaGrad — per-parameter rescaling

$$\begin{aligned} G_k &= G_{k-1} + g_k \odot g_k\\ \theta_{k+1} &= \theta_k - \alpha\, g_k / (\sqrt{G_k} + \varepsilon) \end{aligned}$$

Parameters with historically large gradients get smaller steps; rare-feature parameters get larger steps. Excellent for sparse problems. Weakness: G_k only grows, so the effective learning rate decays to zero — deep nets stall.

RMSProp — fix AdaGrad's decay

$$\begin{aligned} G_k &= \gamma G_{k-1} + (1-\gamma)\, g_k \odot g_k\\ \theta_{k+1} &= \theta_k - \alpha\, g_k / (\sqrt{G_k} + \varepsilon) \end{aligned}$$

Use an EMA (γ ≈ 0.9) instead of a cumulative sum. Now adaptive-LR works even on non-stationary objectives.

Adam — momentum + RMSProp + bias correction

The workhorse of deep learning
$$\begin{aligned} m_k &= \beta_1 m_{k-1} + (1-\beta_1) g_k &\;&(\text{first moment})\\ v_k &= \beta_2 v_{k-1} + (1-\beta_2) g_k \odot g_k &\;&(\text{second moment})\\ \hat m_k &= m_k / (1 - \beta_1^k),\quad \hat v_k = v_k / (1-\beta_2^k) &\;&(\text{bias correction})\\ \theta_{k+1} &= \theta_k - \alpha\, \hat m_k / (\sqrt{\hat v_k} + \varepsilon) \end{aligned}$$

Defaults (Kingma & Ba, 2014): α = 10⁻³, β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸. These rarely need tuning.

Interactive: Racing the Optimizers

All five run simultaneously on the same cost surface with reasonable defaults. Try Rosenbrock — momentum methods escape the banana-shaped valley; vanilla GD does not.

Optimizer Raceiter = 0
GD Momentum Nesterov RMSProp Adam
Cost surface
Learning rate α0.010
Iterations200

AdamW — decoupled weight decay

Standard Adam with L2 regularization is subtly broken: the regularizer gets rescaled by √v̂, which makes it inconsistent. AdamW applies weight decay directly to the parameters:

$$\theta_{k+1} = \theta_k - \alpha\bigl(\hat m_k / (\sqrt{\hat v_k} + \varepsilon) + \lambda\, \theta_k\bigr)$$

AdamW is the modern default for Transformers and most deep-learning tasks.

Newton & Quasi-Newton Methods

First-order methods use only ∇J. Second-order methods additionally exploit curvature H = ∇²J, giving a dramatically better step direction.

Newton's Method

$$\theta_{k+1} = \theta_k - H_k^{-1}\nabla J(\theta_k)$$

Near the optimum, converges quadratically: the error squares each iteration. For a quadratic cost, Newton reaches the optimum in one step. The price is O(n³) per iteration to factor the Hessian — unaffordable for large n.

Gauss–Newton for NLS

When J(θ) = ½‖r(θ)‖² (nonlinear least squares), drop the second-derivative term of H:

$$H \approx J_r^\top J_r, \quad \theta_{k+1} = \theta_k - (J_r^\top J_r)^{-1} J_r^\top r$$

where J_r is the Jacobian of the residuals. Cheaper and often more stable than full Newton.

Levenberg–Marquardt

Adaptive blend of Gauss–Newton and gradient descent:

$$(J_r^\top J_r + \mu I)\,\Delta\theta = -J_r^\top r$$

μ small → Gauss–Newton (fast local convergence). μ large → gradient descent (robust). Trust-region algorithms adjust μ based on step quality. The go-to method for nonlinear LS in practice (used by every curve-fitting toolbox).

Quasi-Newton: BFGS & L-BFGS

Build an approximation B_k ≈ H_k from gradient history using rank-1 or rank-2 updates — avoiding Hessian computation entirely.

BFGS update (symmetric rank-2)
$$B_{k+1} = B_k + \frac{y_k y_k^\top}{y_k^\top s_k} - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k}$$
where $s_k = \theta_{k+1} - \theta_k$, $y_k = g_{k+1} - g_k$

L-BFGS (Limited-memory BFGS) stores only the last m vector pairs (m ≈ 5–20). Memory O(mn) instead of O(n²). For moderate-size convex problems, L-BFGS is often the best choice.

When is second-order worth it?

  • n < 10³ — Newton/BFGS practical, often superior.
  • n ~ 10³ to 10⁵ — L-BFGS shines, especially for smooth convex problems.
  • n > 10⁶ (deep learning) — Hessian too big; stick with Adam or use K-FAC/Shampoo (block-diagonal approximations).
  • Non-convex, noisy gradients → second-order gives no advantage and can be unstable.

Rules of Thumb

Choosing an Optimizer — A Decision Tree

❯ Is the model linear in parameters with a quadratic cost?
YES → Use closed-form LS (or QR/SVD if ill-conditioned). Don't iterate.
❯ Is data streaming / are parameters time-varying?
YES → RLS with forgetting factor λ ∈ [0.95, 0.99].
❯ Is the problem nonlinear least squares?
YES → Levenberg–Marquardt. It's what lsqcurvefit and scipy.optimize.least_squares call.
❯ Is n moderate (< 10⁴) and cost smooth/convex?
YES → L-BFGS. Typically 10–100× fewer iterations than SGD.
❯ Is it deep learning / non-convex / large n?
Transformers, large models → AdamW (α=3e-4, wd=0.01)
CNNs, small models → SGD + momentum (α=0.1, β=0.9, cosine schedule)
Sparse features (NLP embeddings) → AdaGrad or Adam

Starting Learning Rates

MethodDefault αSweet-spot rangeFailure signal
SGD (no momentum)0.0110⁻³ – 10⁻¹Loss oscillates → decay
SGD + momentum0.110⁻² – 1.0NaNs → cut by 10×
Adam / AdamW10⁻³10⁻⁴ – 5·10⁻³Validation diverges → wd or smaller α
RMSProp10⁻³10⁻⁴ – 10⁻²Same as Adam
L-BFGSα=1.0Line search picksRare; usually a bad gradient

Batch Size — What Actually Matters

Batch sizeNoiseThroughputBest LR (linear scaling)Notes
1 (pure SGD)HighestLowestSmallestActs as implicit regularizer
32 – 256ModerateGood GPU utilization1× – 4× baseDeep-learning default
512 – 8192LowFastest wall-clockNeeds warmupLarge-scale training
Full batchNonePer-sample fastestSame as GDLosses generalization benefits

Linear scaling rule (Goyal et al., 2017): when multiplying batch size by k, multiply LR by k — and add a warmup period.

Debugging Convergence

Loss is NaN / Inf

Cause: LR too high, gradient explosion, log(0), division by zero.
Fix: Gradient clipping (clip-by-norm, typically 1.0), reduce α by 10×, add small ε inside logs/sqrts, check for mixed precision overflow.

Loss decreases then plateaus high

Cause: LR too high (noise ball), saddle point, or model capacity too low.
Fix: LR decay, momentum ≥ 0.9, or bigger model. Check training vs validation gap.

Loss decreases too slowly

Cause: LR too low, vanishing gradients, or ill-conditioning.
Fix: LR warmup, batch norm / layer norm, residual connections, better init.

Train fine, validation terrible

Cause: Overfitting — not an optimizer problem.
Fix: Weight decay (AdamW), dropout, data augmentation, early stopping.

RLS covariance explodes

Cause: λ too small, loss of excitation, ill-conditioned regressor.
Fix: Directional forgetting, bound tr(P), UD factorization, raise λ toward 1.

LS gives wild coefficients

Cause: Multicollinearity, cond(X) huge.
Fix: Ridge (λI), drop correlated features, center + scale, use SVD pseudo-inverse.

Numerical Stability — Must-Know Tricks

Glossary of Key Terms

Cost / Loss / Objective
The scalar function J(θ) to be minimized. "Loss" usually per-sample; "cost" usually averaged; "objective" most general.
Gradient ∇J
Vector of partial derivatives. Points in the direction of steepest ascent — so we step opposite to it.
Hessian ∇²J
Matrix of second partial derivatives. Encodes local curvature. Symmetric; positive-definite at a strict local minimum.
Jacobian
Matrix of partial derivatives for vector-valued functions. For residuals r(θ) ∈ ℝ^m, J_r ∈ ℝ^{m×n}.
Condition number κ
Ratio λ_max/λ_min of Hessian eigenvalues. Large κ → elongated contours → slow convergence for first-order methods.
Lipschitz constant L
Smallest L such that ‖∇J(x) - ∇J(y)‖ ≤ L‖x - y‖. Bounds how fast the gradient can change; sets the largest safe step size.
Strong convexity μ
The function grows at least as fast as a quadratic with curvature μ. J(y) ≥ J(x) + ∇J(x)ᵀ(y-x) + μ/2‖y-x‖².
Convex / Non-convex
Convex: any local min is global. Non-convex: local minima, saddles, plateaus. Deep nets are non-convex but empirically work.
Saddle point
Point where ∇J = 0 but the Hessian is indefinite. Problematic for second-order methods; escaped naturally by SGD's noise.
Learning rate α
Step size in first-order updates. Single most impactful hyperparameter.
Momentum β
EMA coefficient for past gradients. β = 0.9 ⇒ effective memory ~10 steps.
Forgetting factor λ
RLS exponential discount on past samples. Effective memory ≈ 1/(1-λ).
Regressor / Design matrix X
Rows are feature vectors φ(x_i)ᵀ. Appears in the LS/RLS normal equations.
Residual
Per-sample error r_i = y_i − ŷ_i. LS minimizes Σ r_i².
Regularization
Penalty added to the cost to bias toward simple solutions. L2 (ridge) shrinks; L1 (lasso) sparsifies; elastic-net combines both.
Bias–variance tradeoff
Simple models: high bias, low variance. Complex models: low bias, high variance. Regularization tunes the balance.
Persistent excitation
Input rich enough that the regressor spans all parameter directions over any finite window. Required for RLS to identify the full parameter vector.
Convergence rate
How quickly ‖θ_k − θ★‖ → 0. Linear: factor γ < 1 each step (e.g. GD on strongly convex). Quadratic: error squares each step (Newton). Sublinear: O(1/k) (GD on convex non-strong), O(1/√k) (SGD).
Stationary point
Where ∇J = 0. Minima, maxima, and saddles all qualify.
Epoch / Iteration
Epoch: one full pass over the dataset. Iteration: one parameter update (one mini-batch).
Batch size |B|
Number of samples averaged per gradient estimate. Trades gradient noise for throughput.
Backtracking line search
Given a direction, shrink the step size until Armijo condition J(θ + αd) ≤ J(θ) + c·α·∇Jᵀd holds (c ≈ 10⁻⁴).
Trust region
Constrain the step to ‖Δθ‖ ≤ Δ, where Δ adapts based on how well the model predicted the true decrease. Used in Levenberg–Marquardt.
Gradient clipping
If ‖g‖ > τ, rescale g ← g·τ/‖g‖. Prevents a single bad batch from blowing up training.
Weight decay
In decoupled form (AdamW): θ ← (1 − α·λ)θ − α·update. Equivalent to L2 penalty only for plain SGD.
Bias correction (Adam)
Divide m_k, v_k by 1 − β^k so the early-step estimates aren't biased toward zero initialization.
Warmup
Linearly ramp LR from 0 to target over the first few hundred steps. Stabilizes training with large batches.

One-page Cheat Sheet

LS · batch closed-form

$$\hat\theta = (X^\top X + \lambda I)^{-1}X^\top y$$

Use: linear-in-parameters + quadratic cost. Solve with: QR or SVD — never inv().

RLS · recursive update

$$K = \tfrac{P\phi}{\lambda+\phi^\top P\phi},\; \theta \mathrel{+}= K(y-\phi^\top\theta),\; P = \tfrac{P-K\phi^\top P}{\lambda}$$

Defaults: λ = 0.97, P₀ = 10³·I. Watch for windup.

GD

$$\theta \mathrel{-}= \alpha\, \nabla J$$

Max α: 2/L. Best α: 2/(L+μ). Rate: linear if strongly convex.

SGD (mini-batch)

$$\theta \mathrel{-}= \alpha_k\, \hat g,\quad \alpha_k \propto 1/\sqrt{k}$$

|B|: 32–256 typical. Need decay to actually converge.

Momentum

$$v \gets \beta v + g;\quad \theta \mathrel{-}= \alpha v$$

β: 0.9 default, 0.99 for extreme valleys.

Adam

$$\theta \mathrel{-}= \alpha \hat m / (\sqrt{\hat v} + \varepsilon)$$

Defaults: α=10⁻³, β₁=0.9, β₂=0.999, ε=10⁻⁸. Use AdamW if you want weight decay.

Newton

$$\theta \mathrel{-}= H^{-1}\nabla J$$

Cost: O(n³)/iter. Rate: quadratic. Only for small n.

L-BFGS

Hessian-free quasi-Newton with O(mn) memory. Best for: smooth, moderate-n, convex or mildly non-convex. Often 10–100× faster than SGD when applicable.

The 80/20 of optimization

If you remember five things:

  1. Scale your features. Nothing else matters if the Hessian is pathological.
  2. Tune the learning rate first — by far the most impactful knob.
  3. Use momentum. Free lunch, no downside.
  4. Decay the LR. A constant α cannot reach the optimum in SGD.
  5. Match the method to the problem. LS for quadratic; Adam for deep nets; LM for nonlinear LS; RLS for online ID.