Optimization Methods — Interactive Reference

01 · Foundations

The Optimization Problem

Nearly every fitting, learning, and control problem can be written as:

The canonical form

$$\theta^\star = \arg\min_{\theta \in \mathbb{R}^n}\; J(\theta)$$

where θ is the parameter vector we are free to choose and J(θ) is the cost (or loss, or objective). The entire zoo of methods differs only in how they search for θ★.

Taxonomy

Closed-form

Analytical

Solve ∇J(θ) = 0 directly. Works when the problem is linear in parameters and the cost is quadratic. Examples: Least Squares, Wiener filter.

First-order

Gradient-based

Move in the direction of steepest descent. Scales to millions of parameters. Examples: GD, SGD, Adam, AdaGrad.

Second-order

Curvature-aware

Use the Hessian (or an approximation) for better step direction. Examples: Newton, Gauss–Newton, Levenberg–Marquardt, BFGS.

Why so many methods?

Three properties of the problem determine which algorithm you should reach for:

Property	What it means	Affects
Convexity	Does the cost have one global minimum or many local ones?	Whether first-order methods converge to the global optimum.
Conditioning	Ratio of largest to smallest curvature (the condition number `κ`).	Convergence speed. Ill-conditioned problems crawl.
Data access	All samples at once (batch) or one-at-a-time (streaming)?	Choice between LS / RLS, GD / SGD.
Dimensionality	How large is `n`?	Second-order methods cost `O(n³)`; first-order methods `O(n)`.
Noise	Is `∇J` exact or stochastic?	Need variance-reducing tricks (momentum, mini-batches).

Mental model

Optimization = descending a mountain blindfolded. You can feel the slope under your feet (gradient), sometimes you can feel the curvature (Hessian), and in special cases (quadratic bowls) you can solve for the bottom analytically.

02 · Closed-form

Least Squares

The oldest trick in the book (Gauss, 1795). Given data pairs (x_i, y_i), find parameters θ of a model ŷ = φ(x)ᵀ θ that minimize the sum of squared residuals:

Cost function

$$J_{\text{LS}}(\theta) = \tfrac{1}{2}\sum_{i=1}^N \bigl(y_i - \phi(x_i)^\top \theta\bigr)^2 = \tfrac{1}{2}\|y - X\theta\|_2^2$$

Stacking φ(x_i)ᵀ as rows of the regressor matrix X ∈ ℝ^{N×n}, setting ∇J = 0 gives the normal equations:

Normal equations · closed-form solution

$$X^\top X\, \theta = X^\top y \quad\Longrightarrow\quad \boxed{\hat\theta = (X^\top X)^{-1}X^\top y}$$

Interactive: Polynomial Fitting

Click on the plot to place noisy data points. Change the polynomial degree and watch the fit — and the condition number of XᵀX — change. Overfit with degree 10 to see what happens.

fitted curve data points residuals

Weighted & Regularized Variants

Weighted LS (WLS)

When samples have unequal reliability (e.g. varying sensor noise), weight them with W = diag(w_i):

$$\hat\theta = (X^\top W X)^{-1} X^\top W y$$

Rule of thumb: w_i = 1/σ_i² gives the maximum-likelihood estimate under Gaussian noise.

Ridge Regression (L2)

Add λ‖θ‖² to the cost. Shrinks parameters, stabilizes ill-conditioned XᵀX:

$$\hat\theta = (X^\top X + \lambda I)^{-1} X^\top y$$

Try it in the widget above — large λ makes the polynomial smoother.

Lasso (L1)

Penalty λ‖θ‖₁ drives coefficients to exactly zero → sparse solutions for feature selection. No closed form; solved via coordinate descent or ISTA.

Total LS (TLS)

Accounts for noise in both X and y (errors-in-variables). Solved via SVD — pick the right-singular vector of [X y] corresponding to the smallest singular value.

Numerical pitfall · do NOT invert XᵀX

Forming XᵀX squares the condition number. For ill-conditioned regressors, use QR decomposition (X = QR, solve Rθ = Qᵀy) or SVD. In Python: np.linalg.lstsq. In MATLAB: the backslash operator X\y.

Practitioner's checklist

✓ Always center and scale your features before fitting.
✓ Check cond(X) — if > 10⁸, you need ridge or fewer features.
✓ Quadratic cost + linear model → use LS. Don't iterate.
✓ For real-time estimation use RLS (next section), not re-fitting.

03 · Online Estimation

Recursive Least Squares

Batch LS requires all N samples upfront. In control, adaptive filtering, and online identification we instead want to update θ̂ as each new sample arrives — and optionally forget old data so we track time-varying parameters.

The Update Equations

Given a new sample (φ_k, y_k), forgetting factor λ ∈ (0,1], and previous estimate θ̂_{k-1} with covariance P_{k-1}:

RLS · three-line update

$$\begin{aligned} K_k &= \frac{P_{k-1}\,\phi_k}{\lambda + \phi_k^\top P_{k-1}\,\phi_k} &&\text{(gain vector)}\\ \hat\theta_k &= \hat\theta_{k-1} + K_k\bigl(y_k - \phi_k^\top \hat\theta_{k-1}\bigr) &&\text{(parameter update)}\\ P_k &= \tfrac{1}{\lambda}\bigl(P_{k-1} - K_k\,\phi_k^\top P_{k-1}\bigr) &&\text{(covariance update)} \end{aligned}$$

Start with P_0 = α I for large α (≈ 10³–10⁶) to express initial uncertainty, and θ̂_0 = 0 or a best guess.

Interactive: Tracking a Time-Varying System

The true parameter a(t) drifts over time. Batch LS is trapped by stale data; RLS with a forgetting factor can track the change. Play with λ: smaller = faster tracking but noisier.

true θ(t) RLS estimate batch LS P_k (trace, lower panel)

Tuning the forgetting factor

λ	Effective memory (≈ 1/(1-λ))	Behavior	Typical use
1.000	∞	Pure integrator — converges to batch LS	Stationary parameters
0.999	1000 samples	Very slow forgetting, low variance	Slowly drifting physical constants
0.99	100 samples	Default for most systems	General-purpose
0.95	20 samples	Fast tracking, noisier	Rapid disturbances
< 0.9	< 10	Covariance can explode (windup)	Use with caution + UD factorization

Covariance windup

If the regressor φ_k stops exciting all directions (e.g., constant input), P_k grows unboundedly along unexcited directions. When an excitation finally arrives, the estimate can jump wildly. Fixes: directional forgetting, bounded-gain RLS, or the exponentially-weighted covariance reset.

Persistent Excitation

For RLS to identify all parameters, the regressor must be persistently exciting: roughly, it must span ℝ^n over any finite window. In practice — inject a small dither signal or use naturally rich inputs (PRBS, swept sinusoids).

04 · First-order

Gradient Descent

When the cost is not quadratic in θ — or when XᵀX is too large to invert — solve iteratively by moving in the direction of steepest descent:

The gradient descent update

$$\theta_{k+1} = \theta_k - \alpha\,\nabla J(\theta_k)$$

The single hyperparameter is the learning rate (or step size) α. Too small → crawl. Too large → diverge. Just right → convergence, but at what rate?

Convergence Rates (essentials)

Problem class	Best α	Iterations to ε	Rate
Strongly convex, smooth	`2/(L+μ)`	`O(κ·log(1/ε))`	Linear
Convex, smooth	`1/L`	`O(1/ε)`	Sublinear
Non-convex, smooth	`1/L`	`O(1/ε²)` to ∇J≈0	Sublinear to stationarity

Here L is the Lipschitz constant of ∇J, μ is the strong-convexity constant, and κ = L/μ is the condition number. Ill-conditioned problems converge slowly because κ is large.

Interactive: GD on 2D Cost Surfaces

Watch how step size and curvature interact. The quadratic bowl with adjustable eccentricity directly shows the condition-number effect; Rosenbrock shows how narrow valleys trap vanilla GD.

Click the contour plot to set a new starting point

Cost surface

Condition κ10

Learning rate α0.050

Iterations80

final θ—
final J—
‖∇J‖—

The learning-rate sweet spot

For a quadratic with Hessian eigenvalues λ_max = L, λ_min = μ:

α < 2/L → convergence guaranteed
α = 2/(L+μ) → optimal rate
α > 2/L → divergence

In practice you estimate L via line search or backtracking (Armijo rule).

05 · Stochastic

Stochastic & Mini-batch Gradient Descent

Real-world costs are almost always a sum over data:

$$J(\theta) = \frac{1}{N}\sum_{i=1}^N J_i(\theta)$$

Computing the full gradient ∇J costs O(N) per step. With N in the millions, that is prohibitive. The remedy:

Batch

Full-batch GD

$$\nabla J = \tfrac{1}{N}\sum_{i}\nabla J_i$$

Exact gradient. Slow per step, smooth convergence. Used when N is tiny.

Stochastic

SGD

$$\hat\nabla J = \nabla J_{i_k},\; i_k \sim \mathcal U(1,N)$$

Unbiased but noisy gradient. N× faster per step — at the price of variance.

Mini-batch

Mini-batch SGD

$$\hat\nabla J = \tfrac{1}{|B|}\sum_{i\in B}\nabla J_i$$

The standard for deep learning. |B| = 32, 64, …, 512 balances noise vs throughput.

Interactive: Linear Regression — Full vs Mini-batch vs SGD

Same problem, same learning rate, three trajectories. Note how SGD converges in expectation but oscillates around the optimum; full-batch marches straight there but one step takes many times more FLOPs.

Learning-rate schedules

Constant α causes SGD to oscillate forever (variance of gradient estimate never vanishes). Decay the step size to reach the true optimum:

Schedule	Formula	Use case
Inverse time	`α_k = α₀ / (1 + kτ)`	Theoretical guarantees (Robbins–Monro)
Step decay	`α_k = α₀ · γ^⌊k/s⌋`	Classic "÷ 10 every 30 epochs"
Cosine	`α_k = α_min + ½(α₀ - α_min)(1+cos(πk/K))`	Modern deep learning default
Warmup + decay	Linear ramp → cosine	Transformers, large batches
OneCycle	Triangular up then down	Fast-AI, super-convergence

Robbins–Monro conditions

For SGD to converge almost-surely to a stationary point, the step sizes must satisfy:

$$\sum_k \alpha_k = \infty,\qquad \sum_k \alpha_k^2 < \infty$$

Example: α_k = 1/k qualifies. Constant α does not — it gives only a noise ball of radius O(√α).

06 · Modern Optimizers

Momentum, RMSProp, Adam

Vanilla GD ignores past gradients and has no sense of per-parameter scale. Three ideas — momentum, adaptive learning rates, and their combination — dominate deep learning.

Momentum (Polyak heavy ball)

Average recent gradients to build velocity through narrow valleys:

$$\begin{aligned} v_{k+1} &= \beta\, v_k + \nabla J(\theta_k)\\ \theta_{k+1} &= \theta_k - \alpha\, v_{k+1} \end{aligned}$$

Typical β = 0.9. This turns first-order descent into a damped harmonic oscillator — the step in each eigendirection is amplified by a factor 1/(1-β) when gradients are consistent, and damped when they oscillate.

Nesterov Accelerated Gradient (NAG)

Lookahead trick: evaluate the gradient at where you'd be after a momentum step.

$$\theta_{k+1} = \theta_k - \alpha \nabla J(\theta_k + \beta v_k) + \beta v_{k+1}$$

Gives a provably better O(1/k²) rate on smooth convex problems (vs O(1/k) for plain GD).

AdaGrad — per-parameter rescaling

$$\begin{aligned} G_k &= G_{k-1} + g_k \odot g_k\\ \theta_{k+1} &= \theta_k - \alpha\, g_k / (\sqrt{G_k} + \varepsilon) \end{aligned}$$

Parameters with historically large gradients get smaller steps; rare-feature parameters get larger steps. Excellent for sparse problems. Weakness: G_k only grows, so the effective learning rate decays to zero — deep nets stall.

RMSProp — fix AdaGrad's decay

$$\begin{aligned} G_k &= \gamma G_{k-1} + (1-\gamma)\, g_k \odot g_k\\ \theta_{k+1} &= \theta_k - \alpha\, g_k / (\sqrt{G_k} + \varepsilon) \end{aligned}$$

Use an EMA (γ ≈ 0.9) instead of a cumulative sum. Now adaptive-LR works even on non-stationary objectives.

Adam — momentum + RMSProp + bias correction

The workhorse of deep learning

$$\begin{aligned} m_k &= \beta_1 m_{k-1} + (1-\beta_1) g_k &\;&(\text{first moment})\\ v_k &= \beta_2 v_{k-1} + (1-\beta_2) g_k \odot g_k &\;&(\text{second moment})\\ \hat m_k &= m_k / (1 - \beta_1^k),\quad \hat v_k = v_k / (1-\beta_2^k) &\;&(\text{bias correction})\\ \theta_{k+1} &= \theta_k - \alpha\, \hat m_k / (\sqrt{\hat v_k} + \varepsilon) \end{aligned}$$

Defaults (Kingma & Ba, 2014): α = 10⁻³, β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸. These rarely need tuning.

Interactive: Racing the Optimizers

All five run simultaneously on the same cost surface with reasonable defaults. Try Rosenbrock — momentum methods escape the banana-shaped valley; vanilla GD does not.

GD Momentum Nesterov RMSProp Adam

AdamW — decoupled weight decay

Standard Adam with L2 regularization is subtly broken: the regularizer gets rescaled by √v̂, which makes it inconsistent. AdamW applies weight decay directly to the parameters:

$$\theta_{k+1} = \theta_k - \alpha\bigl(\hat m_k / (\sqrt{\hat v_k} + \varepsilon) + \lambda\, \theta_k\bigr)$$

AdamW is the modern default for Transformers and most deep-learning tasks.

07 · Second-order

Newton & Quasi-Newton Methods

First-order methods use only ∇J. Second-order methods additionally exploit curvature H = ∇²J, giving a dramatically better step direction.

Newton's Method

$$\theta_{k+1} = \theta_k - H_k^{-1}\nabla J(\theta_k)$$

Near the optimum, converges quadratically: the error squares each iteration. For a quadratic cost, Newton reaches the optimum in one step. The price is O(n³) per iteration to factor the Hessian — unaffordable for large n.

Gauss–Newton for NLS

When J(θ) = ½‖r(θ)‖² (nonlinear least squares), drop the second-derivative term of H:

$$H \approx J_r^\top J_r, \quad \theta_{k+1} = \theta_k - (J_r^\top J_r)^{-1} J_r^\top r$$

where J_r is the Jacobian of the residuals. Cheaper and often more stable than full Newton.

Levenberg–Marquardt

Adaptive blend of Gauss–Newton and gradient descent:

$$(J_r^\top J_r + \mu I)\,\Delta\theta = -J_r^\top r$$

μ small → Gauss–Newton (fast local convergence). μ large → gradient descent (robust). Trust-region algorithms adjust μ based on step quality. The go-to method for nonlinear LS in practice (used by every curve-fitting toolbox).

Quasi-Newton: BFGS & L-BFGS

Build an approximation B_k ≈ H_k from gradient history using rank-1 or rank-2 updates — avoiding Hessian computation entirely.

BFGS update (symmetric rank-2)

$$B_{k+1} = B_k + \frac{y_k y_k^\top}{y_k^\top s_k} - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k}$$

where $s_k = \theta_{k+1} - \theta_k$, $y_k = g_{k+1} - g_k$

L-BFGS (Limited-memory BFGS) stores only the last m vector pairs (m ≈ 5–20). Memory O(mn) instead of O(n²). For moderate-size convex problems, L-BFGS is often the best choice.

When is second-order worth it?

n < 10³ — Newton/BFGS practical, often superior.
n ~ 10³ to 10⁵ — L-BFGS shines, especially for smooth convex problems.
n > 10⁶ (deep learning) — Hessian too big; stick with Adam or use K-FAC/Shampoo (block-diagonal approximations).
Non-convex, noisy gradients → second-order gives no advantage and can be unstable.

08 · Practical

Rules of Thumb

Choosing an Optimizer — A Decision Tree

❯ Is the model linear in parameters with a quadratic cost?

YES → Use closed-form LS (or QR/SVD if ill-conditioned). Don't iterate.

❯ Is data streaming / are parameters time-varying?

YES → RLS with forgetting factor λ ∈ [0.95, 0.99].

❯ Is the problem nonlinear least squares?

YES → Levenberg–Marquardt. It's what lsqcurvefit and scipy.optimize.least_squares call.

❯ Is n moderate (< 10⁴) and cost smooth/convex?

YES → L-BFGS. Typically 10–100× fewer iterations than SGD.

❯ Is it deep learning / non-convex / large n?

Transformers, large models → AdamW (α=3e-4, wd=0.01)

CNNs, small models → SGD + momentum (α=0.1, β=0.9, cosine schedule)

Sparse features (NLP embeddings) → AdaGrad or Adam

Starting Learning Rates

Method	Default α	Sweet-spot range	Failure signal
SGD (no momentum)	0.01	10⁻³ – 10⁻¹	Loss oscillates → decay
SGD + momentum	0.1	10⁻² – 1.0	NaNs → cut by 10×
Adam / AdamW	10⁻³	10⁻⁴ – 5·10⁻³	Validation diverges → wd or smaller α
RMSProp	10⁻³	10⁻⁴ – 10⁻²	Same as Adam
L-BFGS	α=1.0	Line search picks	Rare; usually a bad gradient

Batch Size — What Actually Matters

Batch size	Noise	Throughput	Best LR (linear scaling)	Notes
1 (pure SGD)	Highest	Lowest	Smallest	Acts as implicit regularizer
32 – 256	Moderate	Good GPU utilization	1× – 4× base	Deep-learning default
512 – 8192	Low	Fastest wall-clock	Needs warmup	Large-scale training
Full batch	None	Per-sample fastest	Same as GD	Losses generalization benefits

Linear scaling rule (Goyal et al., 2017): when multiplying batch size by k, multiply LR by k — and add a warmup period.

Debugging Convergence

Loss is NaN / Inf

Cause: LR too high, gradient explosion, log(0), division by zero.
Fix: Gradient clipping (clip-by-norm, typically 1.0), reduce α by 10×, add small ε inside logs/sqrts, check for mixed precision overflow.

Loss decreases then plateaus high

Cause: LR too high (noise ball), saddle point, or model capacity too low.
Fix: LR decay, momentum ≥ 0.9, or bigger model. Check training vs validation gap.

Loss decreases too slowly

Cause: LR too low, vanishing gradients, or ill-conditioning.
Fix: LR warmup, batch norm / layer norm, residual connections, better init.

Train fine, validation terrible

Cause: Overfitting — not an optimizer problem.
Fix: Weight decay (AdamW), dropout, data augmentation, early stopping.

RLS covariance explodes

Cause: λ too small, loss of excitation, ill-conditioned regressor.
Fix: Directional forgetting, bound tr(P), UD factorization, raise λ toward 1.

LS gives wild coefficients

Cause: Multicollinearity, cond(X) huge.
Fix: Ridge (λI), drop correlated features, center + scale, use SVD pseudo-inverse.

Numerical Stability — Must-Know Tricks

Log-sum-exp: log Σ exp(x_i) = m + log Σ exp(x_i − m) with m = max(x_i). Avoids overflow.
Softmax: subtract the max before exponentiating.
Matrix inversion: prefer solve() over inv(). Prefer Cholesky for SPD matrices, QR/SVD otherwise.
Gradient clipping: g ← g · min(1, τ/‖g‖). Standard τ = 1.0 for RNNs, 5.0 for Transformers.
Denormal numbers: flush-to-zero on embedded / FPGA targets to avoid 100× slowdowns.
Feature scaling: not cosmetic — ill-scaled inputs make the Hessian ill-conditioned and cripple any first-order method.

09 · Reference

Glossary of Key Terms

Cost / Loss / Objective: The scalar function J(θ) to be minimized. "Loss" usually per-sample; "cost" usually averaged; "objective" most general.
Gradient ∇J: Vector of partial derivatives. Points in the direction of steepest ascent — so we step opposite to it.
Hessian ∇²J: Matrix of second partial derivatives. Encodes local curvature. Symmetric; positive-definite at a strict local minimum.
Jacobian: Matrix of partial derivatives for vector-valued functions. For residuals r(θ) ∈ ℝ^m, J_r ∈ ℝ^{m×n}.
Condition number κ: Ratio λ_max/λ_min of Hessian eigenvalues. Large κ → elongated contours → slow convergence for first-order methods.
Lipschitz constant L: Smallest L such that ‖∇J(x) - ∇J(y)‖ ≤ L‖x - y‖. Bounds how fast the gradient can change; sets the largest safe step size.
Strong convexity μ: The function grows at least as fast as a quadratic with curvature μ. J(y) ≥ J(x) + ∇J(x)ᵀ(y-x) + μ/2‖y-x‖².
Convex / Non-convex: Convex: any local min is global. Non-convex: local minima, saddles, plateaus. Deep nets are non-convex but empirically work.
Saddle point: Point where ∇J = 0 but the Hessian is indefinite. Problematic for second-order methods; escaped naturally by SGD's noise.
Learning rate α: Step size in first-order updates. Single most impactful hyperparameter.
Momentum β: EMA coefficient for past gradients. β = 0.9 ⇒ effective memory ~10 steps.
Forgetting factor λ: RLS exponential discount on past samples. Effective memory ≈ 1/(1-λ).
Regressor / Design matrix X: Rows are feature vectors φ(x_i)ᵀ. Appears in the LS/RLS normal equations.
Residual: Per-sample error r_i = y_i − ŷ_i. LS minimizes Σ r_i².
Regularization: Penalty added to the cost to bias toward simple solutions. L2 (ridge) shrinks; L1 (lasso) sparsifies; elastic-net combines both.
Bias–variance tradeoff: Simple models: high bias, low variance. Complex models: low bias, high variance. Regularization tunes the balance.
Persistent excitation: Input rich enough that the regressor spans all parameter directions over any finite window. Required for RLS to identify the full parameter vector.
Convergence rate: How quickly ‖θ_k − θ★‖ → 0. Linear: factor γ < 1 each step (e.g. GD on strongly convex). Quadratic: error squares each step (Newton). Sublinear: O(1/k) (GD on convex non-strong), O(1/√k) (SGD).
Stationary point: Where ∇J = 0. Minima, maxima, and saddles all qualify.
Epoch / Iteration: Epoch: one full pass over the dataset. Iteration: one parameter update (one mini-batch).
Batch size |B|: Number of samples averaged per gradient estimate. Trades gradient noise for throughput.
Backtracking line search: Given a direction, shrink the step size until Armijo condition J(θ + αd) ≤ J(θ) + c·α·∇Jᵀd holds (c ≈ 10⁻⁴).
Trust region: Constrain the step to ‖Δθ‖ ≤ Δ, where Δ adapts based on how well the model predicted the true decrease. Used in Levenberg–Marquardt.
Gradient clipping: If ‖g‖ > τ, rescale g ← g·τ/‖g‖. Prevents a single bad batch from blowing up training.
Weight decay: In decoupled form (AdamW): θ ← (1 − α·λ)θ − α·update. Equivalent to L2 penalty only for plain SGD.
Bias correction (Adam): Divide m_k, v_k by 1 − β^k so the early-step estimates aren't biased toward zero initialization.
Warmup: Linearly ramp LR from 0 to target over the first few hundred steps. Stabilizes training with large batches.

10 · TL;DR

One-page Cheat Sheet

LS · batch closed-form

$$\hat\theta = (X^\top X + \lambda I)^{-1}X^\top y$$

Use: linear-in-parameters + quadratic cost. Solve with: QR or SVD — never inv().

RLS · recursive update

$$K = \tfrac{P\phi}{\lambda+\phi^\top P\phi},\; \theta \mathrel{+}= K(y-\phi^\top\theta),\; P = \tfrac{P-K\phi^\top P}{\lambda}$$

Defaults: λ = 0.97, P₀ = 10³·I. Watch for windup.

GD

$$\theta \mathrel{-}= \alpha\, \nabla J$$

Max α: 2/L. Best α: 2/(L+μ). Rate: linear if strongly convex.

SGD (mini-batch)

$$\theta \mathrel{-}= \alpha_k\, \hat g,\quad \alpha_k \propto 1/\sqrt{k}$$

|B|: 32–256 typical. Need decay to actually converge.

Momentum

$$v \gets \beta v + g;\quad \theta \mathrel{-}= \alpha v$$

β: 0.9 default, 0.99 for extreme valleys.

Adam

$$\theta \mathrel{-}= \alpha \hat m / (\sqrt{\hat v} + \varepsilon)$$

Defaults: α=10⁻³, β₁=0.9, β₂=0.999, ε=10⁻⁸. Use AdamW if you want weight decay.

Newton

$$\theta \mathrel{-}= H^{-1}\nabla J$$

Cost: O(n³)/iter. Rate: quadratic. Only for small n.

L-BFGS

Hessian-free quasi-Newton with O(mn) memory. Best for: smooth, moderate-n, convex or mildly non-convex. Often 10–100× faster than SGD when applicable.

The 80/20 of optimization

If you remember five things:

Scale your features. Nothing else matters if the Hessian is pathological.
Tune the learning rate first — by far the most impactful knob.
Use momentum. Free lunch, no downside.
Decay the LR. A constant α cannot reach the optimum in SGD.
Match the method to the problem. LS for quadratic; Adam for deep nets; LM for nonlinear LS; RLS for online ID.