Neural Networks — Deep Dive Reference

By Majid Mazouchi

Companion to the interactive neural networks page. Each term below goes deeper: intuition, math, key variants, practical details, and the common pitfalls that go with them.

01 · Train / Validation / Test Splits

Three disjoint subsets of your data, each with a strictly different role. The core experimental protocol of supervised ML — get this wrong and nothing downstream is trustworthy.

The three roles

Training set

Used to fit weights via gradient descent. The model sees these labels directly.

Validation set

Used to tune hyperparameters (LR, depth, regularization, early-stop point, architecture choice). The model never trains on it, but you do make decisions based on it — so it gets "contaminated" by your choices over time.

Test set

Used once, at the very end, to produce the honest reported number. If you look at it and change anything, it's no longer a test set — it's another validation set.

Typical ratios

For medium datasets (10³–10⁵ samples): 70/15/15 or 80/10/10 are standard. For large datasets (>10⁶), the val and test sets can be proportionally tiny — 98/1/1 is fine because 10,000 test samples already give very tight confidence intervals. For small datasets, use k-fold cross-validation instead of a single held-out set.
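As a rough worked example of why 10,000 test samples is enough: assuming test errors are independent and using the normal approximation to the binomial (both are assumptions made here for illustration), the 95% half-width of a measured accuracy p on n samples is about 1.96·sqrt(p(1 − p)/n). The helper name `accuracy_ci_halfwidth` is hypothetical.

```python
import math

def accuracy_ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width for an accuracy p measured on n samples."""
    return z * math.sqrt(p * (1.0 - p) / n)

# How the uncertainty on a 90% accuracy shrinks as the test set grows.
for n in (100, 1_000, 10_000):
    print(n, round(accuracy_ci_halfwidth(0.90, n), 4))
```

Under these assumptions a 90% accuracy measured on 10,000 samples is uncertain by roughly ±0.6 percentage points, while on 100 samples it is closer to ±6.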

Stratified splitting

For classification with imbalanced classes, random splitting can put all minority-class samples in the training set by chance. Passing stratify=y to sklearn.model_selection.train_test_split preserves class proportions across splits. Do this by default for any classification task.

Cross-validation

Instead of one fixed split, rotate through k folds:

k-fold CV — split into k equal chunks, train on k-1, validate on the remaining one, repeat k times, average results. k=5 and k=10 are typical.
Stratified k-fold — same, but preserves class balance per fold.
Time-series CV — expanding window or rolling window. Never shuffle time-series data across splits — it leaks the future into the past.
Group k-fold — when samples share an identity (same patient, same vehicle, same operating run), put all samples from one group entirely in one fold.
Nested CV — outer loop for honest test scoring, inner loop for hyperparameter tuning. The gold standard for small datasets but computationally expensive.
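The group k-fold idea from the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `group_kfold`, the greedy balancing rule, and the vehicle labels are all assumptions made for the example.

```python
from collections import defaultdict

def group_kfold(groups, k):
    """Yield (train_idx, val_idx) pairs where no group appears on both sides."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    # Greedy balancing: assign the largest groups first to the currently smallest fold.
    folds = [[] for _ in range(k)]
    for g in sorted(by_group, key=lambda g: -len(by_group[g])):
        min(folds, key=len).extend(by_group[g])
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train, val

# Hypothetical data: 10 samples recorded from 4 vehicles.
vehicle = ["A", "A", "A", "B", "B", "C", "C", "C", "D", "D"]
splits = list(group_kfold(vehicle, k=2))
```

In practice scikit-learn's GroupKFold covers the same need; the sketch only shows that the split is made over group identities rather than over individual samples.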

🚨 Leakage — the #1 silent killer

Feature leakage — including a feature that's a proxy for the label (e.g., a post-hoc status field). Trivially inflates metrics.
Temporal leakage — training on data from the future. Common in time-series: fitting a scaler on the full dataset leaks future statistics.
Duplicate leakage — near-duplicate samples in both train and test (same image under different file names, same waveform with a different timestamp).
Subject leakage — same person / vehicle / sensor appears in both train and test. The model learns the subject, not the task.
Preprocessing leakage — computing the mean/std on the whole dataset then splitting. The test-set statistics just leaked into training.
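Preprocessing leakage is easiest to see with numbers. The sketch below uses made-up values and hypothetical helper names (`fit_scaler`, `apply_scaler`): the scaler statistics come from the training split only and are then reused, unchanged, on the test split.

```python
import statistics

def fit_scaler(train_values):
    """Return (mean, std) computed from training data only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def apply_scaler(values, mean, std):
    """Z-score values using statistics that were fitted elsewhere."""
    return [(v - mean) / std for v in values]

# Hypothetical drifting signal: the later (test) samples sit at a higher level.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [9.0, 10.0, 11.0]

mu, sigma = fit_scaler(train)            # correct: statistics from train only
train_z = apply_scaler(train, mu, sigma)
test_z = apply_scaler(test, mu, sigma)   # test reuses the train statistics

leaky_mu, _ = fit_scaler(train + test)   # wrong: test data shifts the mean
```

With the leaky fit the mean moves from 3.0 to 5.625, so the training inputs themselves change depending on what the test set contains.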

For motor control / automotive contexts

Samples from the same operating condition are strongly correlated, so a random shuffle spreads near-copies of each condition across every split. If your dataset contains 50 driving cycles, don't shuffle samples; split by cycle. If the data spans multiple vehicles or temperatures, stratify by those regimes, or evaluate held-out-regime generalization explicitly. A model that scores 99% on a random split can collapse to 60% on an out-of-regime split, and that gap is the real story.

02 · Signal Preprocessing

Garbage in, garbage out. Every real-world signal needs cleaning before a model sees it. This is the stage where most projects silently succeed or fail.

The canonical pipeline

In order, for raw sensor data:

Anti-alias filter — before any downsampling. Otherwise high-frequency content aliases down into your band of interest.
DC removal / detrending — subtract the mean, a linear fit, or a slow-moving baseline (high-pass filter with very low cutoff).
Outlier detection and handling — z-score, IQR, Hampel, or model-based residuals.
Smoothing — Savitzky–Golay when peaks matter, Kalman when you have a model.
Normalization — z-score or min–max, fit on training data only.
Feature engineering — FFT bins, rolling statistics, domain-specific transforms (Park/Clarke for three-phase motors, envelope detection for bearings).
Regime labeling — tag each sample with operating-point metadata for stratified validation.

Normalization methods

Method	Formula	Robust to outliers	When to use
Z-score	(x − μ) / σ	No	Data is approximately Gaussian; default choice
Min–max	(x − min) / (max − min)	No	Bounded range required (e.g., image pixels to [0,1])
Robust	(x − median) / IQR	Yes	Heavy-tailed distributions; known outliers
Unit-norm	x / ‖x‖₂	No	Direction matters more than magnitude (cosine similarity)

Feature engineering — motor-control examples

Park transform — rotate three-phase (a,b,c) currents into a rotating (d,q) frame aligned with the rotor. The AC problem becomes a DC problem.
FFT-bin features — specific harmonic amplitudes (1×, 2×, 6× electrical order) are diagnostic for NVH and rotor eccentricity.
Envelope / Hilbert transform — for bearing fault detection, the amplitude envelope of high-frequency content reveals the bearing characteristic frequencies.
Rolling stats — 100-ms rolling mean, std, skew, kurtosis of a torque signal to detect transients.
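To illustrate the Park transform item above, here is a small sketch showing that balanced three-phase currents become constant in the (d, q) frame. It assumes the amplitude-invariant Clarke scaling, a cosine convention with the d-axis aligned to phase a at θ = 0, and an ideal, known rotor angle; other sign and scaling conventions are common.

```python
import math

def clarke(a, b, c):
    """Amplitude-invariant Clarke transform: (a, b, c) -> (alpha, beta)."""
    alpha = (2.0 / 3.0) * (a - 0.5 * b - 0.5 * c)
    beta = (b - c) / math.sqrt(3.0)
    return alpha, beta

def park(alpha, beta, theta):
    """Rotate the stationary (alpha, beta) frame by the electrical angle theta."""
    d = alpha * math.cos(theta) + beta * math.sin(theta)
    q = -alpha * math.sin(theta) + beta * math.cos(theta)
    return d, q

# Balanced three-phase currents of amplitude 10 A over one electrical revolution.
amplitude = 10.0
dq = []
for step in range(12):
    theta = 2.0 * math.pi * step / 12
    a = amplitude * math.cos(theta)
    b = amplitude * math.cos(theta - 2.0 * math.pi / 3.0)
    c = amplitude * math.cos(theta + 2.0 * math.pi / 3.0)
    dq.append(park(*clarke(a, b, c), theta))
```

Every entry of `dq` comes out as (10, 0) up to rounding: the sinusoidal phase currents have turned into a constant d component and a zero q component.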

🚨 Traps

Fitting a scaler on train+val combined leaks statistics from val.
Decimating before anti-aliasing produces irreversible aliasing artifacts.
Replacing outliers with the mean biases the distribution and can break downstream features (e.g., rolling std).
Normalizing per-sample when you should normalize per-feature (or vice versa) — know your axis conventions.

03 · Savitzky–Golay Filter

A local polynomial least-squares smoother that preserves peaks, amplitudes, and derivatives — what you want when the features of the signal carry information, not just its low-frequency trend.

How it works

For every output sample, SG takes a window of 2k+1 neighboring input samples, fits a polynomial of degree P ≤ 2k in a least-squares sense, and sets the output equal to the polynomial's value at the center of the window:

ŷ[i] = Σ_{j=−k}^{k} c_j · y[i + j]

where the coefficients c_j come from the first row of (AᵀA)⁻¹Aᵀ and A is the Vandermonde matrix whose rows are [1, j, j², …, j^P]. The magic is that for any fixed window size W and polynomial order P, the coefficients are constant — SG reduces to a FIR filter with a fixed impulse response.
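The coefficient construction described above can be checked numerically. This sketch assumes NumPy is available and uses a hypothetical helper name `savgol_coeffs`; it builds the Vandermonde matrix for the window offsets and takes the first row of its pseudo-inverse.

```python
import numpy as np

def savgol_coeffs(window: int, order: int) -> np.ndarray:
    """Smoothing coefficients c_j for an odd window and a polynomial order < window."""
    k = window // 2
    j = np.arange(-k, k + 1)
    A = np.vander(j, order + 1, increasing=True)   # rows [1, j, j^2, ..., j^P]
    # First row of (A^T A)^-1 A^T: the fitted polynomial's value at j = 0.
    return np.linalg.pinv(A)[0]

c = savgol_coeffs(window=5, order=2)
print(np.round(c * 35))   # the 5-point quadratic weights, scaled by 35
```

For W = 5 and P = 2 this reproduces the well-known weights (−3, 12, 17, 12, −3)/35, which sum to one. scipy.signal.savgol_coeffs provides the same kind of coefficients in production code.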

Key parameters

Window size W

Must be odd. Larger W → more smoothing, more delay when the filter runs in real time, more distortion near rapidly varying features. Typical: 5–51 for sensor data.

Polynomial order P

Must satisfy P < W. Higher P preserves more of the signal's shape but attenuates less noise. Typical: P = 2 or 3. P = 0 is just a moving average.

Derivative order

SG can also return the m-th derivative of the fitted polynomial, which gives a smooth derivative estimate. This is how you differentiate a noisy position signal into a usable velocity signal.

SG vs moving average

A plain moving average is equivalent to SG with P = 0 — it fits a constant (the mean) to the window. It attenuates peaks because a peak is, by definition, deviation from a local mean. SG with P ≥ 2 fits curvature, so the output polynomial bends up at a peak the way the data does, and the peak survives.
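A tiny worked example makes the difference concrete. It uses five samples of a parabolic peak and the standard 5-point quadratic SG weights (−3, 12, 17, 12, −3)/35; the helper name `centre_estimate` is hypothetical.

```python
# Five samples of a parabolic peak y = 1 - x^2 at x = -2..2; the true peak value is 1.
y = [-3.0, 0.0, 1.0, 0.0, -3.0]

boxcar = [1 / 5] * 5                                    # SG with P = 0 (moving average)
sg_quadratic = [w / 35 for w in (-3, 12, 17, 12, -3)]   # SG with W = 5, P = 2

def centre_estimate(weights, samples):
    """Filter output at the centre sample: a weighted sum over the window."""
    return sum(w * s for w, s in zip(weights, samples))

ma_peak = centre_estimate(boxcar, y)         # about -1.0: the peak is flattened
sg_peak = centre_estimate(sg_quadratic, y)   # about  1.0: the quadratic fit recovers it
```

The data here is exactly quadratic, so the P = 2 fit is exact; on real signals the peak is only approximately preserved.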

Frequency response

SG's magnitude response is flatter near DC than a boxcar moving average and has milder ripple in the passband, but worse stopband rejection. If you need clean stopband rejection, a Butterworth or Chebyshev IIR is better. If you need to preserve waveform shape, SG wins.

Real-time use

The standard symmetric SG needs (W − 1)/2 future samples, so in real time it introduces (W − 1)/2 samples of delay. If that matters, you need a causal SG (window uses only past samples, polynomial evaluated at the newest sample instead of the center). Causal SG has worse noise rejection but adds no delay; it's a standard trick in motor control observers.

Edge handling

At the first and last (W − 1)/2 samples, a symmetric window runs off the edge. Options: pad with the edge value (simple, biases the ends), mirror the signal (smooth but adds a spurious reflection), extrapolate via the local polynomial (best quality, more compute), or just return zeros at the edges (simplest, loses samples).

💡 In practice

Start with W=11, P=3 for general-purpose smoothing. Sweep to taste.
For smoothing ahead of peak detection, use SG. For baseline removal, use a high-pass filter or a much longer SG window subtracted from the signal.
Implementations: scipy.signal.savgol_filter, MATLAB sgolayfilt. savgol_filter supports derivative output via deriv=; in MATLAB, derivative estimates come from the differentiation filters returned by sgolay.

04 · Outlier Removal

Not every weird sample is a problem — but some are sensor glitches, transcription errors, or events so rare they'll dominate a loss function. The art is distinguishing signal from noise.

Detection methods

Method	How it flags	Assumption	Robust to clusters?
Z-score	|x − μ| / σ > τ	Gaussian	No
IQR / Tukey	Outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]	Reasonably symmetric	Somewhat
Hampel	|x − median| / (1.4826·MAD) > τ	Symmetric, moderate tails	Yes
Isolation Forest	Average isolation depth in random trees	None (unsupervised)	Yes
LOF	Local density vs neighbors	Density-based	Yes
Model residuals	Large residual from a first-pass model	First model fits "normal" data	Yes

Z-score breakdown — why it fails on clusters

Consider 1000 clean samples around 0 (unit standard deviation) plus 50 spike outliers near 10. The mean shifts to ~0.5 and the std more than doubles. The spikes now score only about 4σ, uncomfortably close to a 3σ threshold, and a somewhat larger cluster would slip under it entirely (masking). This is the "breakdown point" problem: the mean and std behind the classical z-score have a 0% breakdown point, so even a small fraction of extreme values can move them arbitrarily far. Hampel uses the median and MAD, which have a 50% breakdown point: up to half your data can be outliers and the estimator still works.
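The scenario can be reproduced with the standard library alone. This is a sketch under stated assumptions: the clean samples are drawn with unit standard deviation, the spikes sit exactly at 10, and the seed is arbitrary.

```python
import random
import statistics

random.seed(0)
clean = [random.gauss(0.0, 1.0) for _ in range(1000)]   # assumed unit-variance noise
spikes = [10.0] * 50
data = clean + spikes

mean, std = statistics.fmean(data), statistics.pstdev(data)
median = statistics.median(data)
mad = statistics.median(abs(x - median) for x in data)

z_spike = abs(10.0 - mean) / std                      # classical z-score of a spike
hampel_spike = abs(10.0 - median) / (1.4826 * mad)    # robust score of the same spike
```

The classical score of a spike lands near 4 because the spikes themselves inflate the mean and std, while the Hampel score stays far above any reasonable threshold since the median and MAD barely move.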

Removal vs flagging vs winsorizing

Remove — drop the sample. Simple; loses information; changes the sample count.
Flag — keep the value, add a boolean "is_outlier" feature. The model can learn to use or ignore it.
Winsorize — clip to a threshold (e.g., the 99th percentile). Preserves sample count, bounds influence. Good default for heavy-tailed but real phenomena.
Impute — replace with the median or a model-based estimate. Use sparingly; can mask real signal.

Domain-aware handling

A current spike in a PMSM could be a sensor glitch or a real fault. A 500°C temperature reading on a silicon die is impossible; a 500°C reading on a combustion chamber is Tuesday. Always apply domain-specific bounds first (the "possible reading" filter), then statistical methods for whatever's left.

🚨 Traps

Computing outlier thresholds on the full dataset leaks test statistics into training.
Aggressive removal on small datasets creates biased estimates.
Setting a single global threshold when the noise level is regime-dependent (e.g., high-speed samples are intrinsically noisier) — use per-regime thresholds.
Calling "rare but real" events outliers and throwing them out. In fault detection, rare events are the whole point.

05 · Temperature Grouping (Regime Binning)

Physical systems are not i.i.d. across operating conditions. Binning by temperature — or any regime variable — lets you measure, validate, and train in a way that respects the underlying physics.

Why it matters

A dataset drawn from a motor operating at 20°C, 60°C, and 100°C mixes three physically distinct regimes: winding resistance R rises, magnet flux ψ_m drops, viscosity in bearings changes, sensor offsets drift. A single global model trained on the pool implicitly averages these effects. Its overall RMSE might be excellent while its per-regime error at 100°C is catastrophic — and you won't notice unless you bin.

Binning methods

Equal-width — fixed intervals (e.g., 20°C bins). Easiest to interpret; can leave bins empty.
Equal-frequency (quantile) — bins contain the same number of samples. Better statistical power per bin; bin edges move with the data.
K-means of the regime variable — data-driven clusters. Best when the regime variable has natural clusters.
Domain-defined — cold (< 40°C), nominal (40–80°C), hot (> 80°C). Aligns with engineering intuition and derating tables.

What to do with bins

Per-bin statistics — mean, std, min, max, distribution shape. Answers "is my training data balanced across conditions?"
Stratified train/val/test splits — ensure every bin is represented in every split, so test-set metrics are honest.
Stratified validation reporting — always report per-regime error alongside overall error.
Regime as a feature — one-hot encode the bin (or feed the continuous regime variable directly). Lets the model condition its predictions on the regime.
Specialist models per bin — mixture-of-experts style. Train a dedicated model per regime with a gating function on the regime variable. Overkill for most problems but powerful when regimes are physically very distinct.
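Per-regime reporting needs very little code. The sketch below uses the domain-defined bins from the previous section and made-up (temperature, error) pairs; the names `regime` and `rmse` are illustrative.

```python
import math
from collections import defaultdict

def regime(temp_c: float) -> str:
    """Domain-defined bins: cold (< 40 C), nominal (40-80 C), hot (> 80 C)."""
    if temp_c < 40.0:
        return "cold"
    return "nominal" if temp_c <= 80.0 else "hot"

def rmse(errors):
    """Root-mean-square of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical (temperature in deg C, prediction error) pairs.
samples = [(20, 0.1), (25, -0.1), (60, 0.2), (70, -0.2), (95, 2.0), (100, -2.0)]

by_bin = defaultdict(list)
for temp, err in samples:
    by_bin[regime(temp)].append(err)

overall = rmse([err for _, err in samples])
per_bin = {name: rmse(errs) for name, errs in by_bin.items()}
```

Here the overall RMSE is about 1.16, which hides the fact that the hot bin sits at 2.0 while the cold and nominal bins are at 0.1 and 0.2.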

Connection to other techniques

Regime binning is essentially supervised clustering. It's a structured version of what Gaussian Mixture Models or hierarchical clustering discover automatically. When you know the physics, always impose the structure — it's more sample-efficient than making the model rediscover it.

💡 In practice

For motor control: bin by (speed, torque, temperature) jointly — a 3D operating-point map. This is already how you'd look at a flux map; make your training and validation protocol match.
Start with 3–5 bins per dimension. More bins = more statistical noise per bin; fewer bins = mixed regimes within a bin.
Report per-bin residuals as a heatmap during debugging. It instantly reveals which part of the operating space your model is failing on.

06 · Underfitting

High bias. The model is fundamentally too simple — or too constrained — to capture the patterns in the data. It can't even do well on the data it was trained on.

Symptoms

Training loss is high and plateaus early.
Validation loss is also high, and close to the training loss (small train-val gap).
The model's predictions look systematically biased — e.g., it predicts the mean for every input.
Residuals show structure (not just noise) — if you plot prediction error versus input, you see patterns.

Causes and fixes

Cause	Fix
Model has insufficient capacity	Add layers, add neurons, use a richer architecture
Features are insufficient	Add features, do feature engineering, incorporate domain knowledge
Too much regularization	Reduce λ (L1/L2), reduce dropout rate
Learning rate too high (can't converge)	Reduce LR, use a warmup schedule
Learning rate too low (gets stuck)	Increase LR, use a cyclic schedule
Poor initialization	He / Xavier initialization; load a pretrained backbone
Wrong loss function for the task	Check objective matches the problem (e.g., regression vs classification)
Insufficient training	Train for more epochs for as long as the training loss is still decreasing

Diagnosing the cause

The standard test is to train the same architecture on a tiny subset (say 32 samples) and see if it can perfectly overfit it. If yes, the architecture is capable and the problem is optimization or regularization. If no, the architecture or representation is too weak. This "can my model memorize a batch?" test is one of the first things to run on any new setup.

07 · Overfitting

High variance. The model has too much capacity relative to the data, and it memorizes the noise in the training set instead of learning the underlying signal.

Symptoms

Training loss keeps decreasing.
Validation loss starts decreasing, then increases — the classic U-shape.
Large gap between training and validation accuracy.
Predictions are wildly confident even when wrong.
Small perturbations to the input produce large changes in the output (a large Lipschitz constant).

The bias-variance decomposition

For squared-error loss, the expected generalization error decomposes as:

E[(ŷ − y)²] = Bias² + Variance + Irreducible noise

Underfitting = high bias. Overfitting = high variance. Increasing model capacity pushes bias down but variance up. More data reduces variance without increasing bias. Regularization trades some variance for some bias. The "sweet spot" is model-and-data dependent; that's what validation metrics are for.
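The trade-off can be estimated empirically by refitting the same model on many noisy copies of a dataset. The sketch below assumes NumPy, a sine ground truth, a noise level of 0.3, and polynomial fits of degree 1 and 7 as stand-ins for low and high capacity; all of these choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
truth = np.sin(2.0 * np.pi * x)      # assumed ground-truth function
noise_std = 0.3                      # assumed irreducible noise level

def bias2_and_variance(degree: int, trials: int = 300):
    """Estimate Bias^2 and Variance of a polynomial fit, averaged over x."""
    preds = np.empty((trials, x.size))
    for t in range(trials):
        y = truth + rng.normal(0.0, noise_std, size=x.size)   # fresh noisy dataset
        preds[t] = np.polyval(np.polyfit(x, y, degree), x)
    bias2 = np.mean((preds.mean(axis=0) - truth) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

simple = bias2_and_variance(degree=1)     # low capacity
flexible = bias2_and_variance(degree=7)   # high capacity
```

The straight line shows a large Bias² and a small Variance; the degree-7 fit shows the opposite pattern, matching the decomposition above.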

Fixes, in order of cost

More data — always the best fix. Data augmentation gets you part of the way for free.
Early stopping — free; just stop at validation minimum.
Regularization — L2, L1, dropout, weight decay.
Smaller model — reduce depth, width, or switch to a simpler architecture.
Label smoothing — prevents overconfident outputs.
Ensembling — average multiple models to reduce variance.
Noise injection — add noise to inputs, weights, or activations during training.

🚨 Overfitting can be subtle

Overfitting to the validation set — after 500 hyperparameter trials, your val score is optimistic. This is why you also need a held-out test set.
Selection bias — reporting the best of many runs. Always report mean ± std across seeds.
Distribution shift masquerading as overfitting — train–val gap can be caused by genuine distribution mismatch rather than capacity. Fix the data, not the regularizer.

08 · Gradient Descent

The optimization engine of almost all deep learning. Iteratively move the weights opposite to the gradient of the loss.

Update rule

θ_{t+1} = θ_t − η · ∇_θ L(θ_t)

where η is the learning rate and ∇_θL is the gradient of the loss with respect to the parameters. This is the "steepest descent" update — locally, it's the direction that most rapidly decreases the loss.
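As a minimal sketch of the update rule, here is plain gradient descent on a one-dimensional toy loss. The loss, the starting point, and the two learning rates are assumptions chosen to show one converging and one diverging run.

```python
def gradient_descent(grad, theta0: float, lr: float, steps: int) -> float:
    """Repeat theta <- theta - lr * grad(theta) for a fixed number of steps."""
    theta = theta0
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

# Toy loss L(theta) = (theta - 3)^2 with gradient 2 * (theta - 3); minimum at 3.
grad = lambda theta: 2.0 * (theta - 3.0)

converged = gradient_descent(grad, theta0=0.0, lr=0.1, steps=100)
diverged = gradient_descent(grad, theta0=0.0, lr=1.1, steps=100)
```

With lr = 0.1 the distance to the minimum shrinks by a factor of 0.8 per step; with lr = 1.1 it grows by a factor of 1.2 per step, so the same rule that converges in one case blows up in the other.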

Variants

Variant	Gradient uses	Tradeoff
Batch GD	All training samples	Smooth but slow; memory-intensive
SGD	One sample at a time	Very noisy; can escape saddle points; slow per-epoch
Mini-batch SGD	Batch of 32–2048 samples	The practical default. Good compromise

Beyond vanilla SGD

Momentum — add a running average of past gradients. Accelerates convergence in ravines; smooths oscillation.
v_{t+1} = β·v_t + ∇L(θ_t); θ_{t+1} = θ_t − η·v_{t+1}
Typical β = 0.9.
Nesterov momentum — "look ahead" variant. Evaluate the gradient at the look-ahead point θ − ηβv rather than at θ. Slightly better convergence on convex problems; marginal on deep nets.
AdaGrad — per-parameter LR that shrinks with the accumulated squared gradient. Great for sparse features; stalls on deep nets as the LR decays to zero.
RMSProp — AdaGrad with an exponential moving average instead of a running sum. Fixes the stall.

Adam — RMSProp + momentum + bias correction.

m = β₁·m + (1 − β₁)·g         # first moment
v = β₂·v + (1 − β₂)·g²        # second moment
m̂ = m / (1 − β₁ᵗ)            # bias correction
v̂ = v / (1 − β₂ᵗ)
θ = θ − η · m̂ / (√v̂ + ε)

Defaults: β₁=0.9, β₂=0.999, ε=10⁻⁸. LR typically 1e-3 to 1e-4.

AdamW — decoupled weight decay. The weight-decay term is applied directly to θ, not through the gradient. This matters because Adam's adaptive denominator distorts L2 regularization. Always prefer AdamW over Adam with an L2 penalty.
LAMB / LARS — layer-wise LR scaling for very large batch sizes (> 8k). LARS was introduced for large-batch CNN training and LAMB for large-batch BERT pretraining; at ordinary batch sizes AdamW remains the usual choice for transformers.
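The Adam update equations listed above translate almost line for line into code. The sketch below is a scalar version for illustration only; the function name `adam_minimize`, the toy loss, and the raised learning rate of 0.1 are assumptions.

```python
import math

def adam_minimize(grad, theta, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Scalar Adam: momentum plus a running second moment, both bias-corrected."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g          # first moment
        v = beta2 * v + (1.0 - beta2) * g * g      # second moment
        m_hat = m / (1.0 - beta1 ** t)             # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy loss (theta - 3)^2; lr is raised to 0.1 so that a short run converges.
theta_final = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0, lr=0.1, steps=500)
```

Note that the very first step has magnitude close to lr regardless of the gradient's size, because m̂/√v̂ is ±1 at t = 1. That normalization is what makes Adam's default learning rates transfer across problems.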

Optimization landscape challenges

Saddle points — far more common than local minima in high dimensions. SGD's noise helps escape them.
Plateaus — flat regions where gradients vanish. Adaptive methods help.
Ravines — steep in one direction, shallow in another. Momentum is essential here.
Exploding gradients — solve with gradient clipping (norm or value).
Vanishing gradients — solve with residual connections, careful initialization (He/Xavier), batch norm, or activation choice (ReLU family instead of sigmoid/tanh).
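Clipping by norm, mentioned above as the fix for exploding gradients, can be sketched as follows. The function name `clip_by_global_norm` is hypothetical and the gradients are treated as one flat list of numbers.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient values so their joint L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)     # norm 5 is rescaled to norm 1
untouched = clip_by_global_norm([0.3, 0.4], max_norm=1.0)   # norm 0.5 is left unchanged
```

Norm clipping keeps the gradient's direction and only shortens it, whereas value clipping truncates each component independently and can change the direction. In PyTorch the corresponding utility is torch.nn.utils.clip_grad_norm_.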

09 · Learning Rate

The single most important hyperparameter. If you can only tune one thing, tune this. Too large and training diverges; too small and it crawls or stalls.

Intuition

The loss landscape is high-dimensional and curved. The learning rate controls how big a step you take along the gradient. In a smooth bowl-shaped loss, a large LR overshoots; a small LR takes forever. In a narrow ravine, a large LR bounces off the walls. The ideal LR depends on the local curvature — which is why adaptive optimizers exist, and why scheduling matters.

How to find a good starting LR

LR range test (Smith 2017) — start with a tiny LR (e.g., 10⁻⁷), double it every few iterations, record loss. Plot loss vs LR. The loss decreases sharply, hits a minimum, then diverges. Pick an LR roughly an order of magnitude below where loss starts climbing.
Default guesses by optimizer — Adam: 1e-3 or 3e-4. SGD with momentum: 1e-1 or 1e-2 (higher because no adaptive denominator). Fine-tuning from pretrained weights: 10–100× smaller than from-scratch.

Scheduling — mandatory in practice

Schedule	Shape	When to use
Constant	Flat	Debugging; almost never in production
Step decay	Cliff drops every N epochs	Computer vision legacy; interpretable
Exponential	Smooth geometric decay	General-purpose
Cosine annealing	Half-cosine down to zero	Default for most modern nets
Warmup + cosine	Linear ramp then cosine	Transformers, large batches, deep nets
SGDR (warm restarts)	Periodic cosine restarts	Ensembling snapshots; exploring basins
Reduce on plateau	Drop when val loss stalls	Reactive; good when you don't know epoch count
Cyclical LR (CLR)	Triangular waves	Alternative to SGDR; exploration-heavy

Linear scaling rule

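When multiplying the batch size by k, multiply the LR by k (Goyal et al. 2017). An epoch then contains k times fewer steps, each k times larger, so the total progress per epoch stays roughly constant. Goyal et al. report that the rule holds up to batch sizes of about 8k when paired with a gradual warmup; beyond that it breaks down, and layer-wise methods such as LARS are the usual remedy.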

LR × optimizer interaction

Adam's effective per-parameter LR is already scaled by the inverse square root of a running average of the squared gradient, so a global LR schedule matters less for Adam than for SGD, but still matters.
Early in training, Adam's second-moment estimate rests on only a few gradients, so the per-parameter step sizes are noisy and can be far too large. A short warmup helps.
SGD benefits strongly from cosine annealing. Adam benefits modestly.

💡 A common recipe that just works

AdamW optimizer, LR = 3e-4, weight decay = 0.01.
Linear warmup for 500–2000 steps, then cosine decay to zero.
Gradient clipping at norm 1.0.
Adjust LR ±3× based on LR range test output.
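The warmup-plus-cosine schedule from the recipe above fits in one small function. The peak LR of 3e-4 comes from the recipe; the warmup length of 1000 steps and the total of 10,000 steps are assumptions for the example, as is the function name.

```python
import math

def warmup_cosine_lr(step, peak_lr=3e-4, warmup_steps=1000, total_steps=10_000):
    """Linear ramp from 0 to peak_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))

# Start, mid-warmup, peak, mid-decay, end.
schedule = [warmup_cosine_lr(s) for s in (0, 500, 1000, 5500, 10_000)]
```

The five sampled values are 0, half the peak, the peak, half the peak again, and 0, which is the shape the recipe describes.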

10 · Regularization

Anything that discourages the model from memorizing the training set. Most regularizers work by imposing a prior — on the weights, on the activations, or on the data itself.

Weight-space regularization

Add a penalty on the weights to the loss:

L_total = L_data(w) + λ · R(w)

L2 (Ridge / weight decay) — R(w) = ½‖w‖². Gaussian prior on weights. Pulls weights smoothly toward zero; solutions stay smooth. Nearly always on; λ typically 1e-5 to 1e-3.
L1 (Lasso) — R(w) = ‖w‖₁. Laplace prior. Pushes many weights to exactly zero → sparsity and implicit feature selection. Use when you believe only a subset of inputs matter.
Elastic Net — α‖w‖₁ + (1−α)½‖w‖². Combines L1's sparsity with L2's stability. Standard in classical ML, less common in deep learning.

L2 vs weight decay in Adam

For plain SGD, adding ½λ‖w‖² to the loss and shrinking the weights by ηλw at every step are mathematically identical, because the penalty simply adds λw to the gradient. For Adam, they're not: the adaptive denominator rescales the λw term along with the rest of the gradient, so the effective regularization differs per parameter. AdamW fixes this by applying weight decay directly to the weights, outside the adaptive machinery. Always use AdamW over Adam + L2.
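The SGD half of this claim is easy to verify on a single scalar weight. The sketch below compares one step with the penalty ½λw² folded into the loss against one step with the decay applied directly to the weight; all numbers and function names are illustrative.

```python
def sgd_step_l2(w, grad_data, lr, lam):
    """SGD on L_data + (lam / 2) * w^2: the penalty adds lam * w to the gradient."""
    return w - lr * (grad_data + lam * w)

def sgd_step_decay(w, grad_data, lr, lam):
    """SGD on L_data alone, followed by shrinking the weight by lr * lam * w."""
    return (w - lr * grad_data) - lr * lam * w

w, g, lr, lam = 2.0, 0.5, 0.1, 0.01
l2_result = sgd_step_l2(w, g, lr, lam)
decay_result = sgd_step_decay(w, g, lr, lam)
same = abs(l2_result - decay_result) < 1e-12
```

Both routes give 1.948. In Adam the first route would pass `grad_data + lam * w` through the division by √v̂, which is where the two stop agreeing.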

Dropout

During training, independently drop each activation with probability p. At inference, keep all activations and scale by (1 − p) (or use inverted dropout: scale training activations by 1/(1−p) instead). This is approximately equivalent to training an exponentially large ensemble of sub-networks and averaging at inference.

Typical p = 0.2–0.5 for fully-connected layers.
Dropout is much less effective in convolutional layers — use spatial dropout (drop whole feature maps) or skip it.
In transformers, attention dropout and residual dropout are both common, each with p ≈ 0.1.
DropConnect is the weight-space analog — drop individual weights rather than activations.
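Inverted dropout, the variant most frameworks implement, can be sketched with NumPy as follows. The function name, array shapes, and seed are assumptions for the example.

```python
import numpy as np

def inverted_dropout(x, p, training, rng):
    """Zero each activation with probability p and rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return x                      # inference: identity, no rescaling needed
    keep = rng.random(x.shape) >= p   # True with probability 1 - p
    return x * keep / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
train_out = inverted_dropout(x, p=0.5, training=True, rng=rng)
eval_out = inverted_dropout(x, p=0.5, training=False, rng=rng)
```

During training each activation is either 0 or 2, so the expected value stays at 1; at inference the input passes through untouched. That is the practical appeal of the inverted form: the inference path needs no knowledge of p.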

Normalization as regularization

Batch normalization and layer normalization are primarily optimization tools (they rescale activations to stabilize training), but they also act as mild regularizers. BN adds noise via the per-batch statistics; LN doesn't, but does restrict the hypothesis class. With BN present, dropout is often redundant or counterproductive.

Data-space regularization

Usually stronger than weight-space regularization for modern deep nets:

Data augmentation — flips, crops, rotations, color jitter, time-warping, noise injection. Effectively increases dataset size.
Mixup — train on linear interpolations of pairs of samples and their labels.
CutMix — paste a patch from one image into another, mix the labels proportionally.
Label smoothing — replace hard 0/1 labels with e.g. 0.05/0.95. Prevents overconfident outputs; improves calibration.
Stochastic depth — randomly skip entire residual blocks during training (regularization + speedup).

Early stopping

Monitor validation loss; stop when it hasn't improved for N epochs (the "patience"). Save the best-so-far checkpoint and restore it at the end. Free, effective, always enable it.
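The patience logic can be sketched independently of any framework. The function below replays a list of validation losses and reports where training would stop; the name `early_stopping_epoch` and the loss curve are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # a real loop would save a checkpoint here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch              # stop and restore the best checkpoint
    return best_epoch, len(val_losses) - 1

# Hypothetical validation curve: improves until epoch 3, then degrades.
losses = [1.00, 0.80, 0.70, 0.65, 0.66, 0.70, 0.75, 0.90]
best, stopped = early_stopping_epoch(losses, patience=3)
```

With a patience of 3, training stops at epoch 6 and the checkpoint from epoch 3 is the one to restore.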

11 · Hyperparameter Search

The outer loop around training. You can't gradient-descend through the choice of architecture, the learning rate, or the batch size — you have to search over them explicitly.

The three classical strategies

Strategy	Mechanism	Strengths	Weaknesses
Grid search	Evaluate every combination	Exhaustive; easy to parallelize; reproducible	Cost is exponential in dimensions; wastes budget on insensitive axes
Random search	Sample uniformly from the space	Usually beats grid at equal budget (Bergstra & Bengio 2012); trivially parallel	No learning from past trials
Bayesian opt	Fit surrogate, use acquisition function to pick next	Sample-efficient; principled exploration/exploitation	Serial bottleneck; surrogate can be miscalibrated

Why random beats grid

In most problems, a few hyperparameters matter a lot and many matter very little. A grid spends its budget re-testing the same few values of each axis: a 3×3 grid over two hyperparameters uses nine trials but tries only three distinct values of each. Nine random trials try nine distinct values of each, so whichever axis turns out to matter is covered far more finely. This was the surprising empirical finding of Bergstra & Bengio: with the same number of trials, random search found configurations as good as or better than grid search in most of their neural-network experiments.

Bayesian optimization in detail

Two components:

Surrogate model — approximates the objective function given the trials seen so far. Gaussian Process (GP) is the textbook choice; Tree-structured Parzen Estimator (TPE) is Optuna's default; Random Forest is used by SMAC.
Acquisition function — picks the next point to try, balancing exploration (areas of high uncertainty) against exploitation (areas of high predicted value). Common choices:
- Expected Improvement (EI) — expected amount by which we'd beat the current best.
- Upper Confidence Bound (UCB) — mean + κ · std. Tune κ for exploration.
- Probability of Improvement (PI) — probability of beating the current best.
- Thompson sampling — sample a function from the posterior, optimize it, use its argmax. Naturally parallelizable.

Multi-fidelity methods

When each trial is expensive (minutes to days), you can't afford many. Multi-fidelity methods start many trials cheaply and kill bad ones early:

Successive Halving — run N trials for a short budget, keep the top half, double their budget, repeat.
Hyperband — runs successive halving with multiple initial budgets. Addresses the "how short should the short budget be?" question.
BOHB — Hyperband + Bayesian optimization. Uses Bayes to pick the configurations and Hyperband to allocate budget.
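ASHA — asynchronous variant of Successive Halving: configurations are promoted as soon as results arrive rather than at synchronized rungs. Scales to many parallel workers; widely used for deep-learning sweeps.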
PBT (Population-Based Training) — train many copies in parallel; periodically copy weights from good to bad trials while perturbing hyperparameters. Combines training and searching.
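Successive Halving, the building block behind several of the methods above, can be sketched in a few lines. The objective here is a noise-free toy function and the candidate learning rates are sampled log-uniformly; `successive_halving`, `toy_loss`, and the budget units are all hypothetical.

```python
import random

def successive_halving(configs, evaluate, min_budget=1):
    """Keep the better half each round and double the budget until one config is left."""
    budget, survivors = min_budget, list(configs)
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
        survivors = scored[: max(1, len(survivors) // 2)]   # lower loss is better
        budget *= 2
    return survivors[0], budget

# Hypothetical objective: loss is lowest near lr = 1e-2 and shrinks with more budget.
def toy_loss(lr, budget):
    return abs(lr - 1e-2) + 1.0 / budget

random.seed(0)
candidates = [10 ** random.uniform(-5, 0) for _ in range(16)]   # log-uniform sampling
best_lr, final_budget = successive_halving(candidates, toy_loss)
```

Sixteen candidates are cut to eight, four, two, and one while the budget doubles each round. The toy loss ranks configurations identically at every budget; real training curves can cross, which is the risk Hyperband hedges against by varying the initial budget.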

Practical guidelines

Use log scale for learning rate, weight decay, regularization strengths. Use linear for integer counts.
Define a reasonable search space before starting. Don't let Bayesian opt explore LR = 10¹⁰ — use sensible bounds.
Start with random search for ~20 trials to characterize the space, then switch to Bayesian opt with a warm start.
Report mean ± std across seeds, not just best-of-many.
Libraries: Optuna (flexible, TPE default), Ray Tune (distributed), W&B Sweeps (integrated UI), Hyperopt (older TPE).

🚨 Traps

Tuning on the test set. All hyperparameters must be chosen using the validation set.
Using the same seed for all trials — masks variance.
Searching too wide a space early. Exploration budget scales poorly with space volume.
Forgetting to fix the random seed for the data split — different folds produce different "best" hyperparameters.

12 · Backpropagation

Reverse-mode automatic differentiation applied to a computation graph. The algorithm that makes deep learning computationally feasible.

The core idea

Given a computation graph representing the loss as a composition of differentiable operations, backpropagation applies the chain rule backwards from the output to every parameter. The key trick: each intermediate gradient is computed exactly once and reused.

Why reverse-mode?

For a function f: ℝⁿ → ℝᵐ (n inputs, m outputs), you have two choices:

Forward-mode autodiff — cost scales with n (the number of inputs). Good when n is small.
Reverse-mode autodiff — cost scales with m (the number of outputs). Good when m is small.

Neural network training has m = 1 (the scalar loss) and n = millions (the weights). Reverse-mode wins by a factor of n/m = millions. This is why all deep-learning frameworks use reverse-mode.

The forward-backward pattern

# forward pass
h₁ = W₁ · x
a₁ = ReLU(h₁)
h₂ = W₂ · a₁
ŷ = softmax(h₂)
L = CE(ŷ, y)

# backward pass — apply chain rule
dL/dŷ   = −y / ŷ                        # CE derivative (elementwise)
dL/dh₂  = ŷ − y                         # after applying the softmax Jacobian
dL/dW₂  = dL/dh₂ · a₁ᵀ
dL/da₁  = W₂ᵀ · dL/dh₂
dL/dh₁  = dL/da₁ ⊙ (h₁ > 0)             # ReLU derivative
dL/dW₁  = dL/dh₁ · xᵀ
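The same two-layer example can be written in NumPy and checked against a finite-difference estimate. This is a sketch for a single input vector with a one-hot label; the layer sizes, seed, and function names are assumptions. The softmax and cross-entropy steps are folded into the single gradient ŷ − y with respect to h₂.

```python
import numpy as np

def forward(W1, W2, x, y):
    h1 = W1 @ x
    a1 = np.maximum(h1, 0.0)                 # ReLU
    h2 = W2 @ a1
    z = h2 - h2.max()                        # shift for numerical stability
    y_hat = np.exp(z) / np.exp(z).sum()      # softmax
    loss = -np.sum(y * np.log(y_hat))        # cross-entropy with one-hot y
    return loss, (h1, a1, y_hat)

def backward(W2, x, y, cache):
    h1, a1, y_hat = cache
    dh2 = y_hat - y                          # softmax + CE combined
    dW2 = np.outer(dh2, a1)
    da1 = W2.T @ dh2
    dh1 = da1 * (h1 > 0)                     # ReLU derivative
    dW1 = np.outer(dh1, x)
    return dW1, dW2

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x, y = rng.normal(size=3), np.array([1.0, 0.0])

loss, cache = forward(W1, W2, x, y)
dW1, dW2 = backward(W2, x, y, cache)

# Finite-difference check on one entry of W2.
eps = 1e-6
W2_shift = W2.copy()
W2_shift[0, 0] += eps
numeric = (forward(W1, W2_shift, x, y)[0] - loss) / eps
```

Comparing `numeric` with `dW2[0, 0]` is the standard gradient check: the two should agree to several decimal places, and a mismatch almost always means a bug in the backward pass.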

Complexity

The backward pass costs roughly the same as the forward pass — within a constant factor of 2–3×. Memory is the bigger cost: to compute gradients, you must cache all intermediate activations from the forward pass. This is why memory usage scales with depth × batch size × activation size.

Gradient flow issues

Vanishing gradients — in deep networks with saturating activations (sigmoid, tanh), the gradient shrinks exponentially with depth. Solutions: ReLU family activations, residual connections (ResNets), careful initialization (He/Xavier).
Exploding gradients — especially in RNNs and very deep networks. Solution: gradient clipping (clip the norm of the gradient vector to some max value).

Memory-saving tricks

Gradient checkpointing — don't cache all activations; recompute some during the backward pass. Trades compute for memory.
Mixed-precision training — keep master weights in FP32, do forward/backward in FP16/BF16. Roughly halves activation memory and often speeds up compute on modern GPUs.
Gradient accumulation — simulate a large batch by running several small batches and summing their gradients before stepping. Essential when a single large batch won't fit in memory.

Dynamic vs static graphs

Dynamic graphs (PyTorch, TF eager) — the graph is built on the fly during the forward pass. Easy to debug and to use Python control flow.
Static graphs (TF 1.x, XLA, TorchScript) — graph is defined once and reused. Enables more aggressive optimization (operator fusion, memory planning). Modern frameworks JIT-compile dynamic graphs to get the best of both.

13 · Feedforward Neural Networks (MLPs)

The simplest deep architecture: stacked affine transformations with nonlinearities between them. The "Hello, World!" of neural networks and still the right tool for plenty of tabular problems.

Structure

a⁽ˡ⁺¹⁾ = φ(W⁽ˡ⁾ a⁽ˡ⁾ + b⁽ˡ⁾)

Each layer is an affine transform followed by an elementwise nonlinearity φ. No memory, no recurrence, no spatial awareness — just a stack of functions that takes an input vector and produces an output vector.

Activation functions

Name	Formula	Characteristics
Sigmoid	1/(1+e⁻ˣ)	Saturates, vanishing gradients, output in (0,1). Mostly replaced.
Tanh	(eˣ−e⁻ˣ)/(eˣ+e⁻ˣ)	Zero-centered sigmoid. Used in some RNNs.
ReLU	max(0, x)	Default for hidden layers. Sparse, non-saturating. Can "die" (output stuck at 0).
Leaky ReLU	max(αx, x), α≈0.01	Fixes dying ReLU.
GELU	x·Φ(x)	Smooth, used in transformers (BERT, GPT).
Swish/SiLU	x·sigmoid(x)	Smooth, slight edge over ReLU on deep nets.
Softmax	eᶻⁱ/Σeᶻʲ	Output layer for multiclass classification.
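
The formulas in the table can be written out directly. This is a minimal sketch on plain Python scalars (a framework would vectorize these). The numerically stable forms of sigmoid and softmax are standard tricks that are not part of the table itself, and GELU here uses the exact erf form of Φ.

```python
import math

def sigmoid(x):
    # Branch on the sign so exp() is never called with a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return max(alpha * x, x)

def gelu(x):
    # x * Phi(x), with Phi the standard normal CDF written via erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    return x * sigmoid(x)

def softmax(z):
    # Subtracting the max leaves the result unchanged but avoids overflow.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1000.0, 1001.0, 1002.0])  # would overflow without the max shift
```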

Universal Approximation Theorem

A feedforward network with a single hidden layer of sufficient width and a non-polynomial activation can approximate any continuous function on a compact domain to arbitrary accuracy (Cybenko 1989, Hornik 1991). This is a theoretical guarantee of expressive power, not a practical recipe: the required width can be impractically large, and the theorem says nothing about whether training will find the approximating weights. In practice, deeper networks often represent the same function with far fewer parameters than shallow wide ones.

When to use an MLP

Tabular data — fixed-size feature vectors, no spatial or temporal structure. MLPs are a reasonable choice here, though gradient-boosted trees (XGBoost, LightGBM) often still win on small-to-medium tabular datasets.
Regression heads — on top of a feature extractor (CNN, RNN, Transformer), an MLP is the canonical "read-out" head.
Critic/value networks — in reinforcement learning.
Physics-informed models — as a flexible function approximator inside a structured model (neural ODEs, PINNs, surrogate models for expensive simulators).

When not to use an MLP

Images → CNNs or Vision Transformers. MLPs ignore spatial structure and need vastly more parameters.
Sequences → RNNs/LSTMs/GRUs, or Transformers. MLPs have no memory.
Graphs → Graph Neural Networks.

Initialization

Xavier/Glorot — variance = 2/(fan_in + fan_out). For tanh/sigmoid.
He/Kaiming — variance = 2/fan_in. For ReLU family. The default for modern nets.
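Zero initialization is fine for biases but not for weights: with all-zero weights, every unit in a layer computes the same output and receives the same gradient, so symmetry is never broken. Never use it for weight matrices.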
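
The two variance formulas can be checked empirically. This is a minimal sketch assuming Gaussian initialization (uniform variants with the same variance are equally common); the helper names and the 64 × 256 layer shape are illustrative.

```python
import math
import random

def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier/Glorot init: variance = 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """Standard deviation for He/Kaiming init: variance = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)

def init_matrix(fan_out, fan_in, std, rng):
    """Draw a fan_out x fan_in weight matrix from a zero-mean Gaussian."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
W = init_matrix(64, 256, he_std(256), rng)

# The empirical variance of the sampled weights should be close to 2 / fan_in.
flat = [w for row in W for w in row]
sample_var = sum(w * w for w in flat) / len(flat)
target_var = 2.0 / 256
```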

14 · Recurrent Neural Networks

Feedforward networks augmented with a hidden state that feeds back in at the next timestep. The classical way to process sequences, largely supplanted by Transformers but still relevant on resource-constrained devices.

The core recurrence

h_t = φ(W_x x_t + W_h h_{t−1} + b)

y_t = W_y h_t + b_y

The same weights W_x, W_h are reused across all timesteps — this is the weight-sharing that makes RNNs work on sequences of arbitrary length with a fixed parameter count.
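
A minimal sketch of the recurrence, assuming φ = tanh and plain Python lists for vectors and matrices. The function names and the toy weights are illustrative. The same five parameter objects are used for a sequence of length 1 and one of length 50.

```python
import math

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def rnn_forward(xs, h0, W_x, W_h, b, W_y, b_y):
    """Run the recurrence over a sequence xs; returns all outputs and the final hidden state."""
    h = list(h0)
    outputs = []
    for x in xs:
        pre = [p + q + r for p, q, r in zip(matvec(W_x, x), matvec(W_h, h), b)]
        h = [math.tanh(v) for v in pre]                   # h_t
        y = [p + q for p, q in zip(matvec(W_y, h), b_y)]  # y_t
        outputs.append(y)
    return outputs, h

# Toy parameters: 1 input feature, 2 hidden units, 1 output.
W_x = [[0.5], [-0.5]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
W_y = [[1.0, 1.0]]
b_y = [0.0]

short_out, h_short = rnn_forward([[1.0]], [0.0, 0.0], W_x, W_h, b, W_y, b_y)
long_out, h_long = rnn_forward([[1.0]] * 50, [0.0, 0.0], W_x, W_h, b, W_y, b_y)
```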

Training: backprop through time (BPTT)

Unroll the RNN for T timesteps and apply standard backpropagation through the unrolled graph. The gradient of the loss with respect to W_h involves products of up to T Jacobians, which leads directly to the core problem:

Vanishing and exploding gradients

Each Jacobian in that product is roughly W_h scaled by the activation derivative. If the largest singular value of W_h is below 1, gradients shrink exponentially as they propagate backward through time → vanishing gradients → can't learn long-range dependencies. If the spectral radius is above 1, gradients can grow exponentially → exploding gradients → training diverges. In practice, plain RNNs struggle to remember much beyond ~10 steps back.
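
The exponential behaviour is easy to reproduce numerically. The sketch below makes two simplifying assumptions that are not in the text: the activation is the identity (so each Jacobian is exactly W_h), and W_h is a scaled 2 × 2 rotation, which makes every singular value and the spectral radius equal to the chosen scale. With a saturating activation the extra derivative factor would only shrink gradients further.

```python
import math

def backprop_norms(scale, steps, angle=0.3):
    """Norm of a gradient vector after each backward step through a linear RNN.

    W_h = scale * rotation(angle), so each step multiplies the norm by `scale`.
    """
    c, s = math.cos(angle), math.sin(angle)
    W_h = [[scale * c, -scale * s],
           [scale * s,  scale * c]]
    g = [1.0, 0.0]  # gradient arriving at the last hidden state
    norms = []
    for _ in range(steps):
        # One step back in time: g <- W_h^T g
        g = [W_h[0][0] * g[0] + W_h[1][0] * g[1],
             W_h[0][1] * g[0] + W_h[1][1] * g[1]]
        norms.append(math.hypot(g[0], g[1]))
    return norms

vanishing = backprop_norms(0.9, 50)  # final norm is 0.9 ** 50
exploding = backprop_norms(1.1, 50)  # final norm is 1.1 ** 50
```

After 50 steps the gradient norm has dropped below 0.01 in the first case and grown past 100 in the second, even though the scale differs from 1 by only 10%.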

LSTM — the classic fix

Long Short-Term Memory (Hochreiter & Schmidhuber 1997) adds a separate cell state that flows through the sequence with only minor modifications at each step. In its now-standard form (the forget gate was a later addition, Gers et al. 2000) it is controlled by three gates:

Forget gate f_t — how much of the previous cell state to keep.
Input gate i_t — how much of the proposed new content to add.
Output gate o_t — how much of the cell state to expose as the hidden state.

The cell state's update is approximately additive, so gradients can flow through it without vanishing. LSTMs can learn dependencies hundreds of steps long.
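
A single LSTM step can be sketched as follows. The gate equations used here are the commonly cited formulation (sigmoid gates, tanh candidate, each acting on the concatenation of h_{t−1} and x_t); the text above does not spell them out, so treat the exact parameterization, the params dictionary layout, and the toy values as assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step. params maps 'f', 'i', 'o', 'g' to (W, b); W acts on h_prev + x."""
    hx = h_prev + x  # list concatenation [h_{t-1}; x_t]

    def gate(name, act):
        W, b = params[name]
        return [act(z + b_k) for z, b_k in zip(matvec(W, hx), b)]

    f = gate("f", sigmoid)    # forget gate
    i = gate("i", sigmoid)    # input gate
    o = gate("o", sigmoid)    # output gate
    g = gate("g", math.tanh)  # proposed new content
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

# One hidden unit, one input. Forget gate saturated open, input gate saturated shut.
params = {
    "f": ([[0.0, 0.0]], [20.0]),
    "i": ([[0.0, 0.0]], [-20.0]),
    "o": ([[0.0, 0.0]], [0.0]),
    "g": ([[0.0, 0.0]], [0.0]),
}
h, c = [0.0], [0.75]
for _ in range(100):
    h, c = lstm_step([1.0], h, c, params)
```

With the forget gate near 1 and the input gate near 0, the cell state is carried across 100 steps almost unchanged, which is the additive path that keeps gradients from vanishing.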

GRU — the simpler cousin

Gated Recurrent Unit (Cho et al. 2014) merges the forget and input gates into a single "update gate" and combines the cell and hidden states. Fewer parameters than LSTM, often similar performance. A reasonable default when you don't want to debate LSTM vs GRU.

Variants and tricks

Bidirectional RNN — one RNN reads the sequence left to right, a second reads it right to left, and their hidden states are concatenated at each position. Useful when the whole sequence is available (not for real-time).
Encoder–decoder — the classic architecture for sequence-to-sequence tasks (translation). Encoder RNN compresses the input into a context vector; decoder RNN generates the output.
Attention over RNN states — the predecessor to Transformers. Instead of a single context vector, the decoder attends to all encoder hidden states.
Teacher forcing — during training, feed the ground-truth previous token as input rather than the model's own prediction. Faster convergence, but causes exposure bias at inference.

Why Transformers won

Parallelism — RNNs are inherently sequential; Transformers process all positions in parallel.
Global receptive field — every token attends directly to every other token in the context window, so long-range information doesn't have to survive many recurrent steps.
Better scaling — empirically, Transformers benefit more from scale.

Where RNNs still make sense

Streaming / online inference with very tight latency, where you can't afford to re-run attention over the whole context.
Very long sequences where O(n²) attention is too expensive (though linear-attention Transformers are closing this gap).
Tiny memory budgets on edge devices — a small GRU is still hard to beat on a microcontroller.

15 · Quantization

Reduce numeric precision of weights and activations — typically FP32 → INT8 — to shrink the model, accelerate inference, and lower energy consumption. Essential for edge deployment.

The basic mapping

q = round(x / s) + z

x ≈ s · (q − z)

where s is the scale and z is the zero point. Pick them so that the quantized range [q_min, q_max] (e.g., [−128, 127] for signed INT8) covers the float range [x_min, x_max] accurately.
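
A minimal sketch of the mapping for signed INT8. The way s and z are derived from [x_min, x_max] below (stretching the range to include 0 and clamping q to the integer range) is one common convention and an assumption here, as are the function names. Note that Python's round() rounds ties to even, which may differ from the rounding mode of a given runtime.

```python
def quant_params(x_min, x_max, q_min=-128, q_max=127):
    """Choose scale s and zero point z so that [x_min, x_max] maps onto [q_min, q_max]."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # keep 0.0 exactly representable
    s = (x_max - x_min) / (q_max - q_min)
    if s == 0.0:
        s = 1.0  # degenerate all-zero range
    z = int(round(q_min - x_min / s))
    z = max(q_min, min(q_max, z))
    return s, z

def quantize(x, s, z, q_min=-128, q_max=127):
    q = int(round(x / s)) + z
    return max(q_min, min(q_max, q))  # clamp values that fall outside the range

def dequantize(q, s, z):
    return s * (q - z)

s, z = quant_params(-1.0, 3.0)
q_zero = quantize(0.0, s, z)
roundtrip_errors = [
    abs(dequantize(quantize(x, s, z), s, z) - x)
    for x in [-1.0 + 4.0 * k / 100 for k in range(101)]
]
```

For values inside the calibrated range, the round-trip error is bounded by half a quantization step, s / 2.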

Symmetric vs asymmetric

Symmetric — z = 0. The covered float range is [−a, a] with a = max|x|. Faster matmul (no zero-point correction). Wastes range if the distribution is asymmetric (e.g., post-ReLU activations, which are ≥ 0).
Asymmetric — z ≠ 0, chosen to best match the actual range. Better accuracy on asymmetric distributions; slightly more expensive compute.

Standard practice: symmetric for weights (roughly zero-mean), asymmetric for activations.
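
The wasted range is easy to quantify on non-negative data. The sketch below assumes synthetic post-ReLU-like activations spread evenly over [0, 6] (made-up values, assumed not all zero) and compares the worst-case round-trip error of both schemes; the helper names are illustrative.

```python
def symmetric_scale(values, q_max=127):
    """Scale for z = 0: the range [-max|x|, max|x|] maps to [-q_max, q_max]."""
    return max(abs(v) for v in values) / q_max

def asymmetric_params(values, q_min=-128, q_max=127):
    """Scale and zero point covering [min(x), max(x)], stretched to include 0."""
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    s = (hi - lo) / (q_max - q_min)
    z = int(round(q_min - lo / s))
    return s, z

def roundtrip(x, s, z, q_min, q_max):
    q = max(q_min, min(q_max, int(round(x / s)) + z))
    return s * (q - z)

acts = [6.0 * k / 999 for k in range(1000)]  # synthetic activations, all >= 0

s_sym = symmetric_scale(acts)
s_asym, z_asym = asymmetric_params(acts)

err_sym = max(abs(roundtrip(x, s_sym, 0, -127, 127) - x) for x in acts)
err_asym = max(abs(roundtrip(x, s_asym, z_asym, -128, 127) - x) for x in acts)
```

The symmetric scheme spends half of its codes on negative values that never occur, so its step size (and worst-case error) is about twice that of the asymmetric scheme.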

Granularity

Per-tensor — one scale/zero for the whole tensor. Smallest overhead, worst accuracy for tensors with varied statistics.
Per-channel (per-axis) — separate scale per output channel (for weights) or per feature map (for activations). Much better accuracy for weights; standard for CNNs.
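
A small sketch of why granularity matters for weights, assuming symmetric INT8 quantization and a made-up 2-channel weight matrix whose channels differ in magnitude by roughly 50×. The helper names are illustrative.

```python
def quant_dequant(values, scale, q_max=127):
    """Symmetric quantize-then-dequantize of a list of floats."""
    out = []
    for v in values:
        q = max(-q_max, min(q_max, int(round(v / scale))))
        out.append(q * scale)
    return out

def max_abs(values):
    return max(abs(v) for v in values)

def rel_error(orig, approx):
    """Worst relative error over the entries of one channel."""
    return max(abs(a - o) / abs(o) for o, a in zip(orig, approx))

W = [
    [0.010, -0.020, 0.015, -0.005],  # small-magnitude output channel
    [1.000, -0.800, 0.600, -0.900],  # large-magnitude output channel
]

# Per-tensor: one scale, dictated by the largest weight anywhere in W.
tensor_scale = max_abs([w for row in W for w in row]) / 127
per_tensor = [quant_dequant(row, tensor_scale) for row in W]

# Per-channel: each output channel gets its own scale.
per_channel = [quant_dequant(row, max_abs(row) / 127) for row in W]

small_channel_err_tensor = rel_error(W[0], per_tensor[0])
small_channel_err_channel = rel_error(W[0], per_channel[0])
```

With a single scale, the small channel is squeezed into a handful of integer codes and loses more than half of its worst entry's value; with its own scale the error stays below 1%.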

Post-Training Quantization (PTQ)

Take a trained FP32 model, run a calibration dataset through it to collect activation statistics, compute the scales and zero points, convert the weights. No retraining required.

Dynamic PTQ — quantize weights offline, quantize activations on the fly during inference, with scales computed from the range observed in each input. Cheap, simple, needs no calibration set, and adapts to whatever input arrives.
Static PTQ — quantize weights and activations offline, using a calibration set to fix activation scales. Faster inference (no per-input range computation), but accuracy depends on how representative the calibration set is.

PTQ typically loses 0–2% accuracy at INT8 on standard vision models, 3–5% on more sensitive architectures. At INT4 and below, PTQ usually isn't enough.

Quantization-Aware Training (QAT)

Insert "fake quantization" nodes into the model during training — they simulate the rounding and clipping of the target bit-width while keeping gradients flowing (straight-through estimator). The model learns weights that are robust to quantization noise.
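
A scalar sketch of a fake-quantization node and its straight-through backward rule. The variant shown, which passes the gradient through unchanged inside the clipping range and zeroes it outside, is one common choice and an assumption here; the function names are illustrative, not a framework API.

```python
def fake_quant(x, scale, q_min=-128, q_max=127):
    """Forward pass: round and clip as INT8 would, but return a float."""
    q = int(round(x / scale))
    q = max(q_min, min(q_max, q))
    return q * scale

def fake_quant_ste_grad(x, scale, upstream, q_min=-128, q_max=127):
    """Backward pass with a straight-through estimator.

    round() has zero gradient almost everywhere, so it is treated as the
    identity. Values that were clipped receive no gradient.
    """
    inside = q_min * scale <= x <= q_max * scale
    return upstream if inside else 0.0

y_in_range = fake_quant(0.3, 0.25)     # snaps to the nearest multiple of the scale
y_clipped = fake_quant(1000.0, 0.25)   # saturates at q_max * scale
g_in_range = fake_quant_ste_grad(0.3, 0.25, upstream=2.0)
g_clipped = fake_quant_ste_grad(1000.0, 0.25, upstream=2.0)
```

The forward pass exposes the model to the rounding and clipping error it will see after deployment, while the backward pass still delivers a usable gradient to the underlying float weights.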

QAT recovers most of the PTQ accuracy loss and is usually needed for INT4, INT2, and binary networks. The cost is an extra fine-tuning run on top of the trained model, typically a small fraction of the original training budget.

Hardware support

Platform	Preferred formats
NVIDIA GPU (Ampere+)	FP16, BF16, INT8 via Tensor Cores; FP8 on Hopper
Google TPU	BF16 training, INT8 inference
ARM Cortex-A / Neon	INT8 matmul via NEON intrinsics
ARM Cortex-M / CMSIS-NN	INT8, INT16 fixed-point
TriCore / automotive MCUs	Fixed-point (usually INT16 or INT32 with custom scaling)

Typical wins

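FP32 → INT8: 4× smaller, 2–4× faster inference, 2–4× less energy per inference, typically < 1% accuracy loss on well-behaved models (more on sensitive architectures unless QAT is used).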
FP32 → INT4 (with QAT): 8× smaller, needs specialized hardware to realize speedup.

🚨 Traps

Calibrating on an unrepresentative dataset produces miscalibrated scales and silent accuracy loss.
Per-tensor quantization of weights with wide dynamic range (one huge outlier channel) destroys accuracy — always use per-channel for weights.
Layer norm and softmax are numerically sensitive — often kept in FP16 even in an otherwise INT8 model.
Depthwise convolutions are unusually sensitive to quantization; may need QAT even if the rest of the model doesn't.

16 · Pruning

Identify and remove the weights (or neurons, or channels, or whole layers) that contribute least to the network's output. Shrinks the model and, if done structurally, speeds up inference.

Granularity: the key choice

Granularity	What's removed	Compression	Speedup on dense HW
Unstructured (weight-level)	Individual weights	Very high (90%+ achievable)	None — produces a sparse matrix
Vector / row / column	Sub-blocks of a weight matrix	High	Some, with structured sparse kernels
Filter / channel	Entire output channels of a conv or FC layer	Moderate	Full speedup on any HW
Layer / block	Whole layers (often in ResNets)	Modest	Full speedup

The fundamental tradeoff: unstructured pruning gives the highest compression ratio, but the resulting sparse weight matrix doesn't run any faster on a dense GEMM (the standard CPU/GPU matmul). To get actual speedup, you need structured pruning, or specialized sparse hardware (NVIDIA 2:4 sparsity on Ampere+, some mobile accelerators).

Pruning criteria

Magnitude — remove weights with |w| below a threshold. Simple, strong baseline, used everywhere.
Gradient-based — remove weights whose gradient is small (they're not actively being updated).
Taylor expansion — estimate the effect of removing each weight on the loss using first- or second-order Taylor terms. More principled.
Fisher information — weights with low Fisher info contribute little to the model's predictions.
Movement pruning — for fine-tuning: prune weights that moved toward zero during training. Particularly effective for transformer fine-tuning.
Filter-level — prune filters with low L1/L2 norm, low batch-norm scaling factor, or low activation magnitude on validation data.

The prune–fine-tune cycle

Train a model to convergence.
Compute pruning scores for every weight/channel/filter.
Remove the bottom X% (or set them to zero and mask them out).
Fine-tune for a few epochs to recover accuracy.
Repeat (iterative pruning) — often outperforms one-shot pruning of the same total amount.
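
The scoring and masking steps of the cycle can be sketched for unstructured magnitude pruning. This is a minimal plain-Python illustration on a flat list of weights; the function name and the toy weights are assumptions, and the fine-tuning step is not shown (during fine-tuning the mask would be re-applied after every update so pruned weights stay at zero).

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest |w|.

    Returns the pruned weights and a 0/1 mask of the same length.
    """
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask

weights = [0.5, -0.05, 0.3, 0.01, -0.8, 0.02, 0.1, -0.2]
pruned, mask = magnitude_prune(weights, sparsity=0.5)
```

Here the four smallest-magnitude weights (0.01, 0.02, −0.05, 0.1) are removed and the rest are left untouched.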

Iterative Magnitude Pruning (IMP) and the Lottery Ticket Hypothesis

Frankle & Carbin (2018) showed that for many networks, if you prune a trained network heavily (say 90%), then reset the surviving weights to their original initial values and retrain with the pruning mask held fixed, the sparse network trains to nearly the original accuracy. This suggests that dense networks contain small "winning ticket" subnetworks that, once found, are sufficient on their own.

One-shot vs gradual pruning

One-shot — prune all X% in one step, fine-tune. Simple but less robust.
Gradual — increase the pruning ratio smoothly from 0 to X% over many epochs while training continues. Less accuracy drop, more compute.
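
One way to "increase the pruning ratio smoothly" is a cubic ramp that prunes quickly at first and slows down as the target is approached. This schedule is a commonly used choice, not something the text above prescribes; the function name and the step counts are illustrative.

```python
def cubic_sparsity(step, begin_step, end_step, final_sparsity, initial_sparsity=0.0):
    """Target sparsity at a training step, ramping from initial to final along a cubic."""
    if step <= begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

# Ramp from 0% to 90% sparsity between steps 1000 and 11000.
schedule = [cubic_sparsity(s, 1000, 11000, 0.9) for s in range(0, 12001, 1000)]
midpoint = cubic_sparsity(6000, 1000, 11000, 0.9)
```

Halfway through the ramp the target is already 78.75%, so most of the pruning happens early, while the network still has many training steps left to recover.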

Practical guidance

Start with magnitude pruning. Use it unless you have evidence something else works better.
For deployment speedup on general hardware, use structured (filter-level) pruning.
If the model will also be quantized, prune first: quantization of a pruned model often behaves better.
For fine-tuning a large pretrained model on a small dataset, movement pruning is a strong choice.

17 · Knowledge Distillation

Train a small "student" model to imitate a large "teacher" — not just its predictions, but its full probability distribution. Transfers information beyond what hard labels convey.

The Hinton formulation (2015)

Let z be the logits of a model. Standard softmax gives p = softmax(z). Add a temperature T > 1:

p_i^(T) = exp(z_i / T) / Σ_j exp(z_j / T)

Higher T produces a softer distribution (all probabilities pulled toward uniform). The student is trained to match the teacher's softened distribution and the hard labels:

L = α · CE(student, hard) + (1−α) · T² · KL(teacher_T ‖ student_T)

The T² factor compensates for the 1/T² scaling of the soft-target gradients, so the relative weight of the two terms stays roughly constant when T changes.
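
The combined loss can be sketched for a single example in plain Python. The helper names and the logit values are illustrative assumptions; the KL term is taken from the teacher's softened distribution to the student's, which is the usual convention, and the hard-label term uses the student's ordinary T = 1 softmax.

```python
import math

def softmax_T(logits, T=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    return -math.log(probs[label])

def kl_div(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    hard = cross_entropy(softmax_T(student_logits, 1.0), label)
    soft = kl_div(softmax_T(teacher_logits, T), softmax_T(student_logits, T))
    return alpha * hard + (1.0 - alpha) * T * T * soft

teacher_logits = [2.0, 1.0, 0.1]
matching = distillation_loss(teacher_logits, teacher_logits, label=0)
disagreeing = distillation_loss([0.1, 1.0, 2.0], teacher_logits, label=0)
```

When the student's logits equal the teacher's, the KL term is zero and only the hard-label term remains; a student that ranks the classes differently pays on both terms.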

Why it works: dark knowledge

A teacher that outputs [0.7, 0.2, 0.08, 0.02] for four classes is saying more than just "class 1." The relative probabilities encode similarity structure: class 2 is "more like" class 1 than class 4 is. This information is present in the teacher but absent from the one-hot hard label. Temperature amplifies it.
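
The effect of temperature on this four-class example can be computed directly. The sketch assumes the teacher's logits are the log of the stated probabilities (they are only defined up to an additive constant), so applying temperature T amounts to softmax(log(p) / T); the function name soften is illustrative.

```python
import math

def soften(probs, T):
    """Re-apply softmax at temperature T to a probability vector: softmax(log(p) / T)."""
    scaled = [math.log(p) / T for p in probs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher = [0.7, 0.2, 0.08, 0.02]
softened = soften(teacher, T=4.0)
unchanged = soften(teacher, T=1.0)
```

At T = 4 the distribution becomes roughly [0.37, 0.27, 0.21, 0.15]: the class ranking is preserved, but the small probabilities are now large enough to contribute meaningfully to the student's loss.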

Variants

FitNets (Romero et al. 2015) — match the student's intermediate activations to the teacher's, not just the output.
Attention Transfer — match the spatial attention maps of teacher and student.
DistilBERT — 40% smaller than BERT, 60% faster, and retains about 97% of its language-understanding performance.
TinyBERT, MobileBERT — more aggressive compression with multi-stage distillation.
Self-distillation — the teacher is the same model at an earlier epoch, or an ensemble of the same architecture. Can improve accuracy over training from scratch — "distillation as regularization."
Born-Again Networks — distill to a student of the same size as the teacher. Students often outperform teachers. Mysterious.
Response, feature, and relation-based distillation — three-way taxonomy of what to match: outputs, intermediate features, or relations between features.

Practical notes

Typical T ∈ [2, 10]. Start with T = 4 and α = 0.5.
KD is most effective when the student has enough capacity to express the teacher's behavior — too small a student will hit an accuracy floor.
KD does not require labeled data for distillation loss — you can distill using unlabeled data and just the teacher's outputs. Useful when you have lots of unlabeled data but only some labels.
Stacking KD with quantization and pruning is the standard edge-deployment pipeline.

Common pitfalls

Mismatched temperature between training and inference: at inference, run the student with T = 1 (standard softmax). The T > 1 softening is training-time only.
Poor teacher → poor student. Always verify the teacher's quality first.
Too high α → student ignores the teacher and just learns from hard labels.
Too low α → student can over-imitate teacher mistakes.

18 · Compression Stack for Edge Deployment

Running neural networks on microcontrollers, automotive ECUs, or battery-constrained devices. The end-to-end pipeline that takes a GPU-trained research model down to a ~100 KB binary running at 1 ms latency.

The stack, in order

Start with the right architecture. Don't try to compress a ResNet-152 for a microcontroller; start with MobileNet, EfficientNet-Lite, SqueezeNet, ShuffleNet, or a hand-designed tiny model. Architecture search tailored for mobile (NetAdapt, MnasNet, MCUNet) can find better starting points.
Distillation. Train a larger, more accurate teacher model, then distill knowledge into the smaller target architecture. Typical wins: 1–3% accuracy at no inference-time cost.
Pruning. Structured pruning (filters / channels / heads) to get real speedup on dense hardware. Iterative, with fine-tuning between rounds. Typical wins: 30–70% FLOP reduction at < 1% accuracy loss.
Quantization. INT8 as the standard target. QAT if PTQ isn't enough or you need INT4. Typical wins: 4× smaller, 2–4× faster.
Compile. Convert to a deployment format (TFLite, ONNX Runtime, TensorRT, CoreML, OpenVINO), let the compiler fuse operators, lay out memory, and tile for the target hardware. Often another 1.5–3× speedup on top.

Target hardware and toolchains

Target	Toolchain	Typical use
NVIDIA GPU (server)	TensorRT	Data-center inference
Intel CPU/iGPU	OpenVINO	Industrial / edge server
Mobile phone (ARM)	TFLite, Core ML, PyTorch Mobile	Consumer apps
Microcontroller (Cortex-M)	TFLite Micro, CMSIS-NN, microTVM	Sensor processing, keyword spotting
Automotive MCU (e.g. Infineon AURIX / TriCore)	Vendor SDKs + hand-tuned C, TargetLink, custom fixed-point code	Real-time control, safety-critical
Mobile NPU / DSP	Qualcomm SNPE, Huawei HiAI, Samsung Eden	On-device ML

Constraints that drive the pipeline

Memory — model size, runtime working memory, stack usage. Often the binding constraint on microcontrollers (< 1 MB RAM).
Compute — FLOPs per inference, feasible only if the hardware can finish in the latency budget.
Latency — real-time deadlines. Fixed-point control loops run at kHz rates; missing a deadline is a safety issue.
Energy — battery-powered or thermally constrained designs. An INT8 operation uses roughly 10× less energy than the equivalent FP32 operation.
Determinism — safety-critical systems (ASIL B/C/D) often prohibit dynamic memory, require bounded execution time, and need code that can be statically analyzed and certified.

Multiplicative effect

Each stage multiplies the previous gains:

architecture (5×) × distillation (accuracy margin) × pruning (3×) × quantization (4×) × compiler (2×) = 120×, i.e. on the order of 100× total

Real numbers vary but 50–200× total reduction from initial research model to deployed binary is typical.
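
The arithmetic behind the estimate, using the illustrative per-stage factors quoted in this section. Distillation is left out of the product because it buys accuracy margin rather than a size or speed factor, and the factors mix size and latency gains, so the result is a rough budget rather than a measurement.

```python
import math

# Per-stage reduction factors taken from the illustrative figures in this section.
stage_factors = {
    "architecture": 5,
    "pruning": 3,
    "quantization": 4,
    "compiler": 2,
}

total_reduction = math.prod(stage_factors.values())
```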

Monitoring in production

Accuracy on representative data (not just the validation set), tracked continuously: distribution shift can appear at any point after deployment.
Latency percentiles (P50, P95, P99), not just averages.
Memory footprint and peak stack depth.
Energy per inference on battery-critical devices.
A safety-critical system also tracks ASIL-level metrics: fault detection rate, single-point fault coverage, diagnostic latency.
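
Why percentiles rather than averages: a short sketch with made-up latency samples, 97 inferences at 200 µs and 3 at 2000 µs. The nearest-rank definition of a percentile used here is one of several conventions, and the function name is illustrative.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

latencies_us = [200] * 97 + [2000] * 3  # synthetic measurements

mean_us = sum(latencies_us) / len(latencies_us)
p50 = percentile(latencies_us, 50)
p95 = percentile(latencies_us, 95)
p99 = percentile(latencies_us, 99)
```

The mean (254 µs) and even P95 (200 µs) look healthy; only P99 (2000 µs) exposes the tail that would miss a 1 ms deadline.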

💡 Automotive / motor-control specifics

Fixed-point arithmetic with Q-format scaling is still the norm on TriCore and Cortex-M4 targets — INT8 quantization-aware training maps cleanly onto this.
ASIL constraints often exclude any library (including neural-net inference libraries) that hasn't been certified. Certified options exist (TÜV SÜD-certified TFLite variants, Infineon AURIX AI stacks) but the certification process influences architecture choices from the start.
Determinism trumps peak performance: a consistent 500 µs worst-case inference time is preferable to a 200 µs average with a 2 ms tail.
The AUTOSAR / TargetLink toolchain typically consumes the trained model via generated C code with static LUTs and fixed-point matmul kernels, not an ML runtime.
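
To make the Q-format point concrete, here is a Python model of Q15 fixed-point arithmetic (1 sign bit, 15 fractional bits in a 16-bit word) of the kind a generated C kernel would use. The choice of Q15, the round-to-nearest shift, and the saturating behaviour are assumptions for illustration; real targets differ in rounding and accumulator width.

```python
Q = 15
Q_MIN, Q_MAX = -(1 << Q), (1 << Q) - 1  # int16 range: -32768 .. 32767

def saturate(v):
    return max(Q_MIN, min(Q_MAX, v))

def to_q15(x):
    """Convert a float in [-1, 1) to Q15, saturating at the ends."""
    return saturate(int(round(x * (1 << Q))))

def from_q15(q):
    return q / (1 << Q)

def q15_mul(a, b):
    prod = a * b                # Q30 intermediate, fits a 32-bit accumulator
    prod += 1 << (Q - 1)        # add half an LSB so the shift rounds to nearest
    return saturate(prod >> Q)  # back to Q15, saturating instead of wrapping

def q15_dot(ws, xs):
    """Fixed-point dot product: accumulate in a wide integer, shift once at the end."""
    acc = 0
    for w, x in zip(ws, xs):
        acc += w * x
    acc += 1 << (Q - 1)
    return saturate(acc >> Q)

product = q15_mul(to_q15(0.5), to_q15(0.25))
dot = q15_dot([to_q15(0.5), to_q15(-0.25)], [to_q15(0.5), to_q15(0.5)])
```

Every operation is an integer multiply, add, or shift with a fixed instruction count, which is what makes worst-case execution time straightforward to bound.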