Neural Networks — Deep Dive Reference
By Majid Mazouchi
Companion to the interactive neural networks page. Each term below goes deeper: intuition, math, key variants, practical details, and the common pitfalls that go with them.
01 · Train / Validation / Test Splits
Three disjoint subsets of your data, each with a strictly different role. The core experimental protocol of supervised ML — get this wrong and nothing downstream is trustworthy.
The three roles
Typical ratios
For medium datasets (10³–10⁵ samples): 70/15/15 or 80/10/10 are standard. For large datasets (>10⁶), the val and test sets can be proportionally tiny — 98/1/1 is fine because 10,000 test samples already give very tight confidence intervals. For small datasets, use k-fold cross-validation instead of a single held-out set.
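To see why a 1% test split can be enough at scale, here is a rough sketch based on the normal approximation to the binomial. That approximation is an assumption that holds reasonably well when the accuracy is not too close to 0 or 1, and `accuracy_ci_halfwidth` is an illustrative helper name, not a library function.

```python
import math

def accuracy_ci_halfwidth(accuracy: float, n_test: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a measured accuracy."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_test)

for n in (100, 1_000, 10_000):
    hw = accuracy_ci_halfwidth(0.90, n)
    print(f"n_test={n:>6}: 0.90 +/- {hw:.4f}")
```

With 10,000 test samples the interval is roughly ±0.6 percentage points around a 90% accuracy; with 100 samples it is nearly ±6 points.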
Stratified splitting
For classification with imbalanced classes, random splitting can leave a split with very few minority-class samples, or none at all, purely by chance. Passing stratify=y to sklearn.model_selection.train_test_split preserves class proportions across splits. Do this by default for any classification task.
Cross-validation
Instead of one fixed split, rotate through k folds:
- k-fold CV — split into k equal chunks, train on k-1, validate on the remaining one, repeat k times, average results. k=5 and k=10 are typical.
- Stratified k-fold — same, but preserves class balance per fold.
- Time-series CV — expanding window or rolling window. Never shuffle time-series data across splits — it leaks the future into the past.
- Group k-fold — when samples share an identity (same patient, same vehicle, same operating run), put all samples from one group entirely in one fold.
- Nested CV — outer loop for honest test scoring, inner loop for hyperparameter tuning. The gold standard for small datasets but computationally expensive.
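The group-aware variant in the list above can be sketched in plain Python. This is a minimal illustration under stated assumptions: `group_kfold` is a hypothetical helper that greedily places each group into the currently smallest fold, and the `cycles` labels are made up. scikit-learn ships `GroupKFold` for the same purpose.

```python
from collections import defaultdict

def group_kfold(groups, k):
    """Yield (train_idx, val_idx) pairs so that no group is split across folds.

    groups: one group label per sample (e.g. a drive-cycle or patient id).
    Each group is assigned, largest first, to the fold that is currently smallest.
    """
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    # Largest groups first so the greedy assignment keeps fold sizes balanced.
    ordered = sorted(members, key=lambda g: (-len(members[g]), str(g)))
    folds = [[] for _ in range(k)]
    for g in ordered:
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[g])
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for j in range(k) if j != f for i in folds[j])
        yield train, val

# Hypothetical labels: eight samples drawn from four driving cycles.
cycles = ["c1", "c1", "c2", "c2", "c2", "c3", "c4", "c4"]
splits = list(group_kfold(cycles, k=2))
```

Every sample lands in exactly one validation fold, and a cycle never appears on both sides of a split.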
🚨 Leakage — the #1 silent killer
- Feature leakage — including a feature that's a proxy for the label (e.g., a post-hoc status field). Trivially inflates metrics.
- Temporal leakage — training on data from the future. Common in time-series: fitting a scaler on the full dataset leaks future statistics.
- Duplicate leakage — near-duplicate samples in both train and test (same image under different file names, same waveform with a different timestamp).
- Subject leakage — same person / vehicle / sensor appears in both train and test. The model learns the subject, not the task.
- Preprocessing leakage — computing the mean/std on the whole dataset then splitting. The test-set statistics just leaked into training.
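Preprocessing leakage is avoided by estimating statistics on the training split and reusing them everywhere else. A minimal sketch, with illustrative helper names and toy data:

```python
import statistics

def fit_zscore(train_values):
    """Estimate scaling statistics from the training split only."""
    mu = statistics.fmean(train_values)
    sigma = statistics.pstdev(train_values)
    return mu, sigma

def apply_zscore(values, mu, sigma):
    """Apply previously fitted statistics; never re-estimate them here."""
    return [(v - mu) / sigma for v in values]

train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [6.0, 7.0]

mu, sigma = fit_zscore(train)                  # statistics come from train only
train_scaled = apply_zscore(train, mu, sigma)
test_scaled = apply_zscore(test, mu, sigma)    # test reuses the train statistics
```

The scaled test values may fall outside the range seen in training. That is expected and honest; refitting the scaler on the test set would hide it.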
For motor control / automotive contexts
Samples recorded under the same operating conditions are strongly correlated, so a random split leaks by default. If your dataset contains 50 driving cycles, don't shuffle samples; split by cycle. If the data spans multiple vehicles or temperatures, stratify by those regimes, or evaluate held-out-regime generalization explicitly. A model that scores 99% on a random split can collapse to 60% on an out-of-regime split, and that gap is the real story.
02 · Signal Preprocessing
Garbage in, garbage out. Every real-world signal needs cleaning before a model sees it. This is the stage where most projects silently succeed or fail.
The canonical pipeline
In order, for raw sensor data:
- Anti-alias filter — before any downsampling. Otherwise high-frequency content aliases down into your band of interest.
- DC removal / detrending — subtract the mean, a linear fit, or a slow-moving baseline (high-pass filter with very low cutoff).
- Outlier detection and handling — z-score, IQR, Hampel, or model-based residuals.
- Smoothing — Savitzky–Golay when peaks matter, Kalman when you have a model.
- Normalization — z-score or min–max, fit on training data only.
- Feature engineering — FFT bins, rolling statistics, domain-specific transforms (Park/Clarke for three-phase motors, envelope detection for bearings).
- Regime labeling — tag each sample with operating-point metadata for stratified validation.
Normalization methods
| Method | Formula | Robust to outliers | When to use |
|---|---|---|---|
| Z-score | (x − μ) / σ | No | Data is approximately Gaussian; default choice |
| Min–max | (x − min) / (max − min) | No | Bounded range required (e.g., image pixels to [0,1]) |
| Robust | (x − median) / IQR | Yes | Heavy-tailed distributions; known outliers |
| Unit-norm | x / ‖x‖₂ | No | Direction matters more than magnitude (cosine similarity) |
Feature engineering — motor-control examples
- Park transform — rotate three-phase (a,b,c) currents into a rotating (d,q) frame aligned with the rotor. The AC problem becomes a DC problem.
- FFT-bin features — specific harmonic amplitudes (1×, 2×, 6× electrical order) are diagnostic for NVH and rotor eccentricity.
- Envelope / Hilbert transform — for bearing fault detection, the amplitude envelope of high-frequency content reveals the bearing characteristic frequencies.
- Rolling stats — 100-ms rolling mean, std, skew, kurtosis of a torque signal to detect transients.
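As an illustration of the first bullet, here is a sketch of the Clarke and Park steps. It assumes the amplitude-invariant scaling (factor 2/3) and a d-axis aligned with phase a at angle zero; sign and scaling conventions differ between texts and toolchains, so treat these choices as assumptions.

```python
import math

def clarke(ia, ib, ic):
    """Amplitude-invariant Clarke transform: (a, b, c) -> stationary (alpha, beta)."""
    alpha = (2.0 / 3.0) * (ia - 0.5 * ib - 0.5 * ic)
    beta = (2.0 / 3.0) * (math.sqrt(3.0) / 2.0) * (ib - ic)
    return alpha, beta

def park(alpha, beta, theta):
    """Rotate (alpha, beta) by the electrical angle theta into the (d, q) frame."""
    d = alpha * math.cos(theta) + beta * math.sin(theta)
    q = -alpha * math.sin(theta) + beta * math.cos(theta)
    return d, q

# Balanced three-phase currents with amplitude 10 A, sampled at eight rotor angles.
amplitude = 10.0
dq = []
for step in range(8):
    theta = 2.0 * math.pi * step / 8
    ia = amplitude * math.cos(theta)
    ib = amplitude * math.cos(theta - 2.0 * math.pi / 3.0)
    ic = amplitude * math.cos(theta + 2.0 * math.pi / 3.0)
    dq.append(park(*clarke(ia, ib, ic), theta))
```

For this balanced input every (d, q) pair equals (10, 0) up to rounding: the sinusoidal phase currents have become constants in the rotating frame.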
🚨 Traps
- Fitting a scaler on train+val combined leaks statistics from val.
- Decimating before anti-aliasing produces irreversible aliasing artifacts.
- Replacing outliers with the mean biases the distribution and can break downstream features (e.g., rolling std).
- Normalizing per-sample when you should normalize per-feature (or vice versa) — know your axis conventions.
03 · Savitzky–Golay Filter
A local polynomial least-squares smoother that preserves peaks, amplitudes, and derivatives — what you want when the features of the signal carry information, not just its low-frequency trend.
How it works
For every output sample, SG takes a window of W = 2k+1 neighboring input samples, fits a polynomial of degree P ≤ 2k in a least-squares sense, and sets the output equal to the polynomial's value at the center of the window:
$$y_i = \sum_{j=-k}^{k} c_j \, x_{i+j}$$
where the coefficients c_j come from the first row of (AᵀA)⁻¹Aᵀ and A is the Vandermonde matrix whose rows are [1, j, j², …, jᴾ] for j = −k, …, k. The magic is that for any fixed window size W and polynomial order P, the coefficients are constant, so SG reduces to an FIR filter with a fixed impulse response.
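The coefficient construction described above can be sketched with NumPy. The helper name `savgol_coeffs` is chosen here for illustration (SciPy provides a function of the same name in `scipy.signal`), and the toy signal is an assumption.

```python
import numpy as np

def savgol_coeffs(window: int, polyorder: int) -> np.ndarray:
    """Smoothing weights c_j for j = -k..k, where window = 2k + 1."""
    if window % 2 == 0 or polyorder >= window:
        raise ValueError("window must be odd and larger than polyorder")
    k = window // 2
    j = np.arange(-k, k + 1, dtype=float)
    A = np.vander(j, polyorder + 1, increasing=True)   # rows [1, j, j^2, ..., j^P]
    # First row of the pseudo-inverse (A^T A)^-1 A^T: the fitted value at j = 0.
    return np.linalg.pinv(A)[0]

c = savgol_coeffs(5, 2)          # classic 5-point quadratic smoother
boxcar = savgol_coeffs(5, 0)     # degree 0 reduces to a plain moving average

# Apply as a fixed FIR filter; a quadratic signal passes through unchanged.
x = np.arange(10.0) ** 2
y = np.convolve(x, c[::-1], mode="valid")
```

For W = 5 and P = 2 this yields the textbook weights [−3, 12, 17, 12, −3] / 35, and for P = 0 all five weights equal 1/5.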
Key parameters
SG vs moving average
A plain moving average is equivalent to SG with P = 0 — it fits a constant (the mean) to the window. It attenuates peaks because a peak is, by definition, deviation from a local mean. SG with P ≥ 2 fits curvature, so the output polynomial bends up at a peak the way the data does, and the peak survives.
Frequency response
SG's magnitude response is flatter near DC than a boxcar moving average and has milder ripple in the passband, but worse stopband rejection. If you need clean stopband rejection, a Butterworth or Chebyshev IIR is better. If you need to preserve waveform shape, SG wins.
Real-time use
The standard symmetric SG needs (W−1)/2 future samples, so a real-time implementation lags by (W−1)/2 samples. If that matters, you need a causal SG (window uses only past samples, polynomial extrapolated to the current sample). Causal SG has worse noise rejection but removes the window delay; it's a standard trick in motor control observers.
Edge handling
At the first and last (W−1)/2 samples, a symmetric window runs off the edge. Options: pad with the edge value (simple, biases the ends), mirror the signal (smooth but adds a spurious reflection), extrapolate via the local polynomial (best quality, more compute), or just return zeros at the edges (simplest, loses samples).
💡 In practice
- Start with W=11, P=3 for general-purpose smoothing. Sweep to taste.
- For smoothing ahead of peak detection, use SG. For baseline removal, use a high-pass filter or a much longer SG window subtracted from the signal.
- Implementations: scipy.signal.savgol_filter (derivative output via deriv=) and MATLAB sgolayfilt (derivative filters are available from the coefficients returned by sgolay).
04 · Outlier Removal
Not every weird sample is a problem — but some are sensor glitches, transcription errors, or events so rare they'll dominate a loss function. The art is distinguishing signal from noise.
Detection methods
| Method | How it flags | Assumption | Robust to clusters? |
|---|---|---|---|
| Z-score | |x − μ| / σ > τ | Gaussian | No |
| IQR / Tukey | Outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] | Reasonably symmetric | Somewhat |
| Hampel | |x − median| / (1.4826·MAD) > τ | Symmetric, moderate tails | Yes |
| Isolation Forest | Average isolation depth in random trees | None (unsupervised) | Yes |
| LOF | Local density vs neighbors | Density-based | Yes |
| Model residuals | Large residual from a first-pass model | First model fits "normal" data | Yes |
Z-score breakdown — why it fails on clusters
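Consider 1000 clean samples around 0 with unit standard deviation, plus 50 spike outliers near 10. The mean shifts to ~0.5 and the std inflates to ~2.3. The real outliers now have z-scores of only ~4, not far above a 3σ threshold, and a somewhat larger cluster (about 100 spikes) pulls them down to the threshold itself, so the outliers mask themselves. This is the "breakdown point" problem: the mean and standard deviation behind the classical z-score have a 0% breakdown point, so even a single extreme value can move them arbitrarily. Hampel uses the median and MAD, which have a 50% breakdown point: just under half of your data can be outliers and the estimator still works.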
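The example above can be reproduced with the standard library. The sketch assumes unit-variance Gaussian clean data and a fixed seed; exact values depend on the random draw, so only rough magnitudes should be read from it.

```python
import random
import statistics

rng = random.Random(0)
clean = [rng.gauss(0.0, 1.0) for _ in range(1000)]
data = clean + [10.0] * 50           # 50 spike outliers at 10

mean = statistics.fmean(data)
std = statistics.pstdev(data)
median = statistics.median(data)
mad = statistics.median([abs(x - median) for x in data])

z_score = abs(10.0 - mean) / std                      # classical score of a spike
hampel_score = abs(10.0 - median) / (1.4826 * mad)    # robust score of the same spike

print(f"mean={mean:.2f} std={std:.2f} z={z_score:.1f} hampel={hampel_score:.1f}")
```

The spikes inflate the mean and standard deviation that are used to judge them, while the median and MAD barely move, so the robust score stays far above any reasonable threshold.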
Removal vs flagging vs winsorizing
- Remove — drop the sample. Simple; loses information; changes the sample count.
- Flag — keep the value, add a boolean "is_outlier" feature. The model can learn to use or ignore it.
- Winsorize — clip to a threshold (e.g., the 99th percentile). Preserves sample count, bounds influence. Good default for heavy-tailed but real phenomena.
- Impute — replace with the median or a model-based estimate. Use sparingly; can mask real signal.
Domain-aware handling
A current spike in a PMSM could be a sensor glitch or a real fault. A 500°C temperature reading on a silicon die is impossible; a 500°C reading on a combustion chamber is Tuesday. Always apply domain-specific bounds first (the "possible reading" filter), then statistical methods for whatever's left.
🚨 Traps
- Computing outlier thresholds on the full dataset leaks test statistics into training.
- Aggressive removal on small datasets creates biased estimates.
- Setting a single global threshold when the noise level is regime-dependent (e.g., high-speed samples are intrinsically noisier) — use per-regime thresholds.
- Calling "rare but real" events outliers and throwing them out. In fault detection, rare events are the whole point.
05 · Temperature Grouping (Regime Binning)
Physical systems are not i.i.d. across operating conditions. Binning by temperature — or any regime variable — lets you measure, validate, and train in a way that respects the underlying physics.
Why it matters
A dataset drawn from a motor operating at 20°C, 60°C, and 100°C mixes three physically distinct regimes: winding resistance R rises, magnet flux ψm drops, viscosity in bearings changes, sensor offsets drift. A single global model trained on the pool implicitly averages these effects. Its overall RMSE might be excellent while its per-regime error at 100°C is catastrophic — and you won't notice unless you bin.
Binning methods
- Equal-width — fixed intervals (e.g., 20°C bins). Easiest to interpret; can leave bins empty.
- Equal-frequency (quantile) — bins contain the same number of samples. Better statistical power per bin; bin edges move with the data.
- K-means of the regime variable — data-driven clusters. Best when the regime variable has natural clusters.
- Domain-defined — cold (< 40°C), nominal (40–80°C), hot (> 80°C). Aligns with engineering intuition and derating tables.
What to do with bins
- Per-bin statistics — mean, std, min, max, distribution shape. Answers "is my training data balanced across conditions?"
- Stratified train/val/test splits — ensure every bin is represented in every split, so test-set metrics are honest.
- Stratified validation reporting — always report per-regime error alongside overall error.
- Regime as a feature — one-hot encode the bin (or feed the continuous regime variable directly). Lets the model condition its predictions on the regime.
- Specialist models per bin — mixture-of-experts style. Train a dedicated model per regime with a gating function on the regime variable. Overkill for most problems but powerful when regimes are physically very distinct.
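Per-regime reporting, as recommended in the list above, takes only a few lines. In this sketch the bin edges follow the domain-defined example (cold below 40°C, hot above 80°C), while the helper names and the toy residuals are hypothetical.

```python
import math
from collections import defaultdict

def regime_of(temp_c: float) -> str:
    """Domain-defined temperature bins with edges at 40 and 80 degrees Celsius."""
    if temp_c < 40.0:
        return "cold"
    if temp_c <= 80.0:
        return "nominal"
    return "hot"

def per_regime_rmse(temps, y_true, y_pred):
    """Return the overall RMSE and a dict of RMSE per temperature regime."""
    sq_err = defaultdict(list)
    for t, yt, yp in zip(temps, y_true, y_pred):
        sq_err[regime_of(t)].append((yt - yp) ** 2)
    all_err = [e for errs in sq_err.values() for e in errs]
    overall = math.sqrt(sum(all_err) / len(all_err))
    per_bin = {name: math.sqrt(sum(errs) / len(errs)) for name, errs in sq_err.items()}
    return overall, per_bin

# Hypothetical predictions: small errors everywhere except the single hot sample.
temps = [20, 25, 30, 50, 60, 70, 75, 100]
y_true = [1.0] * 8
y_pred = [1.1, 0.9, 1.0, 1.1, 1.0, 0.9, 1.0, 2.0]
overall, per_bin = per_regime_rmse(temps, y_true, y_pred)
```

Here the overall RMSE is about 0.36, which looks tolerable, while the hot bin alone sits at 1.0 and the other two bins are below 0.1. The pooled number hides exactly the failure the per-bin report exposes.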
Connection to other techniques
Regime binning is essentially supervised clustering. It's a structured version of what Gaussian Mixture Models or hierarchical clustering discover automatically. When you know the physics, always impose the structure — it's more sample-efficient than making the model rediscover it.
💡 In practice
- For motor control: bin by (speed, torque, temperature) jointly — a 3D operating-point map. This is already how you'd look at a flux map; make your training and validation protocol match.
- Start with 3–5 bins per dimension. More bins = more statistical noise per bin; fewer bins = mixed regimes within a bin.
- Report per-bin residuals as a heatmap during debugging. It instantly reveals which part of the operating space your model is failing on.
06 · Underfitting
High bias. The model is fundamentally too simple — or too constrained — to capture the patterns in the data. It can't even do well on the data it was trained on.
Symptoms
- Training loss is high and plateaus early.
- Validation loss is also high, and close to the training loss (small train-val gap).
- The model's predictions look systematically biased — e.g., it predicts the mean for every input.
- Residuals show structure (not just noise) — if you plot prediction error versus input, you see patterns.
Causes and fixes
| Cause | Fix |
|---|---|
| Model has insufficient capacity | Add layers, add neurons, use a richer architecture |
| Features are insufficient | Add features, do feature engineering, incorporate domain knowledge |
| Too much regularization | Reduce λ (L1/L2), reduce dropout rate |
| Learning rate too high (can't converge) | Reduce LR, use a warmup schedule |
| Learning rate too low (gets stuck) | Increase LR, use a cyclic schedule |
| Poor initialization | He / Xavier initialization; load a pretrained backbone |
| Wrong loss function for the task | Check objective matches the problem (e.g., regression vs classification) |
| Insufficient training | Train for more epochs if the training loss is still decreasing |
Diagnosing the cause
The standard test is to train the same architecture on a tiny subset (say 32 samples) and see if it can perfectly overfit it. If yes, the architecture is capable and the problem is optimization or regularization. If no, the architecture or representation is too weak. This "can my model memorize a batch?" test is one of the first things to run on any new setup.
07 · Overfitting
High variance. The model has too much capacity relative to the data, and it memorizes the noise in the training set instead of learning the underlying signal.
Symptoms
- Training loss keeps decreasing.
- Validation loss starts decreasing, then increases — the classic U-shape.
- Large gap between training and validation accuracy.
- Predictions are wildly confident even when wrong.
- Small perturbations to the input produce large changes in the output (a large Lipschitz constant).
The bias-variance decomposition
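For squared-error loss, the expected generalization error at a point x decomposes as:
$$\mathbb{E}\big[(y - \hat f(x))^2\big] = \mathrm{Bias}\big[\hat f(x)\big]^2 + \mathrm{Var}\big[\hat f(x)\big] + \sigma^2$$
where the expectation is taken over training sets and label noise, and σ² is the irreducible noise variance.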
Underfitting = high bias. Overfitting = high variance. Increasing model capacity pushes bias down but variance up. More data reduces variance without increasing bias. Regularization trades some variance for some bias. The "sweet spot" is model-and-data dependent; that's what validation metrics are for.
Fixes, in order of cost
- More data — always the best fix. Data augmentation gets you part of the way for free.
- Early stopping — free; just stop at validation minimum.
- Regularization — L2, L1, dropout, weight decay.
- Smaller model — reduce depth, width, or switch to a simpler architecture.
- Label smoothing — prevents overconfident outputs.
- Ensembling — average multiple models to reduce variance.
- Noise injection — add noise to inputs, weights, or activations during training.
🚨 Overfitting can be subtle
- Overfitting to the validation set — after 500 hyperparameter trials, your val score is optimistic. This is why you also need a held-out test set.
- Selection bias — reporting the best of many runs. Always report mean ± std across seeds.
- Distribution shift masquerading as overfitting — train–val gap can be caused by genuine distribution mismatch rather than capacity. Fix the data, not the regularizer.
08 · Gradient Descent
The optimization engine of almost all deep learning. Iteratively move the weights opposite to the gradient of the loss.
Update rule
$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$
where η is the learning rate and ∇θL is the gradient of the loss with respect to the parameters. This is the "steepest descent" update: locally, the negative gradient is the direction that most rapidly decreases the loss.
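A minimal sketch of the update on a toy quadratic loss. The loss, starting point, and step sizes are illustrative assumptions, chosen so that one learning rate converges and a slightly larger one diverges.

```python
def gradient_descent(grad, theta0, lr, steps):
    """Repeat theta <- theta - lr * grad(theta) and return the visited points."""
    theta = list(theta0)
    path = [tuple(theta)]
    for _ in range(steps):
        g = grad(theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
        path.append(tuple(theta))
    return path

# Toy loss L(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2): shallow in one axis, steep in the other.
def toy_grad(theta):
    return [theta[0], 10.0 * theta[1]]

path_ok = gradient_descent(toy_grad, [1.0, 1.0], lr=0.1, steps=100)
path_bad = gradient_descent(toy_grad, [1.0, 1.0], lr=0.25, steps=100)
```

With lr = 0.1 both coordinates shrink toward the minimum at the origin. With lr = 0.25 the steep coordinate is multiplied by −1.5 at every step and blows up, because for this loss the step size must stay below 2/10.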
Variants
| Variant | Gradient uses | Tradeoff |
|---|---|---|
| Batch GD | All training samples | Smooth but slow; memory-intensive |
| SGD | One sample at a time | Very noisy; can escape saddle points; slow per-epoch |
| Mini-batch SGD | Batch of 32–2048 samples | The practical default. Good compromise |
Beyond vanilla SGD
- Momentum — add a running average of past gradients. Accelerates convergence in ravines; smooths oscillation.
vₜ₊₁ = β·vₜ + ∇L(θₜ); θₜ₊₁ = θₜ − η·vₜ₊₁. Typical β = 0.9.
- Nesterov momentum — "look ahead" variant. Evaluate the gradient at the look-ahead point θ − ηβv (where the momentum step alone would land) rather than at θ. Slightly better convergence on convex problems; marginal on deep nets.
- AdaGrad — per-parameter LR that shrinks with the accumulated squared gradient. Great for sparse features; stalls on deep nets as the LR decays to zero.
- RMSProp — AdaGrad with an exponential moving average instead of a running sum. Fixes the stall.
- Adam — RMSProp + momentum + bias correction.
Defaults: β₁=0.9, β₂=0.999, ε=10⁻⁸. LR typically 1e-3 to 1e-4.
m = β₁·m + (1 − β₁)·g # first moment
v = β₂·v + (1 − β₂)·g² # second moment
m̂ = m / (1 − β₁ᵗ) # bias correction
v̂ = v / (1 − β₂ᵗ)
θ = θ − η · m̂ / (√v̂ + ε)
- AdamW — decoupled weight decay. The weight-decay term is applied directly to θ, not through the gradient. This matters because Adam's adaptive denominator distorts L2 regularization. Always prefer AdamW over Adam with weight decay.
- LAMB / LARS — layer-wise scaling for very large batch sizes (> 8k). Used for large-batch training (LARS for ResNets, LAMB for BERT-style transformers).
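The Adam update and its decoupled weight-decay variant can be sketched on plain Python lists. This is a teaching sketch, not a framework implementation: `adam_step` and `init_state` are illustrative names, and applying the decay before the adaptive step is one common ordering assumed here.

```python
import math

def init_state(n):
    """Fresh optimizer state for n parameters."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}

def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam update; weight_decay > 0 gives the decoupled (AdamW-style) variant."""
    state["t"] += 1
    t = state["t"]
    new_theta = []
    for i, (p, g) in enumerate(zip(theta, grad)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g       # first moment
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g   # second moment
        m_hat = state["m"][i] / (1 - beta1 ** t)                      # bias correction
        v_hat = state["v"][i] / (1 - beta2 ** t)
        p = p - lr * weight_decay * p                  # decay acts on the weight directly
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive gradient step
        new_theta.append(p)
    return new_theta

# First step on L(theta) = 0.5 * theta^2 from theta = 1.0 (the gradient equals theta).
first = adam_step([1.0], [1.0], init_state(1), lr=0.1)

# Zero gradient: only the decoupled decay moves the weight.
decayed = adam_step([1.0], [0.0], init_state(1), lr=0.1, weight_decay=0.5)
```

Thanks to bias correction, the first step has magnitude close to the learning rate regardless of the gradient scale (1.0 moves to about 0.9). In the second call the weight shrinks by lr × weight_decay (1.0 to 0.95) even though the gradient is zero, which shows that the decay bypasses the adaptive denominator.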
Optimization landscape challenges
- Saddle points — far more common than local minima in high dimensions. SGD's noise helps escape them.
- Plateaus — flat regions where gradients vanish. Adaptive methods help.
- Ravines — steep in one direction, shallow in another. Momentum is essential here.
- Exploding gradients — solve with gradient clipping (norm or value).
- Vanishing gradients — solve with residual connections, careful initialization (He/Xavier), batch norm, or activation choice (ReLU family instead of sigmoid/tanh).
09 · Learning Rate
The single most important hyperparameter. If you can only tune one thing, tune this. Too large and training diverges; too small and it crawls or stalls.
Intuition
The loss landscape is high-dimensional and curved. The learning rate controls how big a step you take along the gradient. In a smooth bowl-shaped loss, a large LR overshoots; a small LR takes forever. In a narrow ravine, a large LR bounces off the walls. The ideal LR depends on the local curvature — which is why adaptive optimizers exist, and why scheduling matters.
How to find a good starting LR
- LR range test (Smith 2017) — start with a tiny LR (e.g., 10⁻⁷), double it every few iterations, record loss. Plot loss vs LR. The loss decreases sharply, hits a minimum, then diverges. Pick an LR roughly an order of magnitude below where loss starts climbing.
- Default guesses by optimizer — Adam: 1e-3 or 3e-4. SGD with momentum: 1e-1 or 1e-2 (higher because no adaptive denominator). Fine-tuning from pretrained weights: 10–100× smaller than from-scratch.
Scheduling — mandatory in practice
| Schedule | Shape | When to use |
|---|---|---|
| Constant | Flat | Debugging; almost never in production |
| Step decay | Cliff drops every N epochs | Computer vision legacy; interpretable |
| Exponential | Smooth geometric decay | General-purpose |
| Cosine annealing | Half-cosine down to zero | Default for most modern nets |
| Warmup + cosine | Linear ramp then cosine | Transformers, large batches, deep nets |
| SGDR (warm restarts) | Periodic cosine restarts | Ensembling snapshots; exploring basins |
| Reduce on plateau | Drop when val loss stalls | Reactive; good when you don't know epoch count |
| Cyclical LR (CLR) | Triangular waves | Alternative to SGDR; exploration-heavy |
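The "warmup + cosine" row is short enough to write out. The function name and the step counts below are illustrative assumptions; frameworks provide their own scheduler classes.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, final_lr=0.0):
    """Linear ramp up to peak_lr, then half-cosine decay down to final_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [
    warmup_cosine_lr(s, total_steps=10_000, warmup_steps=1_000, peak_lr=3e-4)
    for s in range(10_001)
]
```

The rate reaches its peak at the end of warmup, passes through half the peak midway through the cosine phase, and ends at `final_lr`.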
Linear scaling rule
When multiplying the batch size by k, multiply the LR by k (Goyal et al. 2017). This keeps the total weight change per epoch roughly constant, because an epoch now contains k times fewer steps. Combined with a gradual warmup, the rule held up to batch sizes of about 8k in that work; beyond that, accuracy degrades and layer-wise methods such as LARS are used instead.
LR × optimizer interaction
- Adam's effective per-parameter LR is already scaled by the inverse square root of a running average of the squared gradient, so a global LR schedule matters less for Adam than for SGD, but it still matters.
- In the first steps, Adam's second-moment estimate rests on very few gradients and is noisy, so early updates can be erratic. A short warmup helps.
- SGD benefits strongly from cosine annealing. Adam benefits modestly.
💡 A common recipe that just works
- AdamW optimizer, LR = 3e-4, weight decay = 0.01.
- Linear warmup for 500–2000 steps, then cosine decay to zero.
- Gradient clipping at norm 1.0.
- Adjust LR ±3× based on LR range test output.
10 · Regularization
Anything that discourages the model from memorizing the training set. Most regularizers work by imposing a prior — on the weights, on the activations, or on the data itself.
Weight-space regularization
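Add a penalty on the weights to the loss:
$$L_{\text{total}}(w) = L(w) + \lambda \, R(w)$$
where λ sets the regularization strength and R(w) is one of: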
- L2 (Ridge / weight decay) — R(w) = ½‖w‖². Gaussian prior on weights. Pulls weights smoothly toward zero; solutions stay smooth. Nearly always on; λ typically 1e-5 to 1e-3.
- L1 (Lasso) — R(w) = ‖w‖₁. Laplace prior. Pushes many weights to exactly zero → sparsity and implicit feature selection. Use when you believe only a subset of inputs matter.
- Elastic Net — α‖w‖₁ + (1−α)½‖w‖². Combines L1's sparsity with L2's stability. Standard in classical ML, less common in deep learning.
L2 vs weight decay in Adam
For SGD, adding ½λ‖w‖² to the loss and shrinking the weights by ηλw at every step (weight decay) are mathematically identical. For Adam, they're not: the adaptive denominator rescales the L2 gradient per parameter, so weights with a history of large gradients are regularized less. AdamW fixes this by applying weight decay directly to the weights, outside the adaptive machinery. Always use AdamW over Adam + L2.
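A short derivation makes the difference concrete. It assumes the penalty is written as (λ/2)‖w‖² and abbreviates Adam's bias-corrected moments as m̂ and v̂ without spelling out how they are accumulated.

```latex
\begin{aligned}
\text{SGD + L2:}\quad
  w_{t+1} &= w_t - \eta\,\bigl(\nabla L(w_t) + \lambda w_t\bigr)
           = (1 - \eta\lambda)\,w_t - \eta\,\nabla L(w_t) \\[4pt]
\text{Adam + L2:}\quad
  w_{t+1} &= w_t - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},
  \qquad \hat m_t,\ \hat v_t \ \text{built from}\ g_t = \nabla L(w_t) + \lambda w_t \\[4pt]
\text{AdamW:}\quad
  w_{t+1} &= (1 - \eta\lambda)\,w_t - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},
  \qquad \hat m_t,\ \hat v_t \ \text{built from}\ g_t = \nabla L(w_t)
\end{aligned}
```

In the first line the L2 term and the multiplicative shrink are the same update. In the second line the λw term is divided by √v̂ along with the rest of the gradient, so its strength varies per parameter. The third line restores the uniform shrink.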
Dropout
During training, independently drop each activation with probability p. At inference, keep all activations and scale by (1 − p) (or use inverted dropout: scale training activations by 1/(1−p) instead). This is approximately equivalent to training an exponentially large ensemble of sub-networks and averaging at inference.
- Typical p = 0.2–0.5 for fully-connected layers.
- Dropout is much less effective in convolutional layers — use spatial dropout (drop whole feature maps) or skip it.
- In transformers, attention dropout and residual dropout are both common, each with p ≈ 0.1.
- DropConnect is the weight-space analog — drop individual weights rather than activations.
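Inverted dropout, mentioned in the paragraph above the list, can be sketched as follows. The function name and the explicit `rng` argument are illustrative choices, made so that the example is reproducible.

```python
import random

def inverted_dropout(activations, p, training, rng):
    """Zero each activation with probability p and rescale survivors by 1/(1-p) in training."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)      # inference: identity, no rescaling needed
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
x = [1.0] * 10_000
train_out = inverted_dropout(x, p=0.5, training=True, rng=rng)
eval_out = inverted_dropout(x, p=0.5, training=False, rng=rng)
```

Because survivors are scaled up during training, the expected activation stays the same (the mean of `train_out` is close to 1), and inference needs no correction at all.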
Normalization as regularization
Batch normalization and layer normalization are primarily optimization tools (they rescale activations to stabilize training), but they also act as mild regularizers. BN adds noise via the per-batch statistics; LN doesn't, but does restrict the hypothesis class. With BN present, dropout is often redundant or counterproductive.
Data-space regularization
Usually stronger than weight-space regularization for modern deep nets:
- Data augmentation — flips, crops, rotations, color jitter, time-warping, noise injection. Effectively increases dataset size.
- Mixup — train on linear interpolations of pairs of samples and their labels.
- CutMix — paste a patch from one image into another, mix the labels proportionally.
- Label smoothing — replace hard 0/1 labels with e.g. 0.05/0.95. Prevents overconfident outputs; improves calibration.
- Stochastic depth — randomly skip entire residual blocks during training (regularization + speedup).
Early stopping
Monitor validation loss; stop when it hasn't improved for N epochs (the "patience"). Save the best-so-far checkpoint and restore it at the end. Free, effective, always enable it.
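The patience rule can be expressed as a small function over a recorded validation curve. The name `early_stopping_epoch` and the loss values are illustrative; in a real loop the same logic runs once per epoch and the "best" branch is where a checkpoint would be written.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch    # a checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch               # no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch

losses = [0.90, 0.70, 0.60, 0.55, 0.57, 0.56, 0.58, 0.60, 0.61]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
```

With a patience of 3, training stops at epoch 6 and the checkpoint from epoch 3 (loss 0.55) is the one to restore.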
11 · Hyperparameter Search
The outer loop around training. You can't gradient-descend through the choice of architecture, the learning rate, or the batch size — you have to search over them explicitly.
The three classical strategies
| Strategy | Mechanism | Strengths | Weaknesses |
|---|---|---|---|
| Grid search | Evaluate every combination | Exhaustive; easy to parallelize; reproducible | Cost is exponential in dimensions; wastes budget on insensitive axes |
| Random search | Sample uniformly from the space | Usually beats grid at equal budget (Bergstra & Bengio 2012); trivially parallel | No learning from past trials |
| Bayesian opt | Fit surrogate, use acquisition function to pick next | Sample-efficient; principled exploration/exploitation | Serial bottleneck; surrogate can be miscalibrated |
Why random beats grid
In most problems, a few hyperparameters matter a lot and many matter very little. Grid search wastes budget sampling the irrelevant axes densely. Random search naturally allocates more distinct values to the axes that matter. This was the surprising empirical finding of Bergstra & Bengio: with the same number of trials, random search finds better configurations than grid search on most deep-learning problems.
Bayesian optimization in detail
Two components:
- Surrogate model — approximates the objective function given the trials seen so far. Gaussian Process (GP) is the textbook choice; Tree-structured Parzen Estimator (TPE) is Optuna's default; Random Forest is used by SMAC.
- Acquisition function — picks the next point to try, balancing exploration (areas of high uncertainty) against exploitation (areas of high predicted value). Common choices:
- Expected Improvement (EI) — expected amount by which we'd beat the current best.
- Upper Confidence Bound (UCB) — mean + κ · std. Tune κ for exploration.
- Probability of Improvement (PI) — probability of beating the current best.
- Thompson sampling — sample a function from the posterior, optimize it, use its argmax. Naturally parallelizable.
Multi-fidelity methods
When each trial is expensive (minutes to days), you can't afford many. Multi-fidelity methods start many trials cheaply and kill bad ones early:
- Successive Halving — run N trials for a short budget, keep the top half, double their budget, repeat.
- Hyperband — runs successive halving with multiple initial budgets. Addresses the "how short should the short budget be?" question.
- BOHB — Hyperband + Bayesian optimization. Uses Bayes to pick the configurations and Hyperband to allocate budget.
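- ASHA — asynchronous variant of successive halving. Promotes a configuration as soon as enough results are in, rather than waiting for a whole round to finish, so it parallelizes well; widely used for deep-learning sweeps.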
- PBT (Population-Based Training) — train many copies in parallel; periodically copy weights from good to bad trials while perturbing hyperparameters. Combines training and searching.
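Successive halving, the building block of several methods in the list above, fits in a few lines. In this sketch `toy_val_loss` is a hypothetical stand-in for "train for `budget` epochs and return the validation loss", and the candidate learning rates are made up.

```python
import math

def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the configs each round while multiplying the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))  # lower is better
        survivors = scored[:max(1, len(scored) // eta)]
        budget *= eta
    return survivors[0]

def toy_val_loss(cfg, budget):
    """Hypothetical objective: best at lr = 1e-3, improving with more budget."""
    return abs(math.log10(cfg["lr"]) + 3.0) + 1.0 / budget

candidates = [{"lr": 10.0 ** e} for e in (-5, -4, -3, -2, -1, 0, -3.5, -2.5)]
best = successive_halving(candidates, toy_val_loss)
```

Eight candidates are cut to four, then two, then one, with the per-trial budget doubling each round. Most of the compute is therefore spent on the configurations that looked promising early.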
Practical guidelines
- Use log scale for learning rate, weight decay, regularization strengths. Use linear for integer counts.
- Define a reasonable search space before starting. Don't let Bayesian opt explore LR = 10¹⁰ — use sensible bounds.
- Start with random search for ~20 trials to characterize the space, then switch to Bayesian opt with a warm start.
- Report mean ± std across seeds, not just best-of-many.
- Libraries: Optuna (flexible, TPE default), Ray Tune (distributed), W&B Sweeps (integrated UI), Hyperopt (older TPE).
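The first guideline, sampling scale-type hyperparameters on a log scale, looks like this for a plain random search. The ranges and the `sample_config` helper are illustrative assumptions, not recommendations.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_config(rng):
    """One random-search trial with hypothetical search ranges."""
    return {
        "lr": sample_log_uniform(rng, 1e-5, 1e-1),
        "weight_decay": sample_log_uniform(rng, 1e-6, 1e-2),
        "num_layers": rng.randint(1, 6),     # integer count: linear scale
    }

rng = random.Random(0)
trials = [sample_config(rng) for _ in range(200)]
```

On a log scale about half of the sampled learning rates fall below the geometric midpoint 1e-3. Uniform sampling on a linear scale would put almost all of them above 1e-2.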
🚨 Traps
- Tuning on the test set. All hyperparameters must be chosen using the validation set.
- Using the same seed for all trials — masks variance.
- Searching too wide a space early. Exploration budget scales poorly with space volume.
- Forgetting to fix the random seed for the data split — different folds produce different "best" hyperparameters.
12 · Backpropagation
Reverse-mode automatic differentiation applied to a computation graph. The algorithm that makes deep learning computationally feasible.
The core idea
Given a computation graph representing the loss as a composition of differentiable operations, backpropagation applies the chain rule backwards from the output to every parameter. The key trick: each intermediate gradient is computed exactly once and reused.
Why reverse-mode?
For a function f: ℝⁿ → ℝᵐ (n inputs, m outputs), you have two choices:
- Forward-mode autodiff — cost scales with n (the number of inputs). Good when n is small.
- Reverse-mode autodiff — cost scales with m (the number of outputs). Good when m is small.
Neural network training has m = 1 (the scalar loss) and n = millions (the weights). Reverse-mode wins by a factor of n/m = millions. This is why all deep-learning frameworks use reverse-mode.
The forward-backward pattern
# forward pass
h₁ = W₁ · x
a₁ = ReLU(h₁)
h₂ = W₂ · a₁
ŷ = softmax(h₂)
L = CE(ŷ, y)
# backward pass — apply chain rule
dL/dŷ = −y / ŷ # cross-entropy derivative (elementwise)
dL/dh₂ = ŷ − y # softmax Jacobian applied to dL/dŷ; softmax + CE simplify to this
dL/dW₂ = dL/dh₂ · a₁ᵀ
dL/da₁ = W₂ᵀ · dL/dh₂
dL/dh₁ = dL/da₁ ⊙ (h₁ > 0) # ReLU derivative
dL/dW₁ = dL/dh₁ · xᵀ
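The same forward and backward pass can be run with NumPy and checked against a finite difference. The layer sizes, the random seed, and the one-hot target are arbitrary choices for illustration.

```python
import numpy as np

def forward_backward(W1, W2, x, y):
    """Loss and weight gradients for x -> ReLU(W1 x) -> softmax(W2 a1) with cross-entropy."""
    h1 = W1 @ x
    a1 = np.maximum(h1, 0.0)
    h2 = W2 @ a1
    z = h2 - h2.max()                       # shift for numerical stability
    y_hat = np.exp(z) / np.exp(z).sum()
    loss = -np.sum(y * np.log(y_hat))

    dh2 = y_hat - y                         # softmax + cross-entropy combined
    dW2 = np.outer(dh2, a1)
    da1 = W2.T @ dh2
    dh1 = da1 * (h1 > 0)                    # ReLU derivative
    dW1 = np.outer(dh1, x)
    return loss, dW1, dW2

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x, y = rng.normal(size=3), np.array([1.0, 0.0])

loss, dW1, dW2 = forward_backward(W1, W2, x, y)

# Central finite difference on one entry of W1 as an independent check.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (forward_backward(W1_plus, W2, x, y)[0]
           - forward_backward(W1_minus, W2, x, y)[0]) / (2 * eps)
```

Comparing `numeric` with `dW1[0, 0]` is the classic gradient check: one backward pass yields every entry of both gradient matrices, whereas the finite-difference route needs two forward passes per parameter.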
Complexity
The backward pass costs roughly the same as the forward pass — within a constant factor of 2–3×. Memory is the bigger cost: to compute gradients, you must cache all intermediate activations from the forward pass. This is why memory usage scales with depth × batch size × activation size.
Gradient flow issues
- Vanishing gradients — in deep networks with saturating activations (sigmoid, tanh), the gradient shrinks exponentially with depth. Solutions: ReLU family activations, residual connections (ResNets), careful initialization (He/Xavier).
- Exploding gradients — especially in RNNs and very deep networks. Solution: gradient clipping (clip the norm of the gradient vector to some max value).
Memory-saving tricks
- Gradient checkpointing — don't cache all activations; recompute some during the backward pass. Trades compute for memory.
- Mixed-precision training — keep master weights in FP32, do forward/backward in FP16/BF16. Roughly halves activation memory and often speeds up compute on modern GPUs.
- Gradient accumulation — simulate a large batch by running several small batches and summing their gradients before stepping. Essential when a single large batch won't fit in memory.
Dynamic vs static graphs
- Dynamic graphs (PyTorch, TF eager) — the graph is built on the fly during the forward pass. Easy to debug and to use Python control flow.
- Static graphs (TF 1.x, XLA, TorchScript) — graph is defined once and reused. Enables more aggressive optimization (operator fusion, memory planning). Modern frameworks JIT-compile dynamic graphs to get the best of both.
13 · Feedforward Neural Networks (MLPs)
The simplest deep architecture: stacked affine transformations with nonlinearities between them. The "Hello, World!" of neural networks and still the right tool for plenty of tabular problems.
Structure
$$h^{(\ell)} = \varphi\big(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}\big), \qquad h^{(0)} = x$$
Each layer is an affine transform followed by an elementwise nonlinearity φ. No memory, no recurrence, no spatial awareness: just a stack of functions that takes an input vector and produces an output vector.
Activation functions
| Name | Formula | Characteristics |
|---|---|---|
| Sigmoid | 1/(1+e⁻ˣ) | Saturates, vanishing gradients, output in (0,1). Mostly replaced. |
| Tanh | (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ) | Zero-centered sigmoid. Used in some RNNs. |
| ReLU | max(0, x) | Default for hidden layers. Sparse, non-saturating. Can "die" (output stuck at 0). |
| Leaky ReLU | max(αx, x), α≈0.01 | Fixes dying ReLU. |
| GELU | x·Φ(x) | Smooth, used in transformers (BERT, GPT). |
| Swish/SiLU | x·sigmoid(x) | Smooth, slight edge over ReLU on deep nets. |
| Softmax | eᶻⁱ/Σeᶻʲ | Output layer for multiclass classification. |
Universal Approximation Theorem
A feedforward network with a single hidden layer of sufficient width and a non-polynomial activation can approximate any continuous function on a compact domain to arbitrary accuracy (Cybenko 1989, Hornik 1991). This is a theoretical guarantee of expressive power, not a practical recipe: the required width can be enormous, and in practice deeper networks usually reach the same accuracy with far fewer parameters than shallow wide ones.
When to use an MLP
- Tabular data — fixed-size feature vectors, no spatial or temporal structure. MLPs remain competitive here; gradient-boosted trees (XGBoost, LightGBM) often still win on small-to-medium tabular datasets.
- Regression heads — on top of a feature extractor (CNN, RNN, Transformer), an MLP is the canonical "read-out" head.
- Critic/value networks — in reinforcement learning.
- Physics-informed models — as a flexible function approximator inside a structured model (neural ODEs, PINNs, surrogate models for expensive simulators).
When not to use an MLP
- Images → CNNs or Vision Transformers. MLPs ignore spatial structure and need vastly more parameters.
- Sequences → RNNs/LSTMs/GRUs, or Transformers. MLPs have no memory.
- Graphs → Graph Neural Networks.
Initialization
- Xavier/Glorot — variance = 2/(fan_in + fan_out). For tanh/sigmoid.
- He/Kaiming — variance = 2/fan_in. For ReLU family. The default for modern nets.
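- Zero initialization is fine for biases but not for weights: with all-zero weights, every unit in a layer computes the same output and receives the same gradient, so symmetry is never broken. Never use it for weight matrices.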
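The two variance rules can be sketched as follows. Gaussian sampling is assumed here (uniform variants with the same variance are also common), and the layer sizes and helper names are hypothetical.

```python
import math
import random

def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier/Glorot: variance = 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """Standard deviation for He/Kaiming: variance = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)

def init_weights(fan_in, fan_out, std, rng):
    """Sample a fan_out x fan_in weight matrix from N(0, std^2)."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)                           # fixed seed for reproducibility
W = init_weights(256, 128, he_std(256), rng)     # weights: random, He-scaled
b = [0.0] * 128                                  # biases: zero is fine
```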
14 · Recurrent Neural Networks
Feedforward networks augmented with a hidden state that feeds back in at the next timestep. The classical way to process sequences, largely supplanted by Transformers but still relevant on resource-constrained devices.
The core recurrence
hₜ = φ(Wₓ xₜ + Wₕ hₜ₋₁ + b), with φ typically tanh.
The same weights Wₓ, Wₕ are reused across all timesteps. This weight sharing is what lets an RNN process sequences of arbitrary length with a fixed parameter count.
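A minimal sketch of one recurrence step and the loop over a sequence, assuming tanh as the nonlinearity and a bias term `b` (both are assumptions made for this illustration). The same `W_x`, `W_h`, and `b` are used at every timestep.

```python
import math

def matvec(W, v):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(W_x, W_h, b, x_t, h_prev):
    """One timestep: h_t = tanh(W_x x_t + W_h h_prev + b)."""
    pre = [p + q + r for p, q, r in zip(matvec(W_x, x_t), matvec(W_h, h_prev), b)]
    return [math.tanh(v) for v in pre]

def rnn_forward(W_x, W_h, b, xs, h0):
    """Run the recurrence over a sequence xs and return every hidden state."""
    h, states = h0, []
    for x_t in xs:
        h = rnn_step(W_x, W_h, b, x_t, h)
        states.append(h)
    return states

# Hypothetical parameters: input size 1, hidden size 2.
W_x = [[1.0], [0.0]]
W_h = [[0.0, 0.0], [0.5, 0.0]]
b = [0.0, 0.0]
short = rnn_forward(W_x, W_h, b, [[0.5]] * 3, [0.0, 0.0])
long = rnn_forward(W_x, W_h, b, [[0.5]] * 7, [0.0, 0.0])
```

Both calls use the same parameters; only the number of loop iterations changes with the sequence length.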
Training: backprop through time (BPTT)
Unroll the RNN for T timesteps and apply standard backpropagation through the unrolled graph. The gradient of the loss with respect to Wₕ involves products of up to T Jacobians, which leads directly to the core problem:
Vanishing and exploding gradients
Roughly speaking, if the spectral radius of Wₕ is less than 1, gradients shrink exponentially as they propagate backward through time → vanishing gradients → the network can't learn long-range dependencies. If it is greater than 1, gradients can blow up → exploding gradients → training diverges. Plain RNNs struggle to remember anything more than ~10 steps back.
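The effect is easiest to see in the simplest possible case. The sketch below assumes a scalar, linear recurrence h_t = w_h * h_{t-1}, so each Jacobian is just the number `w_h`; real RNNs have matrix Jacobians and a nonlinearity, so this shows only the mechanism.

```python
def gradient_through_time(w_h, steps):
    """d h_T / d h_0 for the scalar linear recurrence h_t = w_h * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w_h  # one Jacobian factor per timestep
    return grad

vanishing = gradient_through_time(0.9, 50)   # shrinks toward zero
stable = gradient_through_time(1.0, 50)      # stays at 1
exploding = gradient_through_time(1.1, 50)   # grows without bound
```

After 50 steps the factor is roughly 0.005 for w_h = 0.9 and roughly 117 for w_h = 1.1, even though the two weights differ by only 0.2.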
LSTM — the classic fix
Long Short-Term Memory (Hochreiter & Schmidhuber 1997) adds a separate cell state that flows through the sequence with only minor modifications at each step, controlled by three gates:
- Forget gate fₜ — how much of the previous cell state to keep.
- Input gate iₜ — how much of the proposed new content to add.
- Output gate oₜ — how much of the cell state to expose as the hidden state.
The cell state's update is approximately additive, so gradients can flow through it without vanishing. LSTMs can learn dependencies hundreds of steps long.
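A scalar sketch of one LSTM step, using the common modern formulation with a tanh candidate value `g_t` (an assumption; the text above does not spell out the equations). The parameter layout `params[name] = (w_x, w_h, b)` is hypothetical and chosen only to keep the example short.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(params, x_t, h_prev, c_prev):
    """One step of a scalar LSTM; params maps gate name -> (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h_prev + b

    f_t = sigmoid(pre("f"))          # forget gate
    i_t = sigmoid(pre("i"))          # input gate
    o_t = sigmoid(pre("o"))          # output gate
    g_t = math.tanh(pre("g"))        # proposed new content
    c_t = f_t * c_prev + i_t * g_t   # additive cell-state update
    h_t = o_t * math.tanh(c_t)       # exposed hidden state
    return h_t, c_t

# Gates biased to "keep everything, write nothing": the cell state should persist.
params = {"f": (0.0, 0.0, 10.0), "i": (0.0, 0.0, -10.0),
          "o": (0.0, 0.0, 0.0), "g": (0.0, 0.0, 0.0)}
h, c = 0.0, 1.0
for _ in range(200):
    h, c = lstm_step(params, 0.0, h, c)
```

With the forget gate close to 1, the cell state is carried almost unchanged across 200 steps, which is the behavior the additive update is meant to provide.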
GRU — the simpler cousin
Gated Recurrent Unit (Cho et al. 2014) merges the forget and input gates into a single "update gate" and combines the cell and hidden states. Fewer parameters than LSTM, often similar performance. A reasonable default when you don't want to debate LSTM vs GRU.
Variants and tricks
- Bidirectional RNN — one RNN reads the sequence left to right, a second reads it right to left, and their hidden states are concatenated. Useful when the whole sequence is available (not for real-time).
- Encoder–decoder — the classic architecture for sequence-to-sequence tasks (translation). Encoder RNN compresses the input into a context vector; decoder RNN generates the output.
- Attention over RNN states — the predecessor to Transformers. Instead of a single context vector, the decoder attends to all encoder hidden states.
- Teacher forcing — during training, feed the ground-truth previous token as input rather than the model's own prediction. Faster convergence, but causes exposure bias at inference.
Why Transformers won
- Parallelism — RNNs are inherently sequential; Transformers process all positions in parallel.
- Unlimited receptive field — every token attends directly to every other token in the context window, regardless of distance.
- Better scaling — empirically, Transformers benefit more from scale.
Where RNNs still make sense
- Streaming / online inference with very tight latency, where you can't afford to re-run attention over the whole context.
- Very long sequences where O(n²) attention is too expensive (though linear-attention Transformers are closing this gap).
- Tiny memory budgets on edge devices — a small GRU is still hard to beat on a microcontroller.
15 · Quantization
Reduce numeric precision of weights and activations — typically FP32 → INT8 — to shrink the model, accelerate inference, and lower energy consumption. Essential for edge deployment.
The basic mapping
q = clamp(round(x / s) + z, qmin, qmax), with dequantization x ≈ s · (q − z),
where s is the scale and z is the zero point. Pick them so that the quantized range [qmin, qmax] (e.g., [−128, 127] for signed INT8) covers the float range [xmin, xmax] accurately.
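A minimal sketch of the affine mapping for signed INT8, assuming the usual convention q = round(x / s) + z with clamping, and x ≈ s · (q − z) on the way back. The helper names are hypothetical, and the float range is assumed to be non-degenerate (x_max > x_min).

```python
def quant_params(x_min, x_max, q_min=-128, q_max=127):
    """Scale and zero point that map [x_min, x_max] onto [q_min, q_max]."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = int(round(q_min - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, q_min=-128, q_max=127):
    """Float -> integer code, clamped to the representable range."""
    q = int(round(x / scale)) + zero_point
    return max(q_min, min(q_max, q))

def dequantize(q, scale, zero_point):
    """Integer code -> approximate float."""
    return scale * (q - zero_point)

# Post-ReLU style range [0, 6]: the zero point lands at the bottom of the INT8 range.
scale, zero_point = quant_params(0.0, 6.0)
q = quantize(3.3, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
```

For in-range inputs the round-trip error is at most half a quantization step, `scale / 2`; inputs outside the range are clipped to `q_min` or `q_max`.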
Symmetric vs asymmetric
- Symmetric — z = 0. Range is [−xmax, xmax]. Faster matmul (no zero-point correction). Wastes range if the distribution is asymmetric (e.g., post-ReLU activations which are ≥ 0).
- Asymmetric — z ≠ 0, chosen to best match the actual range. Better accuracy on asymmetric distributions; slightly more expensive compute.
Standard practice: symmetric for weights (roughly zero-mean), asymmetric for activations.
Granularity
- Per-tensor — one scale/zero point for the whole tensor. Smallest overhead, worst accuracy for tensors with varied statistics.
- Per-channel (per-axis) — separate scale per output channel (for weights) or per feature map (for activations). Much better accuracy for weights; standard for CNNs.
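The difference between the two granularities shows up as soon as one channel has much larger weights than another. The sketch below assumes symmetric INT8 quantization and a made-up 2×3 weight matrix; the helper names are hypothetical.

```python
def sym_scale(values, q_max=127):
    """Symmetric scale: the largest magnitude maps to q_max."""
    return max(abs(v) for v in values) / q_max

def fake_quant(v, scale, q_max=127):
    """Quantize and immediately dequantize one value."""
    q = max(-q_max, min(q_max, round(v / scale)))
    return q * scale

def row_error(row, scale):
    """Worst-case absolute rounding error over one row."""
    return max(abs(fake_quant(v, scale) - v) for v in row)

W = [[0.01, -0.02, 0.015],   # small-magnitude output channel
     [5.0, -3.0, 4.0]]       # large-magnitude output channel

per_tensor_scale = sym_scale([v for row in W for v in row])
per_channel_scales = [sym_scale(row) for row in W]

err_tensor = row_error(W[0], per_tensor_scale)
err_channel = row_error(W[0], per_channel_scales[0])
```

With a single per-tensor scale, the large channel sets the step size and the small channel's weights collapse onto just one or two integer codes; a per-channel scale gives each row its own step size.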
Post-Training Quantization (PTQ)
Take a trained FP32 model, compute the scales and zero points, and convert the weights. No retraining is required; the static variant additionally runs a calibration dataset through the model to collect activation statistics.
- Dynamic PTQ — quantize weights offline and quantize activations on the fly during inference, with scales computed from each input's observed range (min/max or simple percentiles). Cheap and simple, needs no calibration set, and adapts to whatever input arrives.
- Static PTQ — quantize weights and activations offline, using a calibration set to fix activation scales. Faster inference (no per-input scale computation); accuracy depends on how representative the calibration set is.
PTQ typically loses 0–2% accuracy at INT8 on standard vision models, 3–5% on more sensitive architectures. At INT4 and below, PTQ usually isn't enough.
Quantization-Aware Training (QAT)
Insert "fake quantization" nodes into the model during training — they simulate the rounding and clipping of the target bit-width while keeping gradients flowing (straight-through estimator). The model learns weights that are robust to quantization noise.
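A scalar sketch of a fake-quantization node with a straight-through estimator. Symmetric scaling with zero point 0 is assumed, and the backward rule shown (pass the gradient through unchanged inside the clipping range, zero it outside) is one common variant rather than the only one.

```python
def fake_quant_forward(w, scale, q_min=-128, q_max=127):
    """Round and clip to the integer grid, then map straight back to float."""
    q = round(w / scale)
    q = max(q_min, min(q_max, q))
    return q * scale

def fake_quant_backward(w, scale, upstream_grad, q_min=-128, q_max=127):
    """Straight-through estimator: treat round() as the identity.

    Inside the clipping range the gradient passes through unchanged;
    outside it the output is constant, so the gradient is zero.
    """
    in_range = q_min * scale <= w <= q_max * scale
    return upstream_grad if in_range else 0.0

w, scale = 0.26, 0.1
w_q = fake_quant_forward(w, scale)                  # snaps to the nearest grid point
grad_w = fake_quant_backward(w, scale, 2.0)         # gradient flows as if no rounding
grad_clipped = fake_quant_backward(100.0, scale, 2.0)
```

The forward pass exposes the model to the rounding error it will see after deployment, while the backward pass still delivers a usable gradient to the underlying float weight.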
QAT recovers most of the accuracy lost by PTQ and is usually needed for INT4, INT2, and binary networks. The cost is an additional fine-tuning run on top of normal training, typically short relative to the original training budget.
Hardware support
| Platform | Preferred formats |
|---|---|
| NVIDIA GPU (Ampere+) | FP16, BF16, INT8 via Tensor Cores; FP8 on Hopper |
| Google TPU | BF16 training, INT8 inference |
| ARM Cortex-A / Neon | INT8 matmul via NEON intrinsics |
| ARM Cortex-M / CMSIS-NN | INT8, INT16 fixed-point |
| TriCore / automotive MCUs | Fixed-point (usually INT16 or INT32 with custom scaling) |
Typical wins
- FP32 → INT8: 4× smaller, 2–4× faster inference, 2–4× less energy, typically 0–2% accuracy loss on standard vision models.
- FP32 → INT4 (with QAT): 8× smaller, needs specialized hardware to realize speedup.
🚨 Traps
- Calibrating on an unrepresentative dataset produces miscalibrated scales and silent accuracy loss.
- Per-tensor quantization of weights with wide dynamic range (one huge outlier channel) destroys accuracy — always use per-channel for weights.
- Layer norm and softmax are numerically sensitive — often kept in FP16 even in an otherwise INT8 model.
- Depthwise convolutions are unusually sensitive to quantization; may need QAT even if the rest of the model doesn't.
16 · Pruning
Identify and remove the weights (or neurons, or channels, or whole layers) that contribute least to the network's output. Shrinks the model and, if done structurally, speeds up inference.
Granularity: the key choice
| Granularity | What's removed | Compression | Speedup on dense HW |
|---|---|---|---|
| Unstructured (weight-level) | Individual weights | Very high (90%+ achievable) | None — produces a sparse matrix |
| Vector / row / column | Sub-blocks of a weight matrix | High | Some, with structured sparse kernels |
| Filter / channel | Entire output channels of a conv or FC layer | Moderate | Full speedup on any HW |
| Layer / block | Whole layers (often in ResNets) | Modest | Full speedup |
The fundamental tradeoff: unstructured pruning gives the highest compression ratio, but the resulting sparse weight matrix doesn't run any faster on a dense GEMM (the standard CPU/GPU matmul). To get actual speedup, you need structured pruning, or specialized sparse hardware (NVIDIA 2:4 sparsity on Ampere+, some mobile accelerators).
Pruning criteria
- Magnitude — remove weights with |w| below a threshold. Simple, strong baseline, used everywhere.
- Gradient-based — remove weights whose gradient is small (they're not actively being updated).
- Taylor expansion — estimate the effect of removing each weight on the loss using first- or second-order Taylor terms. More principled.
- Fisher information — weights with low Fisher info contribute little to the model's predictions.
- Movement pruning — for fine-tuning: prune weights that moved toward zero during training. Particularly effective for transformer fine-tuning.
- Filter-level — prune filters with low L1/L2 norm, low batch-norm scaling factor, or low activation magnitude on validation data.
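Magnitude pruning, the first criterion in the list, can be sketched in a few lines. The version below works on a flat list of weights and prunes a target fraction rather than using a fixed threshold; the function name and the example weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest |w|.

    Returns the pruned weights and a 0/1 mask; the mask can be re-applied
    after each fine-tuning step so that pruned weights stay at zero.
    """
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    mask = [0 if i in pruned else 1 for i in range(len(weights))]
    return [w * m for w, m in zip(weights, mask)], mask

weights = [0.5, -0.05, 0.3, 0.01, -0.8, 0.1]
pruned_weights, mask = magnitude_prune(weights, sparsity=0.5)
```

For structured pruning the same idea applies at a coarser level: score each filter or channel (for example by its L1 norm) and drop the lowest-scoring ones as a unit.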
The prune–fine-tune cycle
- Train a model to convergence.
- Compute pruning scores for every weight/channel/filter.
- Remove the bottom X% (or set them to zero and mask them out).
- Fine-tune for a few epochs to recover accuracy.
- Repeat (iterative pruning) — often outperforms one-shot pruning of the same total amount.
Iterative Magnitude Pruning (IMP) and the Lottery Ticket Hypothesis
Frankle & Carbin (2018) showed that for many networks, if you prune a trained network heavily (say 90%), reset the surviving weights to their original initial values, and retrain only that sparse subnetwork, it reaches nearly the original accuracy. This suggests that dense networks contain small "winning ticket" subnetworks that, once found, are sufficient on their own.
One-shot vs gradual pruning
- One-shot — prune all X% in one step, fine-tune. Simple but less robust.
- Gradual — increase the pruning ratio smoothly from 0 to X% over many epochs while training continues. Less accuracy drop, more compute.
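For gradual pruning, the target sparsity at each training step comes from a schedule. The sketch below assumes a cubic ramp, which prunes quickly at first and slows down as the network gets sparser; this particular shape is an illustrative choice, not something prescribed above, and the function name is hypothetical.

```python
def target_sparsity(step, total_steps, final_sparsity, initial_sparsity=0.0):
    """Cubic ramp from initial_sparsity (step 0) to final_sparsity (total_steps)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

# Target sparsity at a few points of a hypothetical 1000-step pruning phase.
schedule = [target_sparsity(s, 1000, 0.9) for s in (0, 250, 500, 750, 1000)]
```

At each pruning step the current target would be passed to the pruning criterion in use, and training continues in between so the network can adapt.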
Practical guidance
- Start with magnitude pruning. Use it unless you have evidence something else works better.
- For deployment speedup on general hardware, use structured (filter-level) pruning.
- For models that will be followed by quantization, prune first — quantization of a pruned model often behaves better.
- For fine-tuning a large pretrained model on a small dataset, movement pruning is a strong choice.
17 · Knowledge Distillation
Train a small "student" model to imitate a large "teacher" — not just its predictions, but its full probability distribution. Transfers information beyond what hard labels convey.
The Hinton formulation (2015)
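Let z be the logits of a model. Standard softmax gives p = softmax(z). Add a temperature T > 1:
pᵢ(T) = exp(zᵢ / T) / Σⱼ exp(zⱼ / T)
Higher T produces a softer distribution (all probabilities pulled toward uniform). The student is trained to match the teacher's softened distribution and the hard labels:
L = α · CE(y, softmax(z_student)) + (1 − α) · T² · KL(softmax(z_teacher / T) ‖ softmax(z_student / T))
where α weights the hard-label term. The soft-target gradients scale as 1/T², so the T² factor is there to keep the distillation gradient magnitude roughly independent of T.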
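A sketch of the combined loss for a single example, in plain Python. It assumes that α weights the hard-label cross-entropy and (1 − α) weights the soft-target term, which matches how α is described in the pitfalls for this section; some implementations use the opposite convention. The logits are made-up values.

```python
import math

def softmax(z, T=1.0):
    """Softmax with temperature T (T = 1 is the standard softmax)."""
    m = max(v / T for v in z)
    exps = [math.exp(v / T - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    """alpha * hard-label CE + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    # Hard-label term: ordinary cross-entropy at T = 1.
    ce = -math.log(softmax(student_logits)[label])
    # Soft-target term: KL divergence between temperature-softened distributions.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0.0)
    return alpha * ce + (1.0 - alpha) * T * T * kl

teacher_logits = [4.0, 1.0, 0.2]
student_logits = [2.5, 1.5, 0.1]
loss = kd_loss(student_logits, teacher_logits, label=0)
```

At inference time only `softmax(student_logits)` with T = 1 is used; the temperature and the teacher appear in the training loss alone.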
Why it works: dark knowledge
A teacher that outputs [0.7, 0.2, 0.08, 0.02] for four classes is saying more than just "class 1." The relative probabilities encode similarity structure: class 2 is "more like" class 1 than class 4 is. This information is present in the teacher but absent from the one-hot hard label. Temperature amplifies it.
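To see the softening on the four-class example above, the probabilities can be converted back to logits (log-probabilities work, since softmax ignores a constant shift) and passed through a softmax at T = 4. The helper name `soften` is hypothetical.

```python
import math

def soften(probs, T):
    """Re-apply softmax at temperature T, using log-probabilities as logits."""
    scaled = [math.log(p) / T for p in probs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher = [0.7, 0.2, 0.08, 0.02]
softened = soften(teacher, T=4.0)
```

At T = 4 the distribution becomes roughly [0.37, 0.27, 0.21, 0.15]: the ranking of the classes is unchanged, but the small probabilities are now large enough to contribute meaningfully to the student's loss.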
Variants
- FitNets (Romero et al. 2015) — match the student's intermediate activations to the teacher's, not just the output.
- Attention Transfer — match the spatial attention maps of teacher and student.
- DistilBERT — compresses BERT to a model that is 40% smaller and 60% faster while retaining about 97% of the accuracy.
- TinyBERT, MobileBERT — more aggressive compression with multi-stage distillation.
- Self-distillation — the teacher is the same model at an earlier epoch, or an ensemble of the same architecture. Can improve accuracy over training from scratch — "distillation as regularization."
- Born-Again Networks — distill to a student of the same size as the teacher. Students often outperform teachers. Mysterious.
- Response, feature, and relation-based distillation — three-way taxonomy of what to match: outputs, intermediate features, or relations between features.
Practical notes
- Typical T ∈ [2, 10]. Start with T = 4 and α = 0.5.
- KD is most effective when the student has enough capacity to express the teacher's behavior — too small a student will hit an accuracy floor.
- KD does not require labeled data for distillation loss — you can distill using unlabeled data and just the teacher's outputs. Useful when you have lots of unlabeled data but only some labels.
- Stacking KD with quantization and pruning is the standard edge-deployment pipeline.
Common pitfalls
- Mismatched temperature between training and inference: at inference, run the student with T = 1 (standard softmax). The T > 1 softening is training-time only.
- Poor teacher → poor student. Always verify the teacher's quality first.
- Too high α → student ignores the teacher and just learns from hard labels.
- Too low α → student can over-imitate teacher mistakes.
18 · Compression Stack for Edge Deployment
Running neural networks on microcontrollers, automotive ECUs, or battery-constrained devices. The end-to-end pipeline that takes a GPU-trained research model down to a ~100 KB binary running at 1 ms latency.
The stack, in order
- Start with the right architecture. Don't try to compress a ResNet-152 for a microcontroller; start with MobileNet, EfficientNet-Lite, SqueezeNet, ShuffleNet, or a hand-designed tiny model. Architecture search tailored for mobile (NetAdapt, MnasNet, MCUNet) can find better starting points.
- Distillation. Train a larger, more accurate teacher model, then distill knowledge into the smaller target architecture. Typical wins: 1–3% accuracy at no inference-time cost.
- Pruning. Structured pruning (filters / channels / heads) to get real speedup on dense hardware. Iterative, with fine-tuning between rounds. Typical wins: 30–70% FLOP reduction at < 1% accuracy loss.
- Quantization. INT8 as the standard target. QAT if PTQ isn't enough or you need INT4. Typical wins: 4× smaller, 2–4× faster.
- Compile. Convert to a deployment format (TFLite, ONNX Runtime, TensorRT, Core ML, OpenVINO), let the compiler fuse operators, lay out memory, and tile for the target hardware. Often another 1.5–3× speedup on top.
Target hardware and toolchains
| Target | Toolchain | Typical use |
|---|---|---|
| NVIDIA GPU (server) | TensorRT | Data-center inference |
| Intel CPU/iGPU | OpenVINO | Industrial / edge server |
| Mobile phone (ARM) | TFLite, Core ML, PyTorch Mobile | Consumer apps |
| Microcontroller (Cortex-M) | TFLite Micro, CMSIS-NN, microTVM | Sensor processing, keyword spotting |
| Automotive MCU (e.g. Infineon AURIX / TriCore) | Vendor SDKs + hand-tuned C, TargetLink, custom fixed-point code | Real-time control, safety-critical |
| Mobile NPU / DSP | Qualcomm SNPE, Huawei HiAI, Samsung Eden | On-device ML |
Constraints that drive the pipeline
- Memory — model size, runtime working memory, stack usage. Often the binding constraint on microcontrollers (< 1 MB RAM).
- Compute — FLOPs per inference. A model is feasible only if the hardware can finish them within the latency budget.
- Latency — real-time deadlines. Fixed-point control loops run at kHz rates; missing a deadline is a safety issue.
- Energy — battery-powered or thermally constrained designs. Quantized INT8 uses ~10× less energy per op than FP32.
- Determinism — safety-critical systems (ASIL B/C/D) often prohibit dynamic memory, require bounded execution time, and need code that can be statically analyzed and certified.
Multiplicative effect
Each stage multiplies the gains of the stages before it, so the total reduction is the product of the per-stage factors.
Real numbers vary, but a 50–200× total reduction from the initial research model to the deployed binary is typical.
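A quick arithmetic sketch of how the factors compound. The per-stage values are hypothetical round numbers chosen for illustration (only the 4× for INT8 comes from the figures quoted in this document); real factors depend on the model and the target.

```python
import math

# Hypothetical per-stage size-reduction factors.
stage_factors = {
    "compact architecture instead of a large research model": 10.0,
    "structured pruning": 2.0,
    "INT8 quantization": 4.0,
}

# The stages compound multiplicatively, not additively.
total_reduction = math.prod(stage_factors.values())
```

With these made-up factors the product is 80×, which falls inside the typical range quoted above.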
Monitoring in production
- Accuracy on representative field data, not just the validation set: distribution shift remains a risk for as long as the model is deployed.
- Latency percentiles (P50, P95, P99), not just averages.
- Memory footprint and peak stack depth.
- Energy per inference on battery-critical devices.
- A safety-critical system also tracks ASIL-level metrics: fault detection rate, single-point fault coverage, diagnostic latency.
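Latency percentiles are straightforward to compute from logged inference times. The sketch below uses the standard library's `statistics.quantiles`; the sample values are synthetic and the function name is hypothetical.

```python
import statistics

def latency_percentiles(samples_us):
    """Return P50, P95, and P99 of a list of latency samples (microseconds)."""
    # n=100 yields 99 cut points; cut point k (1-based) is the k-th percentile.
    cuts = statistics.quantiles(samples_us, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Synthetic log: 95 fast inferences and 5 slow outliers.
samples_us = [200.0] * 95 + [2000.0] * 5
report = latency_percentiles(samples_us)
mean_us = statistics.mean(samples_us)
```

In this synthetic log the mean is 290 µs, which hides the fact that one inference in twenty takes 2 ms; the P99 value exposes it.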
💡 Automotive / motor-control specifics
- Fixed-point arithmetic with Q-format scaling is still the norm on TriCore and Cortex-M4 targets — INT8 quantization-aware training maps cleanly onto this.
- ASIL constraints often exclude any library (including neural-net inference libraries) that hasn't been certified. Certified options exist (TÜV SÜD-certified TFLite variants, Infineon AURIX AI stacks), but the certification process influences architecture choices from the start.
- Determinism trumps peak performance: a consistent 500 µs worst-case inference time is preferable to a 200 µs average with a 2 ms tail.
- The AUTOSAR / TargetLink toolchain typically consumes the trained model via generated C code with static LUTs and fixed-point matmul kernels, not an ML runtime.
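To make the Q-format point concrete, here is a sketch of Q15 fixed-point arithmetic (1 sign bit, 15 fractional bits) modeled with Python integers. Q15 is assumed as an example format; production code would be C with explicit 16- and 32-bit types, and the helper names are hypothetical.

```python
Q = 15                       # number of fractional bits
Q_MAX = (1 << Q) - 1         # 32767, largest representable code
Q_MIN = -(1 << Q)            # -32768, smallest representable code

def to_q15(x):
    """Convert a float in [-1, 1) to a Q15 integer, saturating at the limits."""
    q = int(round(x * (1 << Q)))
    return max(Q_MIN, min(Q_MAX, q))

def from_q15(q):
    """Convert a Q15 integer back to a float."""
    return q / (1 << Q)

def q15_mul(a, b):
    """Multiply two Q15 values.

    The raw product carries 30 fractional bits, so it is shifted right
    by 15 (with rounding) and saturated back into the 16-bit range.
    """
    product = a * b
    result = (product + (1 << (Q - 1))) >> Q
    return max(Q_MIN, min(Q_MAX, result))

half = to_q15(0.5)
quarter = q15_mul(half, half)
```

Every operation has a fixed cost and a bounded result, which is what makes this style of arithmetic attractive for worst-case execution time analysis.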