Neural Networks — Deep Dive Reference

By Majid Mazouchi

Companion to the interactive neural networks page. Each term below goes deeper: intuition, math, key variants, practical details, and the common pitfalls that go with them.

01 · Train / Validation / Test Splits

Three disjoint subsets of your data, each with a strictly different role. The core experimental protocol of supervised ML — get this wrong and nothing downstream is trustworthy.

The three roles

Training set
Used to fit weights via gradient descent. The model sees these labels directly.
Validation set
Used to tune hyperparameters (LR, depth, regularization, early-stop point, architecture choice). The model never trains on it, but you do make decisions based on it — so it gets "contaminated" by your choices over time.
Test set
Used once, at the very end, to produce the honest reported number. If you look at it and change anything, it's no longer a test set — it's another validation set.

Typical ratios

For medium datasets (10³–10⁵ samples): 70/15/15 or 80/10/10 are standard. For large datasets (>10⁶), the val and test sets can be proportionally tiny — 98/1/1 is fine because 10,000 test samples already give very tight confidence intervals. For small datasets, use k-fold cross-validation instead of a single held-out set.

Stratified splitting

For classification with imbalanced classes, random splitting can put all minority-class samples in the training set by chance. Passing stratify=y to sklearn.model_selection.train_test_split preserves class proportions across splits. Do this by default for any classification task.

Cross-validation

Instead of one fixed split, rotate through k folds:
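
A minimal sketch of that rotation, assuming NumPy. The helper name kfold_indices is hypothetical, not something from the original page. Each of the k folds serves as the validation set exactly once while the remaining folds form the training set.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)   # shuffle once, then cut into k folds
    folds = np.array_split(order, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=10, k=5))
```

Train once per pair and report the mean and spread of the metric across folds. The shuffle is only appropriate for i.i.d. samples; for time-series or grouped data the folds should follow time or group boundaries instead.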

🚨 Leakage — the #1 silent killer

  • Feature leakage — including a feature that's a proxy for the label (e.g., a post-hoc status field). Trivially inflates metrics.
  • Temporal leakage — training on data from the future. Common in time-series: fitting a scaler on the full dataset leaks future statistics.
  • Duplicate leakage — near-duplicate samples in both train and test (same image under different file names, same waveform with a different timestamp).
  • Subject leakage — same person / vehicle / sensor appears in both train and test. The model learns the subject, not the task.
  • Preprocessing leakage — computing the mean/std on the whole dataset then splitting. The test-set statistics just leaked into training.
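
Preprocessing leakage is the easiest of these to avoid mechanically. A minimal sketch, assuming NumPy and synthetic data (the array shapes and the 80/20 split are illustrative choices): split first, compute the statistics on the training rows only, then reuse them everywhere else.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))  # synthetic features

# Split FIRST, then compute statistics.
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:800], idx[800:]
X_train, X_test = X[train_idx], X[test_idx]

# Statistics come from the training rows only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma   # reuse the train statistics; never refit
```

The scaled test set will not have exactly zero mean and unit variance. That is expected, and it is the honest picture of what the model sees at deployment.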

For motor control / automotive contexts

Samples recorded under the same operating conditions are strongly correlated, so a random shuffle puts near-copies on both sides of the split. If your dataset contains 50 driving cycles, don't shuffle samples; split by cycle. If the data spans multiple vehicles or temperatures, stratify by those regimes, or evaluate held-out-regime generalization explicitly. A model that scores 99% on a random split can collapse to 60% on an out-of-regime split, and that gap is the real story.
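
Splitting by cycle can be sketched as follows, assuming NumPy and a per-sample array of cycle IDs. The helper split_by_group and the 100-samples-per-cycle layout are hypothetical. Whole cycles land on one side of the split, never both.

```python
import numpy as np

def split_by_group(groups, test_fraction=0.2, seed=0):
    """Return boolean (train_mask, test_mask) with whole groups on one side."""
    rng = np.random.default_rng(seed)
    unique = np.unique(groups)
    rng.shuffle(unique)                              # random order of group IDs
    n_test = max(1, int(round(test_fraction * len(unique))))
    test_groups = unique[:n_test]
    test_mask = np.isin(groups, test_groups)
    return ~test_mask, test_mask

# 50 hypothetical driving cycles, 100 samples each
cycle_id = np.repeat(np.arange(50), 100)
train_mask, test_mask = split_by_group(cycle_id, test_fraction=0.2)
```

scikit-learn ships the same idea as GroupKFold and GroupShuffleSplit in sklearn.model_selection, if you would rather not hand-roll it.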


02 · Signal Preprocessing

Garbage in, garbage out. Every real-world signal needs cleaning before a model sees it. This is the stage where most projects silently succeed or fail.

The canonical pipeline

In order, for raw sensor data:

  1. Anti-alias filter — before any downsampling. Otherwise high-frequency content aliases down into your band of interest.
  2. DC removal / detrending — subtract the mean, a linear fit, or a slow-moving baseline (high-pass filter with very low cutoff).
  3. Outlier detection and handling — z-score, IQR, Hampel, or model-based residuals.
  4. Smoothing — Savitzky–Golay when peaks matter, Kalman when you have a model.
  5. Normalization — z-score or min–max, fit on training data only.
  6. Feature engineering — FFT bins, rolling statistics, domain-specific transforms (Park/Clarke for three-phase motors, envelope detection for bearings).
  7. Regime labeling — tag each sample with operating-point metadata for stratified validation.

Normalization methods

Method | Formula | Robust to outliers | When to use
Z-score | (x − μ) / σ | No | Data is approximately Gaussian; default choice
Min–max | (x − min) / (max − min) | No | Bounded range required (e.g., image pixels to [0,1])
Robust | (x − median) / IQR | Yes | Heavy-tailed distributions; known outliers
Unit-norm | x / ‖x‖₂ | No | Direction matters more than magnitude (cosine similarity)

Feature engineering — motor-control examples

🚨 Traps

  • Fitting a scaler on train+val combined leaks statistics from val.
  • Decimating before anti-aliasing produces irreversible aliasing artifacts.
  • Replacing outliers with the mean biases the distribution and can break downstream features (e.g., rolling std).
  • Normalizing per-sample when you should normalize per-feature (or vice versa) — know your axis conventions.

03 · Savitzky–Golay Filter

A local polynomial least-squares smoother that preserves peaks, amplitudes, and derivatives — what you want when the features of the signal carry information, not just its low-frequency trend.

How it works

For every output sample, SG takes a window of 2k+1 neighboring input samples, fits a polynomial of degree P ≤ 2k in a least-squares sense, and sets the output equal to the polynomial's value at the center of the window:

ŷ[i] = Σ_{j=−k}^{k} c_j · y[i + j]

where the coefficients c_j come from the first row of (AᵀA)⁻¹Aᵀ and A is the Vandermonde matrix whose rows are [1, j, j², …, j^P]. The magic is that for any fixed window size W = 2k + 1 and polynomial order P, the coefficients are constant, so SG reduces to an FIR filter with a fixed impulse response.
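
The coefficient computation fits in a few lines. A minimal sketch, assuming NumPy; sg_coefficients is a hypothetical helper name chosen for illustration (in practice scipy.signal.savgol_filter does this for you). It builds the Vandermonde matrix and takes the first row of its pseudo-inverse.

```python
import numpy as np

def sg_coefficients(window, order):
    """Smoothing coefficients c_j for j = -k..k (first row of the pseudo-inverse)."""
    if window % 2 != 1 or order >= window:
        raise ValueError("window must be odd and order < window")
    k = window // 2
    j = np.arange(-k, k + 1, dtype=float)
    A = np.vander(j, order + 1, increasing=True)  # rows [1, j, j^2, ..., j^P]
    return np.linalg.pinv(A)[0]                   # row 0 = fitted value at j = 0

c = sg_coefficients(window=5, order=2)            # classic [-3, 12, 17, 12, -3] / 35

y = np.array([1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0])   # an exact quadratic
y_smooth = np.convolve(y, c[::-1], mode="valid")         # interior samples only
```

Because a quadratic lies inside the model class for P = 2, the filter passes it through unchanged on the interior samples. That is the shape-preserving property in its simplest form.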

Key parameters

Window size W
Must be odd. Larger W → more smoothing, more delay when the filter has to run in real time, more distortion near rapidly varying features. Typical: 5–51 for sensor data.
Polynomial order P
Must satisfy P < W. Higher P preserves more of the signal's shape but attenuates less noise. Typical: P = 2 or 3. P = 0 is just a moving average.
Derivative order
SG can also return the mth derivative of the fitted polynomial — a smooth derivative estimate. This is how you differentiate a noisy position signal into a usable velocity signal.

SG vs moving average

A plain moving average is equivalent to SG with P = 0 — it fits a constant (the mean) to the window. It attenuates peaks because a peak is, by definition, deviation from a local mean. SG with P ≥ 2 fits curvature, so the output polynomial bends up at a peak the way the data does, and the peak survives.

Frequency response

SG's magnitude response is flatter near DC than a boxcar moving average and has milder ripple in the passband, but worse stopband rejection. If you need clean stopband rejection, a Butterworth or Chebyshev IIR is better. If you need to preserve waveform shape, SG wins.

Real-time use

The standard symmetric SG needs (W − 1)/2 future samples, so a real-time implementation lags by (W − 1)/2 samples. If that matters, you need a causal SG (the window uses only past samples and the polynomial is evaluated at the current sample, the end of the window). Causal SG has worse noise rejection but adds no delay for signals the polynomial can follow; it's a standard trick in motor control observers.
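
A minimal sketch of the causal variant, assuming NumPy; causal_sg_coefficients is a hypothetical helper name. The only change from the symmetric filter is that the window offsets run from −(W − 1) to 0 and the fitted polynomial is evaluated at offset 0, the newest sample.

```python
import numpy as np

def causal_sg_coefficients(window, order):
    """Weights for y[i-window+1 .. i]; the fit is evaluated at the newest sample."""
    j = np.arange(-(window - 1), 1, dtype=float)   # offsets -(W-1), ..., -1, 0
    A = np.vander(j, order + 1, increasing=True)
    return np.linalg.pinv(A)[0]                    # polynomial value at offset 0

c = causal_sg_coefficients(window=11, order=2)

ramp = np.arange(50, dtype=float)                  # steadily increasing signal
# The output at time i uses only ramp[i-10 .. i]: no future samples.
filtered = np.array([c @ ramp[i - 10:i + 1] for i in range(10, 50)])
noise_gain = np.sum(c ** 2)                        # output variance per unit input variance
```

On the ramp the output equals the input sample for sample, so there is no lag. The price shows up in noise_gain, which is noticeably larger than for the symmetric filter of the same window and order.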

Edge handling

At the first and last (W − 1)/2 samples, a symmetric window runs off the edge. Options: pad with the edge value (simple, biases the ends), mirror the signal (smooth but adds a spurious reflection), extrapolate via the local polynomial (best quality, more compute), or just return zeros at the edges (simplest, loses samples).

💡 In practice

  • Start with W=11, P=3 for general-purpose smoothing. Sweep to taste.
  • For smoothing ahead of peak detection, use SG. For baseline removal, use a high-pass filter or a much longer SG window subtracted from the signal.
  • Implementations: scipy.signal.savgol_filter, MATLAB sgolayfilt. savgol_filter returns derivatives directly via deriv=; in MATLAB, derivative estimates come from the differentiation filters returned by sgolay.

04 · Outlier Removal

Not every weird sample is a problem — but some are sensor glitches, transcription errors, or events so rare they'll dominate a loss function. The art is distinguishing signal from noise.

Detection methods

Method | How it flags | Assumption | Robust to clusters?
Z-score | |x − μ| / σ > τ | Gaussian | No
IQR / Tukey | Outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] | Reasonably symmetric | Somewhat
Hampel | |x − median| / (1.4826·MAD) > τ | Symmetric, moderate tails | Yes
Isolation Forest | Average isolation depth in random trees | None (unsupervised) | Yes
LOF | Local density vs neighbors | Density-based | Yes
Model residuals | Large residual from a first-pass model | First model fits "normal" data | Yes

Z-score breakdown — why it fails on clusters

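Consider 1000 clean samples around 0 (unit variance) plus 50 spike outliers near 10. The mean shifts to ~0.5 and the std inflates to roughly 2.3. The real outliers now have z-scores of only about 4, uncomfortably close to a 3σ threshold; with 100 spikes instead of 50 they drop to about 3 and start slipping through. This is masking: the outliers inflate the very statistics used to detect them. It follows from the breakdown point. Mean and standard deviation have a 0% breakdown point, so a small fraction of bad samples can move them arbitrarily far. Hampel uses median and MAD, which have a 50% breakdown point: just under half your data can be outliers and the estimator still works.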
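
The same experiment in code, as a minimal sketch assuming NumPy and synthetic data (unit-variance clean samples, spikes jittered slightly around 10). It computes both scores for the spike samples so the two detectors can be compared directly.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=1000)
spikes = 10.0 + rng.normal(0.0, 0.1, size=50)
x = np.concatenate([clean, spikes])
is_spike = np.arange(x.size) >= 1000

# Classical z-score: mean and std are themselves pulled by the spikes.
z = np.abs(x - x.mean()) / x.std()

# Hampel score: median and scaled MAD barely move.
med = np.median(x)
mad = np.median(np.abs(x - med))
hampel = np.abs(x - med) / (1.4826 * mad)

z_spikes = z[is_spike]
hampel_spikes = hampel[is_spike]
print("spike z-scores:     ", z_spikes.min(), "to", z_spikes.max())
print("spike Hampel scores:", hampel_spikes.min(), "to", hampel_spikes.max())
```

The spikes clear a threshold of 3 by a narrow margin under the z-score and by a wide margin under the Hampel score. Increase the spike count and the z-score margin disappears first.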

Removal vs flagging vs winsorizing

Domain-aware handling

A current spike in a PMSM could be a sensor glitch or a real fault. A 500°C temperature reading on a silicon die is impossible; a 500°C reading on a combustion chamber is Tuesday. Always apply domain-specific bounds first (the "possible reading" filter), then statistical methods for whatever's left.

🚨 Traps

  • Computing outlier thresholds on the full dataset leaks test statistics into training.
  • Aggressive removal on small datasets creates biased estimates.
  • Setting a single global threshold when the noise level is regime-dependent (e.g., high-speed samples are intrinsically noisier) — use per-regime thresholds.
  • Calling "rare but real" events outliers and throwing them out. In fault detection, rare events are the whole point.

05 · Temperature Grouping (Regime Binning)

Physical systems are not i.i.d. across operating conditions. Binning by temperature — or any regime variable — lets you measure, validate, and train in a way that respects the underlying physics.

Why it matters

A dataset drawn from a motor operating at 20°C, 60°C, and 100°C mixes three physically distinct regimes: winding resistance R rises, magnet flux ψm drops, viscosity in bearings changes, sensor offsets drift. A single global model trained on the pool implicitly averages these effects. Its overall RMSE might be excellent while its per-regime error at 100°C is catastrophic — and you won't notice unless you bin.

Binning methods

What to do with bins

Connection to other techniques

Regime binning is essentially supervised clustering. It's a structured version of what Gaussian Mixture Models or hierarchical clustering discover automatically. When you know the physics, always impose the structure — it's more sample-efficient than making the model rediscover it.

💡 In practice

  • For motor control: bin by (speed, torque, temperature) jointly — a 3D operating-point map. This is already how you'd look at a flux map; make your training and validation protocol match.
  • Start with 3–5 bins per dimension. More bins = more statistical noise per bin; fewer bins = mixed regimes within a bin.
  • Report per-bin residuals as a heatmap during debugging. It instantly reveals which part of the operating space your model is failing on.
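
A per-bin error report takes only a few lines. A minimal sketch, assuming NumPy and a synthetic one-dimensional temperature regime; rmse_per_bin, the bin edges, and the noise model are illustrative assumptions, not values from a real motor dataset.

```python
import numpy as np

def rmse_per_bin(y_true, y_pred, regime, edges):
    """RMSE inside each regime bin; NaN where a bin has no samples."""
    bin_idx = np.digitize(regime, edges)          # values 0 .. len(edges)
    out = np.full(len(edges) + 1, np.nan)
    for b in range(len(edges) + 1):
        sel = bin_idx == b
        if sel.any():
            out[b] = np.sqrt(np.mean((y_true[sel] - y_pred[sel]) ** 2))
    return out

# Synthetic example: the prediction error grows with temperature.
rng = np.random.default_rng(0)
temp = rng.uniform(0.0, 120.0, size=3000)              # deg C
y_true = np.zeros_like(temp)
y_pred = rng.normal(0.0, 0.01 + 0.001 * temp)          # noisier when hot
edges = [40.0, 80.0]                                   # cold / warm / hot
per_bin = rmse_per_bin(y_true, y_pred, temp, edges)
overall = np.sqrt(np.mean((y_true - y_pred) ** 2))
```

The overall RMSE sits between the cold-bin and hot-bin values, which is exactly how a pooled metric hides the worst regime. For a joint (speed, torque, temperature) map, apply the same binning per axis and reshape the result into a grid for the heatmap.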

06 · Underfitting

High bias. The model is fundamentally too simple — or too constrained — to capture the patterns in the data. It can't even do well on the data it was trained on.

Symptoms

Causes and fixes

Cause | Fix
Model has insufficient capacity | Add layers, add neurons, use a richer architecture
Features are insufficient | Add features, do feature engineering, incorporate domain knowledge
Too much regularization | Reduce λ (L1/L2), reduce dropout rate
Learning rate too high (can't converge) | Reduce LR, use a warmup schedule
Learning rate too low (gets stuck) | Increase LR, use a cyclic schedule
Poor initialization | He / Xavier initialization; load a pretrained backbone
Wrong loss function for the task | Check objective matches the problem (e.g., regression vs classification)
Insufficient training | Train for more epochs; keep going while the training loss is still decreasing

Diagnosing the cause

The standard test is to train the same architecture on a tiny subset (say 32 samples) and see if it can perfectly overfit it. If yes, the architecture is capable and the problem is optimization or regularization. If no, the architecture or representation is too weak. This "can my model memorize a batch?" test is one of the first things to run on any new setup.


07 · Overfitting

High variance. The model has too much capacity relative to the data, and it memorizes the noise in the training set instead of learning the underlying signal.

Symptoms

The bias-variance decomposition

For squared-error loss, the expected generalization error decomposes as:

E[(ŷ − y)²] = Bias² + Variance + Irreducible noise

Underfitting = high bias. Overfitting = high variance. Increasing model capacity pushes bias down but variance up. More data reduces variance without increasing bias. Regularization trades some variance for some bias. The "sweet spot" is model-and-data dependent; that's what validation metrics are for.
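
The decomposition can be estimated numerically by refitting a model on many resampled training sets. A minimal sketch, assuming NumPy, a made-up "true" function sin(2πx), Gaussian label noise, and polynomial regression as the model family (all of these are illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 20)
x_eval = np.linspace(0.0, 1.0, 101)
noise_std = 0.3

def true_f(x):
    """Assumed ground-truth function for this toy experiment."""
    return np.sin(2 * np.pi * x)

def bias2_and_variance(degree, n_datasets=300):
    """Monte Carlo estimate of bias^2 and variance, averaged over x_eval."""
    preds = np.empty((n_datasets, x_eval.size))
    for d in range(n_datasets):
        y = true_f(x_train) + rng.normal(0.0, noise_std, size=x_train.size)
        coeffs = np.polyfit(x_train, y, degree)
        preds[d] = np.polyval(coeffs, x_eval)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - true_f(x_eval)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

b_lo, v_lo = bias2_and_variance(degree=1)   # too simple: high bias, low variance
b_hi, v_hi = bias2_and_variance(degree=6)   # flexible: low bias, higher variance
```

The straight line cannot follow the sine no matter how many datasets you average, so its error is almost all bias. The degree-6 fit is right on average but changes more from one dataset to the next.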

Fixes, in order of cost

  1. More data — always the best fix. Data augmentation gets you part of the way for free.
  2. Early stopping — free; just stop at validation minimum.
  3. Regularization — L2, L1, dropout, weight decay.
  4. Smaller model — reduce depth, width, or switch to a simpler architecture.
  5. Label smoothing — prevents overconfident outputs.
  6. Ensembling — average multiple models to reduce variance.
  7. Noise injection — add noise to inputs, weights, or activations during training.

🚨 Overfitting can be subtle

  • Overfitting to the validation set — after 500 hyperparameter trials, your val score is optimistic. This is why you also need a held-out test set.
  • Selection bias — reporting the best of many runs. Always report mean ± std across seeds.
  • Distribution shift masquerading as overfitting — train–val gap can be caused by genuine distribution mismatch rather than capacity. Fix the data, not the regularizer.

08 · Gradient Descent

The optimization engine of almost all deep learning. Iteratively move the weights opposite to the gradient of the loss.

Update rule

θ_{t+1} = θ_t − η · ∇_θ L(θ_t)

where η is the learning rate and ∇_θ L is the gradient of the loss with respect to the parameters. This is the "steepest descent" update: locally, it's the direction that most rapidly decreases the loss.
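
The update rule applied to a concrete loss, as a minimal sketch assuming NumPy and a noise-free synthetic least-squares problem (the data, the learning rate of 0.1, and the step count are illustrative).

```python
import numpy as np

# Least-squares loss L(theta) = mean((X @ theta - y)^2) on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true                     # noise-free, so the minimum is theta_true

def grad(theta):
    """Gradient of the mean squared error with respect to theta."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

theta = np.zeros(3)
eta = 0.1                              # learning rate (assumed value)
for t in range(500):
    theta = theta - eta * grad(theta)  # theta_{t+1} = theta_t - eta * grad L(theta_t)
```

This is batch gradient descent: every step uses all 200 samples. Replacing X and y inside grad with a random subset of rows at each step turns it into mini-batch SGD.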

Variants

Variant | Gradient uses | Tradeoff
Batch GD | All training samples | Smooth but slow; memory-intensive
SGD | One sample at a time | Very noisy; can escape saddle points; slow per-epoch
Mini-batch SGD | Batch of 32–2048 samples | The practical default. Good compromise

Beyond vanilla SGD

Optimization landscape challenges


09 · Learning Rate

The single most important hyperparameter. If you can only tune one thing, tune this. Too large and training diverges; too small and it crawls or stalls.

Intuition

The loss landscape is high-dimensional and curved. The learning rate controls how big a step you take along the gradient. In a smooth bowl-shaped loss, a large LR overshoots; a small LR takes forever. In a narrow ravine, a large LR bounces off the walls. The ideal LR depends on the local curvature — which is why adaptive optimizers exist, and why scheduling matters.

How to find a good starting LR

Scheduling — mandatory in practice

Schedule | Shape | When to use
Constant | Flat | Debugging; almost never in production
Step decay | Cliff drops every N epochs | Computer vision legacy; interpretable
Exponential | Smooth geometric decay | General-purpose
Cosine annealing | Half-cosine down to zero | Default for most modern nets
Warmup + cosine | Linear ramp then cosine | Transformers, large batches, deep nets
SGDR (warm restarts) | Periodic cosine restarts | Ensembling snapshots; exploring basins
Reduce on plateau | Drop when val loss stalls | Reactive; good when you don't know epoch count
Cyclical LR (CLR) | Triangular waves | Alternative to SGDR; exploration-heavy

Linear scaling rule

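When multiplying the batch size by k, multiply the LR by k (Goyal et al. 2017). This keeps the total weight change per epoch roughly constant, because k small-batch steps are replaced by one step that is k times larger. Goyal et al. report that the rule holds up to batch sizes of about 8k when paired with a gradual warmup; beyond that, layer-wise adaptive methods such as LARS are typically needed to prevent early divergence.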

LR × optimizer interaction

💡 A common recipe that just works

  • AdamW optimizer, LR = 3e-4, weight decay = 0.01.
  • Linear warmup for 500–2000 steps, then cosine decay to zero.
  • Gradient clipping at norm 1.0.
  • Adjust LR ±3× based on LR range test output.
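
The warmup-plus-cosine schedule from this recipe, as a minimal sketch in plain Python. The function name lr_at_step and the total_steps value are assumptions for illustration; base_lr and warmup_steps follow the numbers above.

```python
import math

def lr_at_step(step, base_lr=3e-4, warmup_steps=1000, total_steps=10000):
    """Linear warmup from 0 to base_lr, then half-cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)                 # hold at 0 after total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at_step(s) for s in range(0, 10001, 1000)]
```

Most frameworks provide equivalent schedulers; writing the function out once is still useful for plotting the curve and checking that the warmup length and the decay horizon are what you intended.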

10 · Regularization

Anything that discourages the model from memorizing the training set. Most regularizers work by imposing a prior — on the weights, on the activations, or on the data itself.

Weight-space regularization

Add a penalty on the weights to the loss:

L_total = L_data(w) + λ · R(w)

L2 vs weight decay in Adam

For SGD, adding (λ/2)‖w‖² to the loss (which adds λw to the gradient) and shrinking the weights by ηλw at every step (weight decay) are mathematically identical. For Adam, they're not: the adaptive denominator rescales the L2 gradient along with the data gradient, so the effective regularization strength differs per parameter. AdamW fixes this by applying weight decay directly to the weights, outside the adaptive machinery. Prefer AdamW over Adam + L2.
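
A single update step is enough to see the difference. A minimal sketch, assuming NumPy and made-up values for the weights and the data-loss gradient. For Adam it uses only the first step, where the bias-corrected moments reduce to the gradient and its square; that is a simplification of the full optimizer.

```python
import numpy as np

w = np.array([1.0, -2.0, 0.5])
g = np.array([0.3, 0.1, -0.2])       # data-loss gradient at w (made-up values)
eta, lam = 0.1, 0.01

# SGD: the L2 penalty (lam/2)*||w||^2 adds lam*w to the gradient ...
w_sgd_l2 = w - eta * (g + lam * w)
# ... which is the same as a plain step plus decoupled weight decay.
w_sgd_wd = w - eta * g - eta * lam * w

def adam_first_step(grad, eps=1e-8):
    """Bias-corrected Adam direction at step 1: m_hat = grad, v_hat = grad^2."""
    return grad / (np.sqrt(grad ** 2) + eps)

# Adam + L2: the penalty gradient passes through the adaptive normalization.
w_adam_l2 = w - eta * adam_first_step(g + lam * w)
# AdamW: the decay is applied to the weights directly, outside the normalization.
w_adamw = w - eta * adam_first_step(g) - eta * lam * w
```

The two SGD results agree to floating-point precision. In the Adam + L2 case the normalization wipes out the magnitude of the penalty on this step, while the AdamW result still carries the full ηλw shrinkage.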

Dropout

During training, independently drop each activation with probability p. At inference, keep all activations and scale by (1 − p) (or use inverted dropout: scale training activations by 1/(1−p) instead). This is approximately equivalent to training an exponentially large ensemble of sub-networks and averaging at inference.
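
Inverted dropout in a few lines, as a minimal sketch assuming NumPy; the function name and the explicit training flag are illustrative, not a specific framework's API.

```python
import numpy as np

def inverted_dropout(a, p, training, rng):
    """Drop each activation with probability p; rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return a                       # inference: identity, no rescaling needed
    keep = rng.random(a.shape) >= p    # True with probability 1 - p
    return a * keep / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
train_out = inverted_dropout(a, p=0.5, training=True, rng=rng)
eval_out = inverted_dropout(a, p=0.5, training=False, rng=rng)
```

With p = 0.5 the surviving activations are doubled, so the expected value of each unit matches what the network sees at inference. That is why the inverted form needs no scaling step at deployment.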

Normalization as regularization

Batch normalization and layer normalization are primarily optimization tools (they rescale activations to stabilize training), but they also act as mild regularizers. BN adds noise via the per-batch statistics; LN doesn't, but does restrict the hypothesis class. With BN present, dropout is often redundant or counterproductive.

Data-space regularization

Usually stronger than weight-space regularization for modern deep nets:

Early stopping

Monitor validation loss; stop when it hasn't improved for N epochs (the "patience"). Save the best-so-far checkpoint and restore it at the end. Free, effective, always enable it.
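
The patience logic, as a minimal sketch in plain Python operating on a recorded list of validation losses. The function name and the example loss values are made up; in a real loop the checkpoint save would happen where the comment indicates.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # a checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch              # stop; restore the best checkpoint
    return best_epoch, len(val_losses) - 1        # ran out of epochs without triggering

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56, 0.60]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
```

Here the best loss occurs at epoch 3 and training stops at epoch 6, after three epochs without improvement. The weights that get deployed are the ones from epoch 3, not epoch 6.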


11 · Hyperparameter Search

The outer loop around training. You can't gradient-descend through the choice of architecture, the learning rate, or the batch size — you have to search over them explicitly.

The three classical strategies

Strategy | Mechanism | Strengths | Weaknesses
Grid search | Evaluate every combination | Exhaustive; easy to parallelize; reproducible | Cost is exponential in dimensions; wastes budget on insensitive axes
Random search | Sample uniformly from the space | Usually beats grid at equal budget (Bergstra & Bengio 2012); trivially parallel | No learning from past trials
Bayesian opt | Fit surrogate, use acquisition function to pick next | Sample-efficient; principled exploration/exploitation | Serial bottleneck; surrogate can be miscalibrated

Why random beats grid

In most problems, a few hyperparameters matter a lot and many matter very little. Grid search wastes budget sampling the irrelevant axes densely. Random search naturally allocates more distinct values to the axes that matter. This was the surprising empirical finding of Bergstra & Bengio: with the same number of trials, random search finds better configurations than grid search on most deep-learning problems.
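
The counting argument behind this is easy to check. A minimal sketch, assuming NumPy, a budget of 16 trials, and a two-dimensional space where only the learning rate matters; the second axis is a stand-in for any insensitive hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(0)
budget = 16

# Grid: 4 x 4 over (learning rate, some insensitive knob)
grid_lr = np.logspace(-4, -1, 4)
grid_other = np.linspace(0.0, 1.0, 4)
grid_trials = [(lr, other) for lr in grid_lr for other in grid_other]

# Random: same budget, each axis sampled independently (log-uniform for the LR)
rand_trials = [(10 ** rng.uniform(-4, -1), rng.uniform(0.0, 1.0))
               for _ in range(budget)]

distinct_lr_grid = len({lr for lr, _ in grid_trials})   # 4
distinct_lr_rand = len({lr for lr, _ in rand_trials})   # 16
```

With the same 16 trials, the grid probes 4 learning rates and random search probes 16. If the second axis has no effect, the grid has spent three quarters of its budget repeating experiments.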

Bayesian optimization in detail

Two components:

Multi-fidelity methods

When each trial is expensive (minutes to days), you can't afford many. Multi-fidelity methods start many trials cheaply and kill bad ones early:

Practical guidelines

🚨 Traps

  • Tuning on the test set. All hyperparameters must be chosen using the validation set.
  • Using the same seed for all trials — masks variance.
  • Searching too wide a space early. Exploration budget scales poorly with space volume.
  • Forgetting to fix the random seed for the data split — different folds produce different "best" hyperparameters.

12 · Backpropagation

Reverse-mode automatic differentiation applied to a computation graph. The algorithm that makes deep learning computationally feasible.

The core idea

Given a computation graph representing the loss as a composition of differentiable operations, backpropagation applies the chain rule backwards from the output to every parameter. The key trick: each intermediate gradient is computed exactly once and reused.

Why reverse-mode?

For a function f: ℝⁿ → ℝᵐ (n inputs, m outputs), you have two choices:

Neural network training has m = 1 (the scalar loss) and n = millions (the weights). Reverse-mode wins by a factor of n/m = millions. This is why all deep-learning frameworks use reverse-mode.

The forward-backward pattern

# forward pass
h₁ = W₁ · x
a₁ = ReLU(h₁)
h₂ = W₂ · a₁
ŷ = softmax(h₂)
L = CE(ŷ, y)

# backward pass: apply the chain rule
dL/dh₂  = ŷ − y                         # softmax + CE combined (y one-hot)
dL/dW₂  = dL/dh₂ · a₁ᵀ
dL/da₁  = W₂ᵀ · dL/dh₂
dL/dh₁  = dL/da₁ ⊙ (h₁ > 0)             # ReLU derivative
dL/dW₁  = dL/dh₁ · xᵀ
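
The same forward-backward pattern as runnable code, as a minimal sketch assuming NumPy, a single input vector, and a one-hot label. The layer sizes and random values are arbitrary, and a finite-difference check on one weight is included because it is the standard way to catch a wrong backward pass.

```python
import numpy as np

def forward_backward(W1, W2, x, y):
    """One hidden layer, softmax + cross-entropy. y is a one-hot vector."""
    # forward pass (h1 and a1 are cached for the backward pass)
    h1 = W1 @ x
    a1 = np.maximum(h1, 0.0)
    h2 = W2 @ a1
    e = np.exp(h2 - h2.max())            # shift for numerical stability
    y_hat = e / e.sum()
    loss = -np.sum(y * np.log(y_hat))

    # backward pass
    dh2 = y_hat - y                      # softmax + CE combined
    dW2 = np.outer(dh2, a1)
    da1 = W2.T @ dh2
    dh1 = da1 * (h1 > 0)                 # ReLU derivative
    dW1 = np.outer(dh1, x)
    return loss, dW1, dW2

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(5, 4)), rng.normal(size=(3, 5))
x, y = rng.normal(size=4), np.array([0.0, 1.0, 0.0])
loss, dW1, dW2 = forward_backward(W1, W2, x, y)

# Finite-difference check on one entry of W1
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[2, 1] += eps
W1_minus[2, 1] -= eps
numeric = (forward_backward(W1_plus, W2, x, y)[0]
           - forward_backward(W1_minus, W2, x, y)[0]) / (2 * eps)
```

The analytic entry dW1[2, 1] and the numeric estimate should agree to several decimal places. If they don't, the bug is in the backward pass, not the optimizer.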

Complexity

The backward pass costs roughly the same as the forward pass — within a constant factor of 2–3×. Memory is the bigger cost: to compute gradients, you must cache all intermediate activations from the forward pass. This is why memory usage scales with depth × batch size × activation size.

Gradient flow issues

Memory-saving tricks

Dynamic vs static graphs


13 · Feedforward Neural Networks (MLPs)

The simplest deep architecture: stacked affine transformations with nonlinearities between them. The "Hello, World!" of neural networks and still the right tool for plenty of tabular problems.

Structure

a⁽ˡ⁺¹⁾ = φ(W⁽ˡ⁾ a⁽ˡ⁾ + b⁽ˡ⁾)

Each layer is an affine transform followed by an elementwise nonlinearity φ. No memory, no recurrence, no spatial awareness — just a stack of functions that takes an input vector and produces an output vector.

Activation functions

Name | Formula | Characteristics
Sigmoid | 1/(1+e⁻ˣ) | Saturates, vanishing gradients, output in (0,1). Mostly replaced.
Tanh | (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ) | Zero-centered sigmoid. Used in some RNNs.
ReLU | max(0, x) | Default for hidden layers. Sparse, non-saturating. Can "die" (output stuck at 0).
Leaky ReLU | max(αx, x), α≈0.01 | Fixes dying ReLU.
GELU | x·Φ(x) | Smooth, used in transformers (BERT, GPT).
Swish/SiLU | x·sigmoid(x) | Smooth, slight edge over ReLU on deep nets.
Softmax | e^{z_i} / Σ_j e^{z_j} | Output layer for multiclass classification.

Universal Approximation Theorem

A feedforward network with a single hidden layer of sufficient width and a non-polynomial activation can approximate any continuous function on a compact domain to arbitrary accuracy (Cybenko 1989, Hornik 1991). This is a theoretical guarantee of expressive power, not a practical recipe — in practice, deep narrow networks are far more sample-efficient than shallow wide ones.

When to use an MLP

When not to use an MLP

Initialization


14 · Recurrent Neural Networks

Feedforward networks augmented with a hidden state that feeds back in at the next timestep. The classical way to process sequences, largely supplanted by Transformers but still relevant on resource-constrained devices.

The core recurrence

h_t = φ(W_x x_t + W_h h_{t−1} + b)
y_t = W_y h_t + b_y

The same weights W_x, W_h are reused across all timesteps. This is the weight-sharing that makes RNNs work on sequences of arbitrary length with a fixed parameter count.

Training: backprop through time (BPTT)

Unroll the RNN for T timesteps and apply standard backpropagation through the unrolled graph. The gradient of the loss with respect to W_h involves products of up to T Jacobians, which leads directly to the core problem:

Vanishing and exploding gradients

Roughly: if the spectral radius of W_h is less than 1, gradients shrink exponentially as they propagate backward through time → vanishing gradients → can't learn long-range dependencies. If greater than 1, gradients blow up → exploding gradients → training diverges. Plain RNNs struggle to remember anything more than ~10 steps back.
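
The exponential behaviour is easy to reproduce. A minimal sketch, assuming NumPy and two simplifications that are not in the text above: a linear recurrence (φ is the identity, so each Jacobian is just W_h) and a recurrent matrix built as a scaled random orthogonal matrix, so that every singular value equals the chosen spectral radius.

```python
import numpy as np

def backprop_gain(spectral_radius, steps, size=8, seed=0):
    """Norm of the product of `steps` Jacobians for h_t = W_h h_{t-1}."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(size, size)))   # random orthogonal matrix
    W_h = spectral_radius * Q                             # all singular values equal
    J = np.eye(size)
    for _ in range(steps):
        J = W_h.T @ J                                     # one step back through time
    return np.linalg.norm(J, 2)

shrink = backprop_gain(0.5, steps=50)    # about 0.5 ** 50
grow = backprop_gain(1.5, steps=50)      # about 1.5 ** 50
```

Fifty steps are enough to push the gradient scale below 1e-15 in one case and above 1e8 in the other. With a saturating nonlinearity the Jacobians also include the activation derivative, which makes vanishing more likely still.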

LSTM — the classic fix

Long Short-Term Memory (Hochreiter & Schmidhuber 1997) adds a separate cell state that flows through the sequence with only minor modifications at each step, controlled by three gates:

The cell state's update is approximately additive, so gradients can flow through it without vanishing. LSTMs can learn dependencies hundreds of steps long.

GRU — the simpler cousin

Gated Recurrent Unit (Cho et al. 2014) merges the forget and input gates into a single "update gate" and combines the cell and hidden states. Fewer parameters than LSTM, often similar performance. A reasonable default when you don't want to debate LSTM vs GRU.

Variants and tricks

Why Transformers won

Where RNNs still make sense


15 · Quantization

Reduce numeric precision of weights and activations — typically FP32 → INT8 — to shrink the model, accelerate inference, and lower energy consumption. Essential for edge deployment.

The basic mapping

q = round(x / s) + z    x ≈ s · (q − z)

where s is the scale and z is the zero point. Pick them so that the quantized range [q_min, q_max] (e.g., [−128, 127] for signed INT8) covers the float range [x_min, x_max] accurately.
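
The mapping and its inverse, as a minimal sketch assuming NumPy and signed INT8. The helper names are hypothetical, the min/max calibration is the simplest possible choice, and the code assumes x_max > x_min so the scale is nonzero.

```python
import numpy as np

def choose_qparams(x_min, x_max, q_min=-128, q_max=127):
    """Asymmetric scale and zero point mapping [x_min, x_max] onto [q_min, q_max]."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)   # keep 0.0 exactly representable
    s = (x_max - x_min) / (q_max - q_min)
    z = int(round(q_min - x_min / s))
    return s, z

def quantize(x, s, z, q_min=-128, q_max=127):
    """q = clip(round(x / s) + z); values outside the range saturate."""
    return np.clip(np.round(x / s) + z, q_min, q_max).astype(np.int8)

def dequantize(q, s, z):
    """x_hat = s * (q - z)."""
    return s * (q.astype(np.float32) - z)

x = np.linspace(-1.0, 3.0, 1001).astype(np.float32)
s, z = choose_qparams(float(x.min()), float(x.max()))
q = quantize(x, s, z)
x_hat = dequantize(q, s, z)
max_err = np.max(np.abs(x - x_hat))
```

Inside the calibrated range the round-trip error is bounded by half a quantization step, s/2. Anything outside the range is clipped, which is why the choice of x_min and x_max matters more than the rounding itself.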

Symmetric vs asymmetric

Standard practice: symmetric for weights (roughly zero-mean), asymmetric for activations.

Granularity

Post-Training Quantization (PTQ)

Take a trained FP32 model, run a calibration dataset through it to collect activation statistics, compute the scales and zero points, convert the weights. No retraining required.

PTQ typically loses 0–2% accuracy at INT8 on standard vision models, 3–5% on more sensitive architectures. At INT4 and below, PTQ usually isn't enough.

Quantization-Aware Training (QAT)

Insert "fake quantization" nodes into the model during training — they simulate the rounding and clipping of the target bit-width while keeping gradients flowing (straight-through estimator). The model learns weights that are robust to quantization noise.

QAT recovers almost all of the PTQ accuracy loss and is necessary for INT4, INT2, and binary networks. Costs about 1 additional training epoch's worth of effort.

Hardware support

Platform | Preferred formats
NVIDIA GPU (Ampere+) | FP16, BF16, INT8 via Tensor Cores; FP8 on Hopper
Google TPU | BF16 training, INT8 inference
ARM Cortex-A / Neon | INT8 matmul via NEON intrinsics
ARM Cortex-M / CMSIS-NN | INT8, INT16 fixed-point
TriCore / automotive MCUs | Fixed-point (usually INT16 or INT32 with custom scaling)

Typical wins

🚨 Traps

  • Calibrating on an unrepresentative dataset produces miscalibrated scales and silent accuracy loss.
  • Per-tensor quantization of weights with wide dynamic range (one huge outlier channel) destroys accuracy — always use per-channel for weights.
  • Layer norm and softmax are numerically sensitive — often kept in FP16 even in an otherwise INT8 model.
  • Depthwise convolutions are unusually sensitive to quantization; may need QAT even if the rest of the model doesn't.

16 · Pruning

Identify and remove the weights (or neurons, or channels, or whole layers) that contribute least to the network's output. Shrinks the model and, if done structurally, speeds up inference.

Granularity: the key choice

Granularity | What's removed | Compression | Speedup on dense HW
Unstructured (weight-level) | Individual weights | Very high (90%+ achievable) | None (produces a sparse matrix)
Vector / row / column | Sub-blocks of a weight matrix | High | Some, with structured sparse kernels
Filter / channel | Entire output channels of a conv or FC layer | Moderate | Full speedup on any HW
Layer / block | Whole layers (often in ResNets) | Modest | Full speedup

The fundamental tradeoff: unstructured pruning gives the highest compression ratio, but the resulting sparse weight matrix doesn't run any faster on a dense GEMM (the standard CPU/GPU matmul). To get actual speedup, you need structured pruning, or specialized sparse hardware (NVIDIA 2:4 sparsity on Ampere+, some mobile accelerators).

Pruning criteria

The prune–fine-tune cycle

  1. Train a model to convergence.
  2. Compute pruning scores for every weight/channel/filter.
  3. Remove the bottom X% (or set them to zero and mask them out).
  4. Fine-tune for a few epochs to recover accuracy.
  5. Repeat (iterative pruning) — often outperforms one-shot pruning of the same total amount.
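
Steps 2, 3 and 5 of the cycle above can be sketched for the unstructured, magnitude-based case. This assumes NumPy, a random weight matrix standing in for a trained layer, and an illustrative sparsity schedule; the fine-tuning step is only marked by a comment because it depends on your training loop.

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Boolean mask keeping the largest-|w| entries; `sparsity` is the fraction removed."""
    n_prune = int(round(sparsity * weights.size))
    if n_prune == 0:
        return np.ones(weights.shape, dtype=bool)
    order = np.argsort(np.abs(weights), axis=None)    # flat indices, smallest |w| first
    mask = np.ones(weights.size, dtype=bool)
    mask[order[:n_prune]] = False
    return mask.reshape(weights.shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))

mask = np.ones(W.shape, dtype=bool)
for target in (0.5, 0.75, 0.9):               # gradual schedule (illustrative values)
    mask = magnitude_mask(W * mask, target)   # pruned weights score 0 and stay pruned
    W = W * mask
    # ... fine-tune here, re-applying `mask` after every optimizer step ...
```

The result is a 90% sparse matrix stored densely, so this alone gives compression but no speedup on standard matmul kernels. For structured pruning, the same ranking is applied to a per-channel score (for example the norm of each filter) instead of individual weights.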

Iterative Magnitude Pruning (IMP) and the Lottery Ticket Hypothesis

Frankle & Carbin (2018) showed that for many networks, if you prune a trained network heavily (say 90%), then reset the surviving weights to their initial values and retrain from scratch — the sparse network trains to nearly the original accuracy. This suggests that dense networks contain small "winning ticket" subnetworks that, once found, are sufficient on their own.

One-shot vs gradual pruning

Practical guidance


17 · Knowledge Distillation

Train a small "student" model to imitate a large "teacher" — not just its predictions, but its full probability distribution. Transfers information beyond what hard labels convey.

The Hinton formulation (2015)

Let z be the logits of a model. Standard softmax gives p = softmax(z). Add a temperature T > 1:

p_i(T) = exp(z_i / T) / Σ_j exp(z_j / T)

Higher T produces a softer distribution (all probabilities pulled toward uniform). The student is trained to match both the teacher's softened distribution and the hard labels:

L = α · CE(student, hard) + (1−α) · T² · KL(teacher_T ‖ student_T)

The T² factor compensates for the soft-target gradients shrinking as 1/T², so the balance between the two terms stays roughly the same when T changes.
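
The loss for a single example, as a minimal sketch assuming NumPy. The logits are made up, α = 0.5 and T = 4 are illustrative defaults, and the KL term is written with the teacher as the reference distribution, KL(teacher ‖ student), which matches cross-entropy against the teacher's soft targets up to a constant.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, alpha=0.5, T=4.0):
    """alpha * CE(student, hard) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    ce = -np.log(softmax(student_logits)[hard_label])
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))
    return alpha * ce + (1.0 - alpha) * T ** 2 * kl

teacher = np.array([5.0, 3.0, 1.0, -1.0])
soft_T1 = softmax(teacher, T=1.0)
soft_T4 = softmax(teacher, T=4.0)      # flatter: more mass on the non-argmax classes

# alpha = 0 isolates the distillation term
same = distillation_loss(teacher, teacher, hard_label=0, alpha=0.0)
diff = distillation_loss(np.array([5.0, 1.0, 3.0, -1.0]), teacher, hard_label=0, alpha=0.0)
```

The second student predicts the same top class as the teacher but swaps the ranking of classes 2 and 3. Hard labels cannot see that difference; the soft-target term penalizes it.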

Why it works: dark knowledge

A teacher that outputs [0.7, 0.2, 0.08, 0.02] for four classes is saying more than just "class 1." The relative probabilities encode similarity structure: class 2 is "more like" class 1 than class 4 is. This information is present in the teacher but absent from the one-hot hard label. Temperature amplifies it.

Variants

Practical notes

Common pitfalls


18 · Compression Stack for Edge Deployment

Running neural networks on microcontrollers, automotive ECUs, or battery-constrained devices. The end-to-end pipeline that takes a GPU-trained research model down to a ~100 KB binary running at 1 ms latency.

The stack, in order

  1. Start with the right architecture. Don't try to compress a ResNet-152 for a microcontroller; start with MobileNet, EfficientNet-Lite, SqueezeNet, ShuffleNet, or a hand-designed tiny model. Architecture search tailored for mobile (NetAdapt, MnasNet, MCUNet) can find better starting points.
  2. Distillation. Train a larger, more accurate teacher model, then distill knowledge into the smaller target architecture. Typical wins: 1–3% accuracy at no inference-time cost.
  3. Pruning. Structured pruning (filters / channels / heads) to get real speedup on dense hardware. Iterative, with fine-tuning between rounds. Typical wins: 30–70% FLOP reduction at < 1% accuracy loss.
  4. Quantization. INT8 as the standard target. QAT if PTQ isn't enough or you need INT4. Typical wins: 4× smaller, 2–4× faster.
  5. Compile. Convert to a deployment format (TFLite, ONNX Runtime, TensorRT, CoreML, OpenVINO), let the compiler fuse operators, lay out memory, and tile for the target hardware. Often another 1.5–3× speedup on top.

Target hardware and toolchains

Target | Toolchain | Typical use
NVIDIA GPU (server) | TensorRT | Data-center inference
Intel CPU/iGPU | OpenVINO | Industrial / edge server
Mobile phone (ARM) | TFLite, Core ML, PyTorch Mobile | Consumer apps
Microcontroller (Cortex-M) | TFLite Micro, CMSIS-NN, microTVM | Sensor processing, keyword spotting
Automotive MCU (e.g. Infineon AURIX / TriCore) | Vendor SDKs + hand-tuned C, TargetLink, custom fixed-point code | Real-time control, safety-critical
Mobile NPU / DSP | Qualcomm SNPE, Huawei HiAI, Samsung Eden | On-device ML

Constraints that drive the pipeline

Multiplicative effect

Each stage multiplies the previous gains:

architecture (5×) × distillation (accuracy margin) × pruning (3×) × quantization (4×) × compiler (2×) = 120×, i.e. on the order of 100× total

Real numbers vary but 50–200× total reduction from initial research model to deployed binary is typical.

Monitoring in production

💡 Automotive / motor-control specifics

  • Fixed-point arithmetic with Q-format scaling is still the norm on TriCore and Cortex-M4 targets — INT8 quantization-aware training maps cleanly onto this.
  • ASIL constraints often exclude any library (including neural-net inference libraries) that hasn't been certified. Certified options exist (TÜV SÜD-certified TFLite variants, Infineon AURIX AI stacks), but the certification process influences architecture choices from the start.
  • Determinism trumps peak performance: a 500 µs consistent worst-case inference time is preferable to a 200 µs average with 2 ms tail.
  • The AUTOSAR / TargetLink toolchain typically consumes the trained model via generated C code with static LUTs and fixed-point matmul kernels, not an ML runtime.