Neural Networks — Interactive Concepts

Dataset · Training, Validation, Test

Any ML model is only as honest as the way you split its data. Three disjoint buckets — and they serve very different purposes.

Interactive split — drag to re-allocate 10,000 samples

Training % 70%

Validation % 15%

Train · 7000

Val · 1500

Test · 1500

Training set

Model sees this. Weights update from its gradients. Biggest slice — usually 70–80%.

Validation set

Model never trains on this, but you check it every epoch to tune hyperparameters, select a model, and decide when to stop early. 10–20%.

Test set

Locked away. Touched once at the end to report honest generalization. 10–20%.

Practical essentials

Data leakage — if validation or test information sneaks into training (e.g. normalizing with statistics from the full dataset), reported accuracy is optimistic.
k-fold cross-validation — for small datasets, split the training data into k folds and rotate which fold is held out for validation. Gives you a mean ± std of performance instead of one noisy number.
Stratified splits — for classification, preserve class ratios in each split.
Time-series — never shuffle. Always split chronologically to prevent leaking the future into training.
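As a minimal sketch of the split logic described above, the snippet below carves 10,000 sample indices into disjoint train, validation, and test lists, then computes normalization statistics from the training portion alone. The helper names `split_indices` and `fit_standardizer` and the toy feature values are illustrative assumptions, not part of any library.

```python
import random

def split_indices(n_samples, train_frac=0.70, val_frac=0.15, shuffle=True, seed=0):
    """Return three disjoint index lists: (train, val, test)."""
    indices = list(range(n_samples))
    if shuffle:  # pass shuffle=False for time series so the order stays chronological
        random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever is left over
    return train, val, test

def fit_standardizer(values):
    """Compute mean and standard deviation from the given (training) values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var ** 0.5

train_idx, val_idx, test_idx = split_indices(10_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 7000 1500 1500

# Toy feature values, made up for illustration.
data = [float(i % 97) for i in range(10_000)]
mean, std = fit_standardizer([data[i] for i in train_idx])  # training rows only
val_scaled = [(data[i] - mean) / std for i in val_idx]      # reuse the same statistics
```

With `shuffle=False` the three lists keep the original order, which gives a chronological split if the samples are already sorted by time.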

Signal Conditioning & Preprocessing

Before the model sees data, you clean it. Real sensor streams — motor currents, encoder positions, temperatures — arrive noisy, with outliers, and with behavior that shifts across operating regimes.

Savitzky–Golay smoothing · on a noisy current-like signal

Window size (odd) 15

Polynomial order 3

Legend: raw noisy signal · Savitzky–Golay smoothed · true underlying signal

SG fits a degree-P polynomial to a sliding window of W samples (W odd, W > P) by least-squares, then replaces the center sample with the polynomial's value there. Compared with a moving average of the same width, it better preserves peak heights, widths, and derivatives — exactly what you want when the features themselves carry information.
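The window-fit-and-evaluate procedure can be sketched in a few lines of NumPy. This is a from-scratch illustration under simple assumptions (uniform sampling, edges padded by repeating the end samples); the function names `savgol_coefficients` and `savgol_smooth` are made up here, and a production pipeline would more likely call an existing library routine.

```python
import numpy as np

def savgol_coefficients(window, order):
    """Least-squares weights giving the fitted polynomial's value at the window center."""
    if window % 2 == 0 or window <= order:
        raise ValueError("window must be odd and larger than the polynomial order")
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    # Vandermonde matrix: column j holds offsets**j.
    design = np.vander(offsets, order + 1, increasing=True)
    # Row 0 of the pseudo-inverse maps the window samples to the constant term,
    # which equals the polynomial evaluated at offset 0 (the center sample).
    return np.linalg.pinv(design)[0]

def savgol_smooth(signal, window=15, order=3):
    """Smooth a 1-D signal; edges are handled by repeating the end samples."""
    coeffs = savgol_coefficients(window, order)
    half = window // 2
    padded = np.pad(np.asarray(signal, dtype=float), half, mode="edge")
    # The smoothing weights are symmetric, so convolution equals correlation here.
    return np.convolve(padded, coeffs, mode="valid")
```

For a window of 5 and order 2 the weights come out as the classic (−3, 12, 17, 12, −3)/35, and any polynomial up to the chosen order passes through the filter unchanged away from the edges.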

Outlier removal · z-score threshold

z-threshold 3.0σ

The shaded red band is outside the z-threshold. Points in red are flagged. Lower the threshold and you discard legitimate data; raise it and outliers survive into the model.
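A z-score filter of this kind can be sketched with the standard library alone. The readings below are synthetic and the function name `zscore_outliers` is an illustrative assumption.

```python
from statistics import mean, pstdev

def zscore_outliers(samples, threshold=3.0):
    """Return one boolean per sample: True where |x - mean| / std exceeds the threshold."""
    mu = mean(samples)
    sigma = pstdev(samples)
    if sigma == 0:
        return [False] * len(samples)
    return [abs(x - mu) / sigma > threshold for x in samples]

# Synthetic readings around 10.0 with one injected spike.
readings = [10.0 + 0.1 * ((i * 7) % 11 - 5) for i in range(100)]
readings[40] = 25.0

flags = zscore_outliers(readings, threshold=3.0)
kept = [x for x, bad in zip(readings, flags) if not bad]
print(len(kept), sum(flags))  # 99 1
```

Note that the spike itself inflates the mean and standard deviation used to judge it, which is one reason median-based methods such as Hampel are often preferred on heavily contaminated data.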

Temperature grouping · regime-aware analysis

Physical systems behave differently across operating regimes. Binning data by temperature (or any regime variable) lets you compute per-group statistics, train specialized models per regime, or simply verify that one global model generalizes across conditions.
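Per-bin statistics can be computed along these lines. The sketch assumes equal-width bins between the minimum and maximum of the regime variable; `per_bin_stats` is a hypothetical helper name.

```python
import numpy as np

def per_bin_stats(temperature, values, n_bins=4):
    """Return bin edges and a list of (bin_id, count, mean, std) tuples."""
    temperature = np.asarray(temperature, dtype=float)
    values = np.asarray(values, dtype=float)
    edges = np.linspace(temperature.min(), temperature.max(), n_bins + 1)
    # Use interior edges only, so ids run 0..n_bins-1 and the maximum lands in the last bin.
    bin_ids = np.digitize(temperature, edges[1:-1])
    stats = []
    for b in range(n_bins):
        in_bin = values[bin_ids == b]
        if in_bin.size == 0:
            stats.append((b, 0, float("nan"), float("nan")))
        else:
            stats.append((b, int(in_bin.size), float(in_bin.mean()), float(in_bin.std())))
    return edges, stats
```

The same `bin_ids` array can be reused to compute per-regime residuals of a global model, or to select the rows for a regime-specific model.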

Number of bins 4

Legend: raw samples · per-bin mean ± 1σ

Typical preprocessing pipeline

Anti-alias filter before any downsampling — critical for high-frequency motor signals.
DC removal / detrending — subtract mean or a slow drift.
Outlier detection — z-score, IQR, Hampel, or model-based residuals.
Smoothing — SG for feature-preserving noise rejection, Kalman for model-based state estimation.
Normalization — z-score or min–max, fit on training data only.
Feature engineering — FFT bins, rolling stats, Park-transformed d/q quantities.
Regime labeling — tag each sample with operating point (temperature, speed, load) for stratified validation.

Practical essentials

SG vs moving average — SG preserves curvature and peak height; MA flattens both. SG derivatives are smooth and usable (e.g. differentiating encoder position to get velocity).
Outlier methods — z-score assumes roughly Gaussian data; IQR is robust to skew; Hampel compares each point with the median and median absolute deviation (MAD) of a sliding window and is robust to clustered outliers; model-based uses residuals from a first-pass fit.
Data leakage trap — fit scalers (mean, std, min, max) on the training set only. Apply the same transform to val and test. Fitting on the full dataset leaks test statistics into training.
Regime-stratified validation — even with one global model, always analyze residuals per regime. Large per-bin errors hiding in a decent overall RMSE are a classic deployment failure.

Overfitting vs Underfitting

The central tension of ML. Move the slider — watch the model go from too stupid to too clever.

Polynomial regression · change model complexity

Polynomial degree 3

Legend: training points · validation points · true function · model prediction

Bias–Variance curve

As complexity ↑: training error keeps dropping, while validation error drops and then rises. The sweet spot is the minimum of the validation curve (orange in the plot).
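The shape of these curves can be reproduced with a small polynomial-degree sweep. The data here are synthetic (a noisy sine, chosen as an assumption for illustration), so the exact numbers carry no meaning; only the trend does.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(3 * x)

x_train = np.sort(rng.uniform(-1, 1, 30))
x_val = np.sort(rng.uniform(-1, 1, 30))
y_train = true_fn(x_train) + rng.normal(0, 0.2, x_train.size)
y_val = true_fn(x_val) + rng.normal(0, 0.2, x_val.size)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

def sweep_degrees(max_degree=12):
    """Fit one polynomial per degree; return (degree, train_mse, val_mse) tuples."""
    results = []
    for degree in range(1, max_degree + 1):
        coeffs = np.polyfit(x_train, y_train, degree)
        results.append((degree,
                        mse(y_train, np.polyval(coeffs, x_train)),
                        mse(y_val, np.polyval(coeffs, x_val))))
    return results

for degree, train_err, val_err in sweep_degrees():
    print(f"degree {degree:2d}  train {train_err:.4f}  val {val_err:.4f}")
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones; the validation column is the one to read for the sweet spot.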

Underfit High bias · low variance

Model too simple to capture pattern
High error on both training and validation
Fix: bigger model, more features, train longer, reduce regularization

Overfit Low bias · high variance

Model memorizes training noise
Low training error, high validation error — big gap
Fix: more data, regularization (L1/L2, dropout), early stopping, data augmentation, smaller model

How to diagnose in practice

Plot training loss and validation loss versus epoch. The shape tells you which regime you're in.
If both losses are high and close → underfit. If training keeps dropping while validation rises → overfit.
Early stopping is the cheapest regularizer — freeze at the validation-loss minimum.
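Early stopping is usually implemented with a patience counter. The sketch below replays a recorded validation-loss curve; the parameter names `patience` and `min_delta` follow common usage but are assumptions here, and the curve values are made up.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a recorded validation-loss curve."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # Stop here; in a real run you would restore the checkpoint from best_epoch.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

curve = [1.00, 0.80, 0.65, 0.60, 0.58, 0.59, 0.61, 0.64, 0.70, 0.78]
print(early_stopping_epoch(curve, patience=3))  # (4, 7)
```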

Gradient Descent

How neural networks actually learn. Click on the loss surface to drop a ball — watch it roll toward a minimum.

Interactive 2-D loss surface

Learning rate 0.040

Loss surface

Click anywhere on the surface to place the starting point. Darker regions = lower loss.

The update rule

w ← w − η · ∇_w L(w)

w

Model weights (a long vector in practice)

η

Learning rate — how big a step to take

∇L

Gradient of the loss w.r.t. weights — direction of steepest ascent, so we move opposite

L(w)

Loss function — MSE, cross-entropy, etc.
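The update rule can be tried on a toy two-parameter loss. The bowl-shaped function below is an assumption chosen for illustration; it is not the surface used by the interactive demo.

```python
def loss(w):
    # An elongated bowl with its minimum at (1, -2).
    return (w[0] - 1.0) ** 2 + 5.0 * (w[1] + 2.0) ** 2

def grad(w):
    return [2.0 * (w[0] - 1.0), 10.0 * (w[1] + 2.0)]

def gradient_descent(w_start, learning_rate=0.04, steps=200):
    w = list(w_start)
    for _ in range(steps):
        g = grad(w)
        # w <- w - eta * grad: step against the direction of steepest ascent.
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w

print(gradient_descent([-3.0, 3.0]))                      # ends close to (1, -2)
print(gradient_descent([-3.0, 3.0], learning_rate=0.25))  # second coordinate blows up
```

On this surface the steep direction has curvature 10, so any learning rate above 0.2 makes that coordinate overshoot by more than it corrects, and the iterates diverge.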

Three flavors

Batch GD

Full dataset per step. Smooth path. Slow on big data. Rare today.

Mini-batch SGD

~32–512 samples per step. Noisy but fast, escapes shallow minima. The default.

Pure SGD

One sample per step. Very noisy. Mostly theoretical now.

Beyond vanilla GD

Momentum — accumulates a velocity; smooths out oscillations and powers through flat regions.
Adam — per-parameter adaptive learning rate (running averages of gradient and squared gradient). Robust default for most problems.
LR scheduling — cosine decay, warmup, step decay. Large LR early for exploration, small LR late for precision.
Vanishing / exploding gradients — chain-rule products shrink or blow up in deep nets. Fix with ReLU, batch norm, residual connections, gradient clipping.
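For reference, the Adam update with its two running averages and bias correction can be written out by hand. This is a plain-Python sketch using the commonly quoted default coefficients; `adam_minimize` and the toy quadratic are illustrative assumptions.

```python
import math

def adam_minimize(grad_fn, w0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    w = list(w0)
    m = [0.0] * len(w)  # running average of the gradient
    v = [0.0] * len(w)  # running average of the squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(w)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction for the zero initialization
            v_hat = v[i] / (1 - beta2 ** t)
            w[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w

def toy_grad(w):
    # Gradient of (w0 - 3)^2 + 10 * (w1 + 1)^2; the two axes differ 10x in scale.
    return [2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)]

print(adam_minimize(toy_grad, [0.0, 0.0]))  # ends close to (3, -1)
```

Because each parameter's step is divided by the root of its own squared-gradient average, the two badly scaled axes take similarly sized steps.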

Learning Rate Schedules

The learning rate is often the single most important hyperparameter. Scheduling it over the course of training — not leaving it fixed — is standard practice.

Pick a schedule and watch the LR curve

Scheduler

Initial LR × 1000 0.010

Total epochs 100

Picking and tuning the LR

LR range test (Smith 2017) — sweep LR from very small to large over a few epochs, plot loss vs LR, pick an order of magnitude below the point where loss stops decreasing.
Linear scaling rule — if you multiply batch size by k, multiply LR by k (up to a point). Essential for multi-GPU.
Warmup — 500–2000 steps of linear ramp from 0. Prevents early catastrophic divergence in deep / transformer networks.
Cosine annealing — smooth decay that ends near zero. Small tail LR helps settle into a narrow minimum.
Warm restarts (SGDR) — periodically reset LR to explore multiple basins. Snapshots at each restart can be ensembled.
Adam vs SGD — Adam adapts per-parameter; a global schedule still helps but is less critical than for plain SGD.
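Warmup followed by cosine decay, plus the linear scaling rule, fit in two short functions. The function names and default values below are assumptions for illustration.

```python
import math

def warmup_cosine_lr(step, total_steps, base_lr=0.01, warmup_steps=500, min_lr=0.0):
    """Linear ramp from 0 to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def linearly_scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: multiply the LR by the same factor as the batch size."""
    return base_lr * new_batch_size / base_batch_size

for step in (0, 250, 500, 5250, 10_000):
    print(step, round(warmup_cosine_lr(step, total_steps=10_000), 6))
```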

Regularization · L1, L2, Dropout

Techniques to stop the model memorizing the training set. Each one applies a different prior to the weights.

The regularized loss

L = L_data(w)  +  λ · R(w)

L2 (ridge)

R(w) = ½‖w‖² — Gaussian prior. Shrinks all weights smoothly toward zero. "Weight decay" in SGD ≡ L2.

L1 (lasso)

R(w) = ‖w‖₁ — Laplace prior. Pushes many weights to exactly zero → sparsity and implicit feature selection.

Elastic Net

α‖w‖₁ + (1−α)½‖w‖² — combination. Inherits sparsity from L1 and stability from L2.

Dropout

Randomly zero activations at rate p during training. Like training an ensemble of sub-networks. At inference, use the full network with activations scaled.
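The different behavior of the L1 and L2 penalties defined above shows up in a single update step. The sketch uses a plain gradient step for L2 and a soft-thresholding (proximal) step for L1, a standard way to handle the non-smooth absolute value; the weight values and function names are made up for illustration.

```python
def l2_penalty(w):
    return 0.5 * sum(wi * wi for wi in w)

def l1_penalty(w):
    return sum(abs(wi) for wi in w)

def l2_shrink(w, lam, lr):
    # Gradient step on lam * 0.5 * ||w||^2: every weight is scaled by (1 - lr * lam).
    return [wi * (1.0 - lr * lam) for wi in w]

def l1_soft_threshold(w, lam, lr):
    # Proximal step for lam * ||w||_1: weights within lr * lam of zero become exactly zero.
    t = lr * lam
    return [0.0 if abs(wi) <= t else wi - t * (1.0 if wi > 0 else -1.0) for wi in w]

weights = [0.8, -0.03, 0.002, -0.5, 0.04]
shrunk = l2_shrink(weights, lam=0.5, lr=0.1)
sparse = l1_soft_threshold(weights, lam=0.5, lr=0.1)
print(sum(w == 0.0 for w in shrunk), sum(w == 0.0 for w in sparse))  # 0 3
```

L2 rescales every weight by the same factor, so small weights stay small but nonzero; L1 subtracts a fixed amount, which is what snaps small weights to exactly zero.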

Effect on weights · side-by-side histograms

λ (regularization strength) 0.00

Dropout — visually

Dropout rate p 0.30

At each training step, each hidden neuron is independently dropped with probability p. A different subnetwork trains every step — that's what makes dropout a strong regularizer. At inference, all neurons stay active and activations are scaled by (1−p) (or training-time activations are scaled by 1/(1−p), a.k.a. "inverted dropout").
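The inverted-dropout variant mentioned above can be sketched as follows. The function name and the choice of NumPy are assumptions; deep-learning frameworks ship their own dropout layers.

```python
import numpy as np

def inverted_dropout(activations, p=0.3, training=True, rng=None):
    """Drop each unit with probability p during training and rescale the survivors."""
    if not training or p == 0.0:
        return activations  # inference: all neurons active, no rescaling needed
    rng = np.random.default_rng() if rng is None else rng
    keep_mask = rng.random(activations.shape) >= p
    # Dividing by (1 - p) keeps the expected activation the same as without dropout.
    return activations * keep_mask / (1.0 - p)

hidden = np.ones((1000, 100))
dropped = inverted_dropout(hidden, p=0.3, rng=np.random.default_rng(0))
print(round(float((dropped == 0).mean()), 2), round(float(dropped.mean()), 2))
```

Scaling at training time means the inference path is an ordinary forward pass, which is why this variant is the usual choice in practice.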

When to use what

L2 — almost always on. Start with λ = 1e-4 for most problems.
L1 — when you want sparsity or feature selection, or when you suspect only a few inputs really matter.
Dropout — classic for fully-connected layers and early CNNs. Mostly replaced by batch norm / layer norm (BN/LN) + data augmentation in modern vision. Still used in some Transformer blocks (attention dropout).
Early stopping — simplest regularizer. Freeze at the validation-loss minimum.
Data augmentation — flipping, cropping, noise injection, mixup. Often more effective than weight-space regularization.
Label smoothing — replace hard 0/1 labels with e.g. 0.05/0.95. Prevents overconfident outputs and helps calibration.
Weight decay ≠ L2 in Adam — use AdamW if you want true decoupled weight decay.

Hyperparameter Search

The outer loop around training. How do you find the best combination of LR, batch size, dropout, depth, width? Three strategies — watch them search the space.

Grid vs Random vs Bayesian · live comparison

Method Budget 25

Legend: trial evaluated · best so far · true optimum (hidden from search)

Convergence · best objective vs trial number

Good methods find good regions early and close the gap to the true optimum quickly. Bayesian optimization's advantage grows with cost per trial.

Three strategies

Grid search

Evaluate every combo on a regular grid.
Easy to reason about, trivially parallel.
Cost explodes exponentially with dimensions.
Wastes budget on dimensions that don't actually matter.

Random search

Uniformly sample hyperparameters.
Usually beats grid at equal budget (Bergstra & Bengio 2012).
Still embarrassingly parallel.
No learning between trials — ignores everything seen so far.

Bayesian optimization

Fit a surrogate model of the objective (Gaussian Process, Random Forest, or Tree-structured Parzen Estimator).
Pick the next trial using an acquisition function — Expected Improvement, Upper Confidence Bound, or Thompson sampling — which balances exploration (high uncertainty) against exploitation (high predicted value).
Sample-efficient: each trial learns from every previous one.
Harder to parallelize; somewhat serial.
Gold standard when each trial is expensive (a full training run).

What to know in practice

Log scale for learning rate, weight decay, regularization strength. Linear for integers like layer count.
Libraries — Optuna, Hyperopt, Ray Tune, Weights & Biases Sweeps, scikit-optimize.
Multi-fidelity methods — Hyperband & BOHB start many trials cheaply, kill bad ones early, give survivors more budget. Essential when full training is expensive.
TPE (Tree-structured Parzen Estimator) — default in Optuna/Hyperopt. Handles conditional and discrete hyperparameters better than vanilla GP.
ASHA — Asynchronous Successive Halving Algorithm, the asynchronous relative of the successive halving used inside Hyperband; 10–100× speedup vs naive Bayesian opt for deep-learning sweeps.
Don't tune on the test set — tune on validation. Your test metric is only honest if hyperparameters were never selected using it.
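Random search with log-scale sampling takes only a few lines. The search ranges, the `toy_objective` stand-in, and the function names below are all hypothetical; a real objective would train a model and return its validation score.

```python
import math
import random

def sample_config(rng):
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),  # log-uniform between 1e-5 and 1e-1
        "weight_decay": 10 ** rng.uniform(-6, -2),   # log-uniform
        "num_layers": rng.randint(1, 6),             # linear scale for an integer
        "dropout": rng.uniform(0.0, 0.5),
    }

def random_search(objective, budget=25, seed=0):
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(budget):
        config = sample_config(rng)
        score = objective(config)  # a validation metric, never the test metric
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_objective(config):
    # Hypothetical stand-in for "train, then score on validation"; peaks at lr = 1e-3.
    return -(math.log10(config["learning_rate"]) + 3.0) ** 2

best_config, best_score = random_search(toy_objective, budget=25)
print(best_config["learning_rate"], best_score)
```

Each trial is independent, so the loop body can be farmed out to parallel workers without any coordination.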

The Training Loop

Everything above, in order. The 6 steps that turn raw weights into a working model.

Key vocabulary

Epoch — one full pass through the training set.
Iteration / step — one weight update (one mini-batch).
Batch size — samples per step. Bigger = more stable gradients, more memory, sometimes worse generalization.
Loss function — MSE for regression, cross-entropy for classification, custom for structured outputs.
Backprop — reverse-mode automatic differentiation. Cost is a small constant multiple of the forward pass.
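The vocabulary above maps onto code as follows. The page's own step list is not reproduced in the text, so this loop follows one common ordering (shuffle, take a mini-batch, forward pass, loss gradient, weight update, log) as an assumption. The model is a linear regressor with a hand-written MSE gradient, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=512)

def train(X, y, epochs=20, batch_size=32, learning_rate=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    history = []
    for epoch in range(epochs):                        # one epoch = one full pass
        order = rng.permutation(len(X))                # reshuffle every epoch
        for start in range(0, len(X), batch_size):     # one iteration = one mini-batch
            batch = order[start:start + batch_size]
            pred = X[batch] @ w                        # forward pass
            error = pred - y[batch]
            grad = 2.0 * X[batch].T @ error / len(batch)  # gradient of the batch MSE
            w -= learning_rate * grad                  # weight update
        history.append(float(np.mean((X @ w - y) ** 2)))  # full-set loss per epoch
    return w, history

w, history = train(X, y)
print(np.round(w, 2), history[-1])
```

With 512 samples and a batch size of 32 there are 16 iterations per epoch, so 20 epochs amount to 320 weight updates.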

Feedforward Neural Network

The vanilla architecture. Data flows strictly left-to-right. No loops, no memory. Universal approximator for static input→output mappings.

Watch a forward pass

Speed

Each neuron computes a = σ(Σ wᵢxᵢ + b). Line thickness ≈ |weight|. Brightness ≈ activation strength.

Math in one line per layer

h_ℓ = σ( W_ℓ · h_{ℓ−1} + b_ℓ )

Input layer

Raw features — no computation, just values.

Hidden layers

Affine transform + nonlinear activation. Stack of these = depth.

Output layer

Task-specific activation: softmax (classification), linear (regression), sigmoid (binary).

Activations

ReLU (default), GELU, tanh, sigmoid. Must be nonlinear or the whole net collapses to a single linear map.
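A forward pass through such a stack can be sketched in NumPy. The layer sizes, the ReLU hidden activation, the softmax output, and the initialization scale are all choices made for this example, not requirements of the architecture.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_layers(sizes, seed=0):
    """Return one (W, b) pair per layer, with W of shape (n_out, n_in)."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    h = x                                   # input layer: raw features, no computation
    for W, b in layers[:-1]:
        h = relu(W @ h + b)                 # hidden layer: affine transform + nonlinearity
    W_out, b_out = layers[-1]
    return softmax(W_out @ h + b_out)       # output layer: class probabilities

layers = init_layers([4, 8, 8, 3])
print(forward(np.ones(4), layers))
```

Replacing `relu` with the identity would make the whole stack equal to a single matrix product plus a bias, which is the collapse to one linear map mentioned above.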

Strengths

Simple, parallelizable, fast to train
Universal approximator (enough width/depth)
Great for tabular data, static regression/classification

Limitations

No memory of past inputs — can't handle sequences natively
No spatial awareness — needs CNNs for images
Parameter count explodes with input dimension (fully connected)

Recurrent Neural Network

A feedforward net with a memory loop. Same weights applied at every timestep — built for sequences.

Rolled vs unrolled view

Timesteps 4

The recurrence

h_t = σ( W_hh · h_{t−1} + W_xh · x_t + b )
y_t = W_hy · h_t

The hidden state h_t is a running summary of everything seen so far; feeding it back into the next step is what makes the net recurrent. The weights W_hh, W_xh, W_hy are shared across all timesteps, so the same cell handles sequences of any length.
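The recurrence unrolls into a simple loop. In this sketch σ is taken to be tanh (a common choice, assumed here) and the weights are random placeholders rather than trained values.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b, h0=None):
    """Run the recurrence over a sequence; return all hidden states and outputs."""
    h = np.zeros(W_hh.shape[0]) if h0 is None else h0
    hs, ys = [], []
    for x_t in xs:                                # the same weights at every timestep
        h = np.tanh(W_hh @ h + W_xh @ x_t + b)    # h_t from h_{t-1} and x_t
        hs.append(h)
        ys.append(W_hy @ h)                       # y_t = W_hy h_t
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(0)
W_xh = 0.1 * rng.normal(size=(5, 3))
W_hh = 0.1 * rng.normal(size=(5, 5))
W_hy = 0.1 * rng.normal(size=(2, 5))
b = np.zeros(5)
xs = rng.normal(size=(4, 3))                      # 4 timesteps, 3 features each

hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy, b)
print(hs.shape, ys.shape)  # (4, 5) (4, 2)
```

Only the latest hidden state needs to be carried forward, which is the O(1) memory per step that makes the cell suitable for streaming use.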

Why vanilla RNNs fail — and what fixed it

Vanishing/exploding gradients — gradients flow through the same weight matrix at every step. Long sequences = products blow up or decay to zero.
LSTM (1997) — adds input/output gates and a protected cell state; the now-standard forget gate was added in a follow-up a few years later. Lets information flow across many timesteps.
GRU — simpler LSTM (2 gates, no separate cell state). Usually comparable performance, fewer parameters.
Transformers (2017) — replaced recurrence with self-attention. Parallelizable, scales to huge contexts. Dominant architecture today for sequences (NLP, audio, even vision).
When RNNs still make sense — streaming inference, constrained embedded devices, online control loops where you truly only have O(1) memory per step.

Quantization

Replace 32-bit floats with fewer-bit integers. At INT8: 4× smaller, ~2–4× faster, almost the same accuracy — if you do it right.

FP32 → INT8 · watch the weight histogram

Bit-width 8-bit

Discrete levels

256

Memory footprint

25% of FP32

Two ways to do it

Post-Training Quantization (PTQ)

Train in FP32 → calibrate scales on a few hundred samples → quantize weights & activations. Fast, no retraining. Usually fine for 8-bit.

Quantization-Aware Training (QAT)

Simulate quantization during training via fake-quant ops. Weights learn to be quantization-friendly. Needed for 4-bit and below.

What it costs, what you gain

Memory: FP32→INT8 ≈ 4× smaller; INT4 ≈ 8×.
Compute: Integer MACs are cheaper on most silicon (especially edge/embedded like TriCore, Jetson, mobile NPUs).
Accuracy: <1% drop typical at INT8 with good calibration. Degrades quickly at 4-bit and below without QAT.
Symmetric vs asymmetric: symmetric uses a zero-centered scale (simpler); asymmetric adds a zero-point (better for ReLU outputs that are ≥0).
Per-tensor vs per-channel: per-channel scales preserve accuracy for conv layers with very different weight ranges per filter.
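Symmetric per-tensor quantization, the simplest of the schemes listed above, can be sketched as follows. The function names are illustrative; real toolchains add calibration of activation ranges, zero-points for the asymmetric case, and per-channel scales.

```python
import numpy as np

def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers with a single zero-centered scale."""
    qmax = 2 ** (bits - 1) - 1                               # 127 for 8-bit
    scale = max(float(np.max(np.abs(weights))), 1e-12) / qmax
    dtype = np.int8 if bits <= 8 else np.int32
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(dtype)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=1000)
q, scale = quantize_symmetric(w, bits=8)
max_error = float(np.max(np.abs(w - dequantize(q, scale))))
print(q.dtype, max_error <= scale / 2 + 1e-12)  # int8 True
```

Rounding to the nearest level bounds the error of every weight by half a quantization step, so a single large outlier weight widens the step and costs precision for all the others.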

Pruning

Many weights in a trained network are near zero and contribute little. Zero them out — the network barely notices.

Magnitude pruning · drag the threshold

Sparsity target 30%

Weights remaining

70%

Zeroed-out

30%

Structured vs unstructured

Unstructured (fine-grained)

Zero individual weights by magnitude
Highest sparsity (often 90%+ of weights removed)
Needs sparse-matrix hardware to actually speed up

Structured (channel / filter / head)

Remove whole neurons/channels/filters
Less compression, but real speedup on any GPU/CPU
Preferred for deployment

The standard recipe

Train → prune → fine-tune. Single-shot pruning hurts; iterative pruning + retraining recovers most of the accuracy.
Lottery ticket hypothesis (Frankle & Carbin, 2019) — a randomly-initialized dense net contains sparse subnetworks ("winning tickets") that, when trained in isolation from their original initialization, can match the dense net's accuracy.
Combine with quantization — pruning + INT8 is standard for edge deployment. Often 10–20× end-to-end compression.
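Unstructured magnitude pruning reduces to choosing a threshold and building a mask. The sketch below does a single pruning pass on random placeholder weights; `magnitude_prune` is a hypothetical helper name.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.3):
    """Zero the smallest-magnitude weights; return the pruned copy and the keep-mask."""
    flat = np.abs(weights).ravel()
    k = int(round(sparsity * flat.size))            # number of weights to zero
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]    # k-th smallest magnitude
    mask = np.abs(weights) > threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(100, 100))
pruned, mask = magnitude_prune(w, sparsity=0.3)
print(int((pruned == 0).sum()), float(mask.mean()))  # 3000 0.7
```

During the fine-tuning phase the same mask is typically re-applied after every weight update so that pruned positions stay at zero.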

Knowledge Distillation

Train a small "student" model to mimic a big "teacher" model. The student learns from the teacher's soft probabilities, not just the hard labels.

Teacher → Student

Raising the temperature T in softmax(z/T) softens the teacher's distribution — revealing relative similarities between classes ("dark knowledge") that a hard label can't express.

Why soft labels help

A hard label for an image of a dog says: [1, 0, 0, 0] (dog, cat, truck, plane). A teacher's soft probs might say: [0.90, 0.09, 0.005, 0.005] — it thinks "cat" is somewhat plausible but "truck" is not. That second-order information is extra supervision the student learns from.

Variants and gotchas

Response-based (Hinton 2015) — match the teacher's final outputs (temperature-softened probabilities, or the logits directly). Simplest and most common.
Feature-based — match intermediate layer activations. Useful when student architecture differs a lot.
Self-distillation — teacher and student have the same architecture; still helps.
Combine with pruning + quantization — the standard compression pipeline for edge deployment.
Temperature — typical T between 2 and 10. Too high = flat distribution, no signal. Scale the soft-target loss term by T² so its gradients stay comparable in magnitude to the hard-label term.
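A response-based distillation loss for a single example can be sketched in plain Python. The mixing weight `alpha` and the example logits are assumptions made for illustration; the soft term is the KL divergence between temperature-softened teacher and student distributions, scaled by T².

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                                  # subtract the max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=4.0, alpha=0.5):
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft term: KL(teacher || student) on the softened distributions.
    soft = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Hard term: ordinary cross-entropy against the true label at T = 1.
    hard = -math.log(softmax(student_logits)[hard_label])
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard

teacher = [4.0, 1.0, -2.0, -2.0]   # dog, cat, truck, plane
print([round(p, 3) for p in softmax(teacher, 1.0)])
print([round(p, 3) for p in softmax(teacher, 4.0)])
print(distillation_loss([1.0, 1.0, 0.0, 0.0], teacher, hard_label=0))
```

The second printed distribution is visibly flatter than the first, which is the softening effect of a higher temperature.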

Concept Cheat Sheet

The 60-second answer for each concept. If you can say this out loud cleanly, you've got it.

Train / Val / Test

Three disjoint splits. Train fits weights. Val tunes hyperparams and picks early-stop point. Test is looked at once for the final honest number. Prevent leakage between them.

Signal preprocessing

Anti-alias → detrend → outlier removal → smooth → normalize → feature-engineer → label regime. Fit scalers on training data only.

Savitzky–Golay

Local polynomial least-squares smoother. Preserves peaks and derivatives. Key params: window size (odd), polynomial order (< window).

Outlier removal

Z-score (Gaussian), IQR (robust to skew), Hampel (MAD-based, robust to clusters of outliers), model-based residuals.

Temperature grouping

Bin data by operating regime (temperature, speed, load). Compute per-bin statistics, stratify validation, optionally train separate models per bin.

Underfitting

Model too simple → high error on both train and val → add capacity, train longer, fewer constraints.

Overfitting

Model memorizes training noise → low train error, high val error → regularize (L1/L2, dropout), more data, early stopping, data augmentation.

Gradient descent

w ← w − η∇L(w). Move weights opposite to gradient of loss. Mini-batch SGD is the practical default; Adam is the robust adaptive version.

Learning rate

Most important hyperparameter. Schedule it: cosine annealing (default), warmup + cosine (transformers), step decay, SGDR (warm restarts), reduce-on-plateau. Linear scaling rule for batch size.

Regularization

L2 shrinks weights smoothly (always on, λ~1e-4). L1 drives weights exactly to zero (sparsity). Dropout zeros activations during training (p=0.2–0.5). Early stopping is the cheapest.

Hyperparameter search

Grid (exhaustive, scales badly), Random (usually beats grid, parallel), Bayesian (surrogate + acquisition function, sample-efficient, serial). Multi-fidelity (Hyperband/BOHB/ASHA) kills bad trials early.

Backpropagation

Reverse-mode autodiff: apply chain rule backwards through the computation graph. Cost is a small constant multiple of the forward pass.

Feedforward NN

Stacked affine + nonlinear layers. Universal approximator for static inputs. No memory.

Recurrent NN

Feedforward + hidden state that feeds back in. Same weights across timesteps. Struggles with long sequences (vanishing grads) — LSTM/GRU fix it, transformers eventually replaced it for most tasks.

Quantization

Reduce numeric precision (FP32 → INT8/INT4). ~4–8× smaller, faster integer ops, <1% accuracy loss at INT8. Two flavors: post-training (PTQ, no retraining) and quantization-aware training (QAT, needed for very low bit-widths).

Pruning

Remove weights below a magnitude threshold. Unstructured = max compression, needs sparse HW. Structured (channels/filters) = real speedup on any HW. Prune → fine-tune → repeat.

Knowledge distillation

Small student trained to match large teacher's soft probability distribution (softmax(z/T)) plus the hard label. Transfers "dark knowledge" about class similarities. Enables huge compression with minimal accuracy loss.

Compression stack for edge

Distill → prune → quantize → deploy. Each step is multiplicative — can give 50–100× total reduction.

Red flags to watch for in practice

"Your training loss keeps dropping but validation flattens — what do you do?" → overfit, add regularization / more data / earlier stopping.
"Learning rate too high — what happens?" → loss oscillates or diverges. Too low → slow / stuck in saddle.
"Why doesn't vanilla RNN work on long sequences?" → vanishing/exploding gradients through repeated multiplication of the same recurrent weight matrix.
"Difference between L1 and L2 regularization?" → L1 pushes weights exactly to zero (sparsity); L2 just shrinks them.
"Why is batch size a big deal?" → affects gradient noise, convergence speed, memory, and generalization. Small batches regularize; huge batches need LR warmup and scaling.
"When would you choose distillation over pruning?" → when you need architectural freedom in the student (different depth/width). Pruning keeps the same architecture.