Dataset · Training, Validation, Test
Any ML model is only as honest as the way you split its data. Three disjoint buckets — and they serve very different purposes.
Interactive split — drag to re-allocate 10,000 samples
Training set
Model sees this. Weights update from its gradients. Biggest slice — usually 70–80%.
Validation set
Model never trains on this, but you peek every epoch to tune hyperparameters, pick a model, and early-stop. 10–20%.
Test set
Locked away. Touched once at the end to report honest generalization. 10–20%.
Practical essentials
- Data leakage — if validation or test information sneaks into training (e.g. normalizing with statistics from the full dataset), reported accuracy is optimistic.
- k-fold cross-validation — for small datasets, split the training data into k folds and rotate which fold is held out for validation. Gives you a mean ± std of performance instead of one noisy number.
- Stratified splits — for classification, preserve class ratios in each split.
- Time-series — never shuffle. Always split chronologically to prevent leaking the future into training.
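A minimal sketch of a leakage-free split, assuming NumPy and a hypothetical feature matrix `X`. The helper name `split_indices` and the 70/15/15 ratios are illustrative choices, not something the text prescribes. The scaler statistics come from the training rows only and are then reused for validation and test.

```python
import numpy as np

def split_indices(n, train=0.7, val=0.15, shuffle=True, seed=0):
    """Return disjoint train/val/test index arrays that together cover range(n)."""
    idx = np.arange(n)
    if shuffle:
        # For time series pass shuffle=False so the split stays chronological.
        idx = np.random.default_rng(seed).permutation(n)
    n_train = round(n * train)
    n_val = round(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Hypothetical data: 10,000 samples with 3 features.
X = np.random.default_rng(1).normal(loc=5.0, scale=2.0, size=(10_000, 3))
train_idx, val_idx, test_idx = split_indices(len(X))

# Fit normalization statistics on the training rows only.
mu = X[train_idx].mean(axis=0)
sigma = X[train_idx].std(axis=0)

# Apply the same transform to every split.
X_train = (X[train_idx] - mu) / sigma
X_val = (X[val_idx] - mu) / sigma
X_test = (X[test_idx] - mu) / sigma
```

Computing `mu` and `sigma` from all of `X` instead would be the leakage case described in the first bullet.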
Signal Conditioning & Preprocessing
Before the model sees data, you clean it. Real sensor streams — motor currents, encoder positions, temperatures — arrive noisy, with outliers, and with behavior that shifts across operating regimes.
Savitzky–Golay smoothing · on a noisy current-like signal
SG fits a degree-P polynomial to a sliding window of W samples (W odd, W > P) by least-squares, then replaces the center sample with the polynomial's value there. Unlike a moving average, it largely preserves peak heights, widths, and derivatives, which is exactly what you want when the features themselves carry information.
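The window fit described above can be sketched directly with NumPy least squares. This is an illustrative implementation, not a replacement for a library routine; the function names and the edge-padding choice are assumptions. The filter weights are the row of the pseudo-inverse that yields the fitted polynomial's value at the window center.

```python
import numpy as np

def savgol_coeffs(window, degree):
    """Weights that give the least-squares polynomial's value at the window center."""
    half = window // 2
    x = np.arange(-half, half + 1)
    A = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, ...
    # The fitted value at x = 0 is the constant coefficient: row 0 of pinv(A) @ y.
    return np.linalg.pinv(A)[0]

def savgol_smooth(y, window=11, degree=3):
    if window % 2 == 0 or window <= degree:
        raise ValueError("window must be odd and larger than degree")
    c = savgol_coeffs(window, degree)
    half = window // 2
    padded = np.pad(y, half, mode="edge")  # assumption: repeat edge samples
    # Reversing c turns the convolution into a sliding dot product with c.
    return np.convolve(padded, c[::-1], mode="valid")

# Hypothetical noisy signal.
t = np.linspace(0.0, 1.0, 200)
noisy = np.sin(2 * np.pi * 3 * t) + np.random.default_rng(0).normal(0.0, 0.1, t.size)
smoothed = savgol_smooth(noisy, window=11, degree=3)
```

Away from the padded edges, any polynomial of degree at most P passes through this filter unchanged, which is the sense in which peaks and curvature survive.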
Outlier removal · z-score threshold
The shaded red band is outside the z-threshold. Points in red are flagged. Set the threshold too low and you discard legitimate data; set it too high and outliers survive into the model.
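A small sketch of threshold-based flagging, assuming NumPy. It shows the plain z-score next to a median/MAD variant; the latter is a global simplification of the Hampel idea (no sliding window), and the thresholds 3.0 and 3.5 are common conventions rather than values taken from the text.

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Flag points whose distance from the mean exceeds threshold standard deviations."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def mad_outliers(x, threshold=3.5):
    """Same idea, using the median and median absolute deviation (MAD)."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    # 0.6745 rescales the MAD so it is comparable to one std for Gaussian data.
    robust_z = 0.6745 * (x - med) / mad
    return np.abs(robust_z) > threshold

# Hypothetical signal: a clean sine with one injected spike at index 50.
signal = np.sin(np.linspace(0.0, 4.0 * np.pi, 200))
signal[50] = 10.0

flagged_z = np.flatnonzero(zscore_outliers(signal))
flagged_mad = np.flatnonzero(mad_outliers(signal))
```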
Temperature grouping · regime-aware analysis
Physical systems behave differently across operating regimes. Binning data by temperature (or any regime variable) lets you compute per-group statistics, train specialized models per regime, or simply verify that one global model generalizes across conditions.
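One way to check whether a global model holds up across regimes is to bin the residuals by the regime variable. The sketch below assumes NumPy; the temperatures, residuals, and bin edges are made-up illustration values.

```python
import numpy as np

def per_regime_rmse(regime_values, residuals, bin_edges):
    """RMSE of the residuals inside each regime bin."""
    # Bin 0 is below the first edge; the last bin is at or above the last edge.
    bins = np.digitize(regime_values, bin_edges)
    result = {}
    for b in np.unique(bins):
        r = residuals[bins == b]
        result[int(b)] = float(np.sqrt(np.mean(r ** 2)))
    return result

# Hypothetical temperatures (deg C) and model residuals.
temperature = np.array([10.0, 15.0, 30.0, 35.0, 70.0, 75.0])
residuals = np.array([0.1, -0.1, 0.2, -0.2, 1.0, -1.0])

rmse_by_bin = per_regime_rmse(temperature, residuals, bin_edges=[25.0, 50.0])
overall_rmse = float(np.sqrt(np.mean(residuals ** 2)))
```

In this toy data the overall RMSE is about 0.59, while the hottest bin alone sits at 1.0. A single global number can hide exactly that kind of gap.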
Typical preprocessing pipeline
- Anti-alias filter before any downsampling — critical for high-frequency motor signals.
- DC removal / detrending — subtract mean or a slow drift.
- Outlier detection — z-score, IQR, Hampel, or model-based residuals.
- Smoothing — SG for feature-preserving noise rejection, Kalman for model-based state estimation.
- Normalization — z-score or min–max, fit on training data only.
- Feature engineering — FFT bins, rolling stats, Park-transformed d/q quantities.
- Regime labeling — tag each sample with operating point (temperature, speed, load) for stratified validation.
Practical essentials
- SG vs moving average — SG preserves curvature and peak height; MA flattens both. SG derivatives are smooth and usable (e.g. differentiating encoder position to get velocity).
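- Outlier methods — z-score assumes roughly Gaussian data, and its mean and std are themselves inflated by the outliers; IQR is robust to skew; Hampel uses a rolling median and the median absolute deviation (MAD), so it stays robust when several outliers cluster in a window; model-based uses residuals from a first-pass fit.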
- Data leakage trap — fit scalers (mean, std, min, max) on the training set only. Apply the same transform to val and test. Fitting on the full dataset leaks test statistics into training.
- Regime-stratified validation — even with one global model, always analyze residuals per regime. Large per-bin errors hiding in a decent overall RMSE are a classic deployment failure.
Overfitting vs Underfitting
The central tension of ML. Move the slider — watch the model go from too stupid to too clever.
Polynomial regression · change model complexity
Bias–Variance curve
As complexity ↑: training error keeps dropping, validation error drops then rises. The sweet spot is the minimum of the validation curve (orange).
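The same curve can be reproduced numerically with a small sketch, assuming NumPy. The sine target, noise level, sample counts, and degree range are illustrative assumptions, not the page's actual demo data.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical noisy samples of one sine period on [0, 1]."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = make_data(30)
x_val, y_val = make_data(200)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_mse, val_mse = {}, {}
for degree in range(1, 10):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_mse[degree] = mse(coeffs, x_train, y_train)
    val_mse[degree] = mse(coeffs, x_val, y_val)

# The degree with the lowest validation error is the "sweet spot" for this data.
best_degree = min(val_mse, key=val_mse.get)
```

Printing `train_mse` and `val_mse` side by side shows the gap between the two opening up as the degree grows.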
Underfit High bias · low variance
- Model too simple to capture pattern
- High error on both training and validation
- Fix: bigger model, more features, train longer, reduce regularization
Overfit Low bias · high variance
- Model memorizes training noise
- Low training error, high validation error — big gap
- Fix: more data, regularization (L1/L2, dropout), early stopping, data augmentation, smaller model
How to diagnose in practice
- Plot training loss and validation loss versus epoch. The shape tells you which regime you're in.
- If both losses are high and close → underfit. If training keeps dropping while validation rises → overfit.
- Early stopping is the cheapest regularizer: stop training and keep the checkpoint from the validation-loss minimum.
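A patience-based early-stopping rule can be sketched in a few lines of plain Python. The function name, the `patience` and `min_delta` parameters, and the loss values are hypothetical illustration choices.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: this is the checkpoint you would keep.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical validation curve: falls, bottoms out at epoch 3, then rises.
val_curve = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
best_epoch, stop_epoch = early_stopping_epoch(val_curve, patience=3)
```

Here training stops at epoch 6 and the weights from epoch 3 are the ones to restore.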
Gradient Descent
How neural networks actually learn. Click on the loss surface to drop a ball — watch it roll toward a minimum.
Interactive 2-D loss surface
Click anywhere on the surface to place the starting point. Darker regions = lower loss.
The update rule
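w ← w − η · ∇_w L(w)
w = weights · η = learning rate · ∇_w L = gradient of the loss with respect to the weights · L(w) = loss
Three flavors
Batch gradient descent
Full dataset per step. Smooth path. Slow on big data. Rare today.
Mini-batch SGD
~32–512 samples per step. Noisy but fast, escapes shallow minima. The default.
Stochastic (single-sample) gradient descent
One sample per step. Very noisy. Mostly theoretical now.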
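The update rule and the three batch regimes can be illustrated on a toy linear-regression problem. This is a sketch assuming NumPy; the data, learning rate, and function name are hypothetical. Setting `batch_size` to the dataset size gives full-batch descent, and setting it to 1 gives the single-sample case.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.1, batch_size=32, epochs=50, seed=0):
    """Fit y ~ X @ w by gradient descent on the mean squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            error = X[batch] @ w - y[batch]
            grad = 2.0 * X[batch].T @ error / len(batch)  # gradient of the batch MSE
            w -= lr * grad  # w <- w - eta * grad
    return w

# Hypothetical noise-free data generated from known weights.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(200, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w

w_minibatch = sgd_linear_regression(X, y, batch_size=32)
w_full = sgd_linear_regression(X, y, batch_size=len(X))
```

With the same number of epochs, the mini-batch run takes several updates per epoch while the full-batch run takes one, which is part of why mini-batches are the practical default.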
Beyond vanilla GD
- Momentum — accumulates a velocity; smooths out oscillations and powers through flat regions.
- Adam — per-parameter adaptive learning rate (running averages of gradient and squared gradient). Robust default for most problems.
- LR scheduling — cosine decay, warmup, step decay. Large LR early for exploration, small LR late for precision.
- Vanishing / exploding gradients — chain-rule products shrink or blow up in deep nets. Fix with ReLU, batch norm, residual connections, gradient clipping.
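For reference, the Adam update can be written out in a few lines. This sketch assumes NumPy and uses the commonly cited default coefficients (0.9, 0.999, 1e-8); the quadratic test function and the learning rate are illustrative assumptions. The running average `m` plays the role of the momentum term, and `v` provides the per-parameter scaling.

```python
import numpy as np

def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a function given its gradient, using the Adam update."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)  # running average of the gradient
    v = np.zeros_like(w)  # running average of the squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step size
    return w

# Hypothetical objective: a simple quadratic bowl with its minimum at the origin.
def loss(w):
    return float(np.sum(w ** 2))

def grad(w):
    return 2.0 * w

w_start = np.array([5.0, -3.0])
w_final = adam_minimize(grad, w_start)
```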
Learning Rate Schedules
The single most important hyperparameter. Scheduling it over the course of training — not leaving it fixed — is standard practice.
Pick a schedule and watch the LR curve
Picking and tuning the LR
- LR range test (Smith 2017) — sweep LR from very small to large over a few epochs, plot loss vs LR, pick an order of magnitude below the point where loss stops decreasing.
- Linear scaling rule — if you multiply batch size by k, multiply LR by k (up to a point). Essential for multi-GPU.
- Warmup — 500–2000 steps of linear ramp from 0. Prevents early catastrophic divergence in deep / transformer networks.
- Cosine annealing — smooth decay that ends near zero. Small tail LR helps settle into a narrow minimum.
- Warm restarts (SGDR) — periodically reset LR to explore multiple basins. Snapshots at each restart can be ensembled.
- Adam vs SGD — Adam adapts per-parameter; a global schedule still helps but is less critical than for plain SGD.
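Warmup followed by cosine annealing fits in one small function. The sketch below is plain Python; the parameter names and default values (1,000 warmup steps, 10,000 total steps) are illustrative assumptions.

```python
import math

def lr_at_step(step, base_lr=1e-3, warmup_steps=1000, total_steps=10_000, min_lr=0.0):
    """Linear warmup from 0 to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Sample the schedule at a few points: start, mid-warmup, peak, mid-decay, end.
schedule = [lr_at_step(s) for s in (0, 500, 1000, 5500, 10_000)]
```

The sampled values rise linearly to the peak at the end of warmup, pass through half the peak midway through the decay, and end at `min_lr`.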
Regularization · L1, L2, Dropout
Techniques to stop the model memorizing the training set. Each one applies a different prior to the weights.
The regularized loss
L = L_data(w) + λ · R(w)
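As a concrete reading of the formula above, here is a sketch assuming NumPy and the common conventions R(w) = Σ|wᵢ| for L1 and R(w) = Σwᵢ² for L2. Some texts add a factor of ½ to the L2 term. The function names are hypothetical.

```python
import numpy as np

def l1_penalty(w):
    return float(np.sum(np.abs(w)))

def l2_penalty(w):
    return float(np.sum(w ** 2))

def regularized_loss(data_loss, w, lam=1e-4, kind="l2"):
    """Data loss plus lambda times the chosen penalty R(w)."""
    penalty = l1_penalty(w) if kind == "l1" else l2_penalty(w)
    return data_loss + lam * penalty

def penalty_grad(w, kind="l2"):
    # L2 pulls each weight toward zero in proportion to its size.
    # L1 pulls with the same magnitude regardless of size (a subgradient),
    # which tends to push small weights all the way to zero.
    return np.sign(w) if kind == "l1" else 2.0 * w

w = np.array([1.0, -2.0, 3.0])
total_l2 = regularized_loss(0.5, w, lam=0.1, kind="l2")  # 0.5 + 0.1 * 14
total_l1 = regularized_loss(0.5, w, lam=0.1, kind="l1")  # 0.5 + 0.1 * 6
```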
Effect on weights · side-by-side histograms
Dropout — visually
At each training step, each hidden neuron is independently dropped with probability p. A different subnetwork trains every step — that's what makes dropout a strong regularizer. At inference, all neurons stay active and activations are scaled by (1−p) (or training-time activations are scaled by 1/(1−p), a.k.a. "inverted dropout").
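The inverted-dropout variant mentioned above might look like this in NumPy. It is a sketch: the function name and the explicit `rng` argument are assumptions made for reproducibility.

```python
import numpy as np

def inverted_dropout(activations, p=0.5, training=True, rng=None):
    """Drop each unit with probability p and rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return activations  # inference: every unit active, no rescaling needed
    if rng is None:
        rng = np.random.default_rng()
    keep_mask = rng.random(activations.shape) >= p  # True with probability 1 - p
    return activations * keep_mask / (1.0 - p)

# Hypothetical layer output: all ones, so the effect of the mask is easy to see.
hidden = np.ones(100_000)
dropped = inverted_dropout(hidden, p=0.5, rng=np.random.default_rng(0))
```

Because the survivors are scaled up during training, the expected activation stays the same and the inference path needs no extra factor.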
When to use what
- L2 — almost always on. Start with λ = 1e-4 for most problems.
- L1 — when you want sparsity or feature selection, or when you suspect only a few inputs really matter.
- Dropout — classic for fully-connected layers and early CNNs. Mostly replaced by BN/LN + data augmentation in modern vision. Still used in some Transformer blocks (attention dropout).
- Early stopping — simplest regularizer. Stop training and keep the weights from the validation-loss minimum.
- Data augmentation — flipping, cropping, noise injection, mixup. Often more effective than weight-space regularization.
- Label smoothing — replace hard 0/1 labels with e.g. 0.05/0.95. Prevents overconfident outputs and helps calibration.
- Weight decay ≠ L2 in Adam — use AdamW if you want true decoupled weight decay.
Hyperparameter Search
The outer loop around training. How do you find the best combination of LR, batch size, dropout, depth, width? Three strategies — watch them search the space.
Grid vs Random vs Bayesian · live comparison
Convergence · best objective vs trial number
Good methods find good regions early and close the gap to the true optimum quickly. Bayesian optimization's advantage grows with cost per trial.
Three strategies
Grid search
- Evaluate every combo on a regular grid.
- Easy to reason about, trivially parallel.
- Cost explodes exponentially with dimensions.
- Wastes budget on dimensions that don't actually matter.
Random search
- Uniformly sample hyperparameters.
- Usually beats grid at equal budget (Bergstra & Bengio 2012).
- Still embarrassingly parallel.
- No learning between trials — ignores everything seen so far.
Bayesian optimization
- Fit a surrogate model of the objective (Gaussian Process, Random Forest, or Tree-structured Parzen Estimator).
- Pick the next trial using an acquisition function — Expected Improvement, Upper Confidence Bound, or Thompson sampling — which balances exploration (high uncertainty) against exploitation (high predicted value).
- Sample-efficient: each trial learns from every previous one.
- Harder to parallelize: each new trial depends on the results of earlier ones.
- Gold standard when each trial is expensive (a full training run).
What to know in practice
- Log scale for learning rate, weight decay, regularization strength. Linear for integers like layer count.
- Libraries — Optuna, Hyperopt, Ray Tune, Weights & Biases Sweeps, scikit-optimize.
- Multi-fidelity methods — Hyperband & BOHB start many trials cheaply, kill bad ones early, give survivors more budget. Essential when full training is expensive.
- TPE (Tree-structured Parzen Estimator) — default in Optuna/Hyperopt. Handles conditional and discrete hyperparameters better than vanilla GP.
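- ASHA — Asynchronous Successive Halving, the asynchronous relative of Hyperband. It promotes promising trials as soon as results arrive instead of waiting for a whole rung to finish, so it scales well to many parallel workers in deep-learning sweeps.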
- Don't tune on the test set — tune on validation. Your test metric is only honest if hyperparameters were never selected using it.
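Random search with log-scale sampling can be sketched with the standard library alone. The search ranges, the parameter names, and `toy_objective` (a stand-in for "train a model and return its validation loss") are all hypothetical.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Sample uniformly in log10 space between low and high."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

def random_search(objective, n_trials=50, seed=0):
    """Return (best_score, best_params); lower scores are better."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            "lr": sample_log_uniform(rng, 1e-5, 1e-1),            # log scale
            "weight_decay": sample_log_uniform(rng, 1e-6, 1e-2),  # log scale
            "num_layers": rng.randint(1, 6),                      # linear integer
        }
        score = objective(params)  # should be a validation metric, never the test metric
        if best is None or score < best[0]:
            best = (score, params)
    return best

def toy_objective(params):
    # Hypothetical loss surface with its optimum at lr = 1e-3 and 3 layers.
    return (math.log10(params["lr"]) + 3.0) ** 2 + 0.1 * abs(params["num_layers"] - 3)

best_score, best_params = random_search(toy_objective)
```

Each trial is independent, so the loop body can be farmed out to parallel workers without any coordination.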
The Training Loop
Everything above, in order. The 6 steps that turn raw weights into a working model.
Key vocabulary
- Epoch — one full pass through the training set.
- Iteration / step — one weight update (one mini-batch).
- Batch size — samples per step. Bigger = more stable gradients, more memory, sometimes worse generalization.
- Loss function — MSE for regression, cross-entropy for classification, custom for structured outputs.
- Backprop — reverse-mode automatic differentiation. Cost is a small constant multiple of the forward pass.
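The relationship between epochs, steps, and batch size is simple arithmetic. The numbers below are assumptions for illustration: 7,000 training samples (70% of a 10,000-sample dataset), a batch size of 64, and 20 epochs.

```python
import math

def training_schedule(n_train, batch_size, epochs):
    """Return (steps_per_epoch, total_steps) for a mini-batch training run."""
    # The last batch of an epoch may be smaller than batch_size, hence the ceiling.
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

steps_per_epoch, total_steps = training_schedule(n_train=7_000, batch_size=64, epochs=20)
# 7000 / 64 = 109.375, so 110 weight updates per epoch and 2200 in total.
```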
Feedforward Neural Network
The vanilla architecture. Data flows strictly left-to-right. No loops, no memory. Universal approximator for static input→output mappings.
Watch a forward pass
Each neuron computes a = σ(Σ wᵢxᵢ + b). Line thickness ≈ |weight|. Brightness ≈ activation strength.
Math in one line per layer
h_ℓ = σ( W_ℓ · h_{ℓ−1} + b_ℓ )
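The per-layer formula translates almost literally into NumPy. In this sketch the choice of ReLU for σ, the layer sizes, and the random weights are assumptions; the activation is applied at every layer, exactly as the formula is written.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, layers, activation=relu):
    """Apply h = activation(W @ h + b) for each (W, b) pair in order."""
    h = x
    for W, b in layers:
        h = activation(W @ h + b)
    return h

# Hypothetical 3 -> 4 -> 2 network with random weights and zero biases.
rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4)),
    (rng.normal(size=(2, 4)), np.zeros(2)),
]
out = forward(np.array([1.0, 0.5, -0.2]), layers)
```

In practice the final layer often uses a different activation (or none) depending on the task; that detail is left out here.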
Strengths
- Simple, parallelizable, fast to train
- Universal approximator (enough width/depth)
- Great for tabular data, static regression/classification
Limitations
- No memory of past inputs — can't handle sequences natively
- No spatial awareness — needs CNNs for images
- Parameter count explodes with input dimension (fully connected)
Recurrent Neural Network
A feedforward net with a memory loop. Same weights applied at every timestep — built for sequences.
Rolled vs unrolled view
The recurrence
h_t = σ( W_hh · h_{t−1} + W_xh · x_t + b )
y_t = W_hy · h_t
The hidden state h_t is a running summary of everything seen so far. Weights W_hh, W_xh, W_hy are shared across all timesteps, which is what makes it a recurrent net.
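Unrolled over a sequence, the recurrence above is a plain loop that reuses the same three matrices at every step. This sketch assumes NumPy and tanh for σ; the sizes and random weights are illustrative.

```python
import numpy as np

def rnn_forward(xs, W_hh, W_xh, W_hy, b, h0=None):
    """Run a vanilla RNN over a sequence; return all hidden states and outputs."""
    h = np.zeros(W_hh.shape[0]) if h0 is None else h0
    hidden_states, outputs = [], []
    for x_t in xs:
        # The same weights are applied at every timestep.
        h = np.tanh(W_hh @ h + W_xh @ x_t + b)
        hidden_states.append(h)
        outputs.append(W_hy @ h)
    return np.array(hidden_states), np.array(outputs)

# Hypothetical sizes: 5 timesteps, 2 inputs, 3 hidden units, 1 output.
rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 2))
hs, ys = rnn_forward(
    seq,
    W_hh=rng.normal(size=(3, 3)) * 0.5,
    W_xh=rng.normal(size=(3, 2)),
    W_hy=rng.normal(size=(1, 3)),
    b=np.zeros(3),
)
```

Only `h` has to be carried from one step to the next, which is the constant memory per step that makes RNNs attractive for streaming use.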
Why vanilla RNNs fail — and what fixed it
- Vanishing/exploding gradients — gradients flow through the same weight matrix at every step. Long sequences = products blow up or decay to zero.
- LSTM (1997) — adds input/forget/output gates and a protected cell state. Lets information flow across many timesteps.
- GRU — simpler LSTM (2 gates, no separate cell state). Usually comparable performance, fewer parameters.
- Transformers (2017) — replaced recurrence with self-attention. Parallelizable, scales to huge contexts. Dominant architecture today for sequences (NLP, audio, even vision).
- When RNNs still make sense — streaming inference, constrained embedded devices, online control loops where you truly only have O(1) memory per step.
Quantization
Replace 32-bit floats with fewer-bit integers. At INT8: 4× smaller, ~2–4× faster, almost the same accuracy, if you do it right.
FP32 → INT8 · watch the weight histogram
Two ways to do it
Post-training quantization (PTQ)
Train in FP32 → calibrate scales on a few hundred samples → quantize weights & activations. Fast, no retraining. Usually fine for 8-bit.
Quantization-aware training (QAT)
Simulate quantization during training via fake-quant ops. Weights learn to be quantization-friendly. Needed for 4-bit and below.
What it costs, what you gain
- Memory: FP32→INT8 ≈ 4× smaller; INT4 ≈ 8×.
- Compute: Integer MACs are cheaper on most silicon (especially edge/embedded like TriCore, Jetson, mobile NPUs).
- Accuracy: <1% drop typical at INT8 with good calibration. Degrades quickly at 4-bit and below without QAT.
- Symmetric vs asymmetric: symmetric uses a zero-centered scale (simpler); asymmetric adds a zero-point (better for ReLU outputs that are ≥0).
- Per-tensor vs per-channel: per-channel scales preserve accuracy for conv layers with very different weight ranges per filter.
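The symmetric and asymmetric schemes can be sketched per-tensor in NumPy. The function names are hypothetical, the `int8`/`uint8` casts assume at most 8 bits, and the input is assumed not to be all zeros (the scale would be zero).

```python
import numpy as np

def quantize_symmetric(x, num_bits=8):
    """Zero-centered scale: real value = q * scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_symmetric(q, scale):
    return q.astype(np.float32) * scale

def quantize_asymmetric(x, num_bits=8):
    """Scale plus zero-point: real value = (q - zero_point) * scale."""
    qmax = 2 ** num_bits - 1  # 255 for 8 bits
    # The representable range is widened to include zero if necessary.
    x_min, x_max = min(x.min(), 0.0), max(x.max(), 0.0)
    scale = (x_max - x_min) / qmax
    zero_point = int(round(-x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_asymmetric(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
weights = rng.normal(size=1000)                # roughly zero-centered
acts = np.maximum(rng.normal(size=1000), 0.0)  # ReLU-like, all >= 0

q_w, s_w = quantize_symmetric(weights)
w_hat = dequantize_symmetric(q_w, s_w)
q_a, s_a, zp_a = quantize_asymmetric(acts)
a_hat = dequantize_asymmetric(q_a, s_a, zp_a)
```

For the non-negative activations the asymmetric scheme spends all 256 levels on the range that actually occurs, whereas a symmetric scale would leave the negative half unused.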
Pruning
Most weights in a trained network are near zero and contribute little. Zero them out — the network barely notices.
Magnitude pruning · drag the threshold
Structured vs unstructured
Unstructured (fine-grained)
- Zero individual weights by magnitude
- Highest compression ratio (often 90%+)
- Needs sparse-matrix hardware to actually speed up
Structured (channel / filter / head)
- Remove whole neurons/channels/filters
- Less compression, but real speedup on any GPU/CPU
- Preferred for deployment
The standard recipe
- Train → prune → fine-tune. Single-shot pruning hurts; iterative pruning + retraining recovers most of the accuracy.
- Lottery ticket hypothesis (Frankle & Carbin, 2019) — a randomly-initialized dense net contains sparse subnetworks ("winning tickets") that, when trained in isolation from their original initialization, can match the dense net's accuracy.
- Combine with quantization — pruning + INT8 is standard for edge deployment. Often 10–20× end-to-end compression.
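Unstructured magnitude pruning comes down to a threshold on |w|. The sketch below assumes NumPy and picks the threshold as a quantile so that a target fraction of weights is zeroed; the function name and the 90% target are illustrative. The fine-tuning step of the recipe is not shown.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero the smallest-magnitude weights so that about `sparsity` of them are removed."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold  # True for the weights that survive
    return weights * mask, mask

# Hypothetical dense layer with 1000 weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(100, 10))
W_pruned, mask = magnitude_prune(W, sparsity=0.9)
achieved_sparsity = 1.0 - mask.mean()
```

In an iterative schedule you would raise `sparsity` in stages and keep applying `mask` during fine-tuning so pruned weights stay at zero.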
Knowledge Distillation
Train a small "student" model to mimic a big "teacher" model. The student learns from the teacher's soft probabilities, not just the hard labels.
Teacher → Student
Raising the temperature T in softmax(z/T) softens the teacher's distribution — revealing relative similarities between classes ("dark knowledge") that a hard label can't express.
Why soft labels help
A hard label for an image of a dog says: [1, 0, 0, 0] (dog, cat, truck, plane). A teacher's soft probs might say: [0.90, 0.09, 0.005, 0.005] — it thinks "cat" is somewhat plausible but "truck" is not. That second-order information is extra supervision the student learns from.
Variants and gotchas
- Response-based (Hinton 2015) — match the teacher's final outputs (temperature-softened probabilities, or the logits directly). Simplest and most common.
- Feature-based — match intermediate layer activations. Useful when student architecture differs a lot.
- Self-distillation — teacher and student have the same architecture; still helps.
- Combine with pruning + quantization — the standard compression pipeline for edge deployment.
- Temperature — typical T between 2 and 10. Too high = flat distribution, no signal. Scale loss by T² to keep gradient magnitudes right.
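A response-based distillation loss for a single example might be sketched as follows, assuming NumPy. The mixing weight `alpha`, the default T = 4, and the example logits are hypothetical; the soft term is a KL divergence between the softened teacher and student distributions, scaled by T².

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax for a 1-D vector of logits."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) at temperature T + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = float(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student))))
    ce = -float(np.log(softmax(student_logits)[hard_label]))  # hard-label term at T = 1
    return alpha * T ** 2 * kl + (1.0 - alpha) * ce

# Hypothetical logits for the classes (dog, cat, truck, plane).
teacher_logits = [5.0, 2.0, -1.0, -1.0]
student_logits = [3.0, 0.5, 0.0, -0.5]

sharp = softmax(teacher_logits, T=1.0)  # close to a hard label
soft = softmax(teacher_logits, T=4.0)   # flatter: class similarities become visible
loss_value = distillation_loss(student_logits, teacher_logits, hard_label=0)
```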
Concept Cheat Sheet
The 60-second answer for each concept. If you can say this out loud cleanly, you've got it.
Gradient descent
w ← w − η∇L(w). Move weights opposite to gradient of loss. Mini-batch SGD is the practical default; Adam is the robust adaptive version.
Knowledge distillation
Train a small student on the teacher's softened outputs (softmax(z/T)) plus the hard label. Transfers "dark knowledge" about class similarities. Enables huge compression with minimal accuracy loss.
Red flags to watch for in practice
- "Your training loss keeps dropping but validation flattens — what do you do?" → overfit, add regularization / more data / earlier stopping.
- "Learning rate too high — what happens?" → loss oscillates or diverges. Too low → slow / stuck in saddle.
- "Why doesn't vanilla RNN work on long sequences?" → vanishing/exploding gradients through repeated multiplication of the same recurrent weight matrix.
- "Difference between L1 and L2 regularization?" → L1 pushes weights exactly to zero (sparsity); L2 just shrinks them.
- "Why is batch size a big deal?" → affects gradient noise, convergence speed, memory, and generalization. Small batches regularize; huge batches need LR warmup and scaling.
- "When would you choose distillation over pruning?" → when you need architectural freedom in the student (different depth/width). Pruning keeps the same architecture.