Autoencoders & VAEs — An Illustrated Primer

The core idea

An autoencoder is a neural network with an unusual job description: reproduce your own input. Given an image of a cat, output that same image. Given a sensor trace, return that same trace. Sounds pointless — until you notice the catch: the network has to pass its input through a narrow middle layer that can only hold a handful of numbers. It must squeeze the essence of the thing down to a few coordinates, then rebuild it on the other side.

That squeeze is where learning happens. The network cannot memorize pixel-by-pixel; it has to find structure. It has to discover that what makes a cat a cat can be summarized by, say, thirty-two numbers. The encoder learns to compress. The decoder learns to reconstruct. And the narrow waist in between — the latent space — becomes a map of the data's underlying geometry.

Key idea

Force a network to reconstruct its input through a bottleneck, and it will teach itself what matters. The bottleneck is a filter for meaning.

Anatomy of an autoencoder

The whole machine has three parts: an encoder that shrinks the input, a latent code that sits in the middle, and a decoder that expands it back out. Training is unsupervised — no labels needed — because the target is just the input itself.

Figure: An autoencoder rendered as a sketch. The encoder (left) progressively narrows; the decoder (right) mirrors it. In the interactive version, a slider sets the bottleneck width, that is, how many numbers the latent code can hold.

What actually happens, mathematically, is simple. The encoder is a function z = f(x) that turns your high-dimensional input x into a low-dimensional code z. The decoder is another function x' = g(z) that tries to reconstruct x from z. Training minimizes reconstruction loss — typically mean squared error between x and x'. That's it. No labels, no supervision, just "try not to lose anything important in the middle."
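To make z = f(x), x' = g(z) and the reconstruction loss concrete, here is a minimal sketch in NumPy. It assumes the simplest possible case: one linear layer for the encoder, one for the decoder, trained by plain gradient descent on mean squared error. The names `W_enc`, `W_dec`, the toy data, and the learning rate are illustrative choices, not anything taken from the demo above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 200 samples in 8 dimensions that lie on a 2-D subspace.
latent_true = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
x = latent_true @ mixing


def encode(x, W_enc):
    """f(x): project the input down to the latent code z."""
    return x @ W_enc


def decode(z, W_dec):
    """g(z): expand the latent code back to input space."""
    return z @ W_dec


def mse(x, x_rec):
    """Mean squared reconstruction error."""
    return float(np.mean((x - x_rec) ** 2))


def train(x, latent_dim=2, lr=0.05, steps=2000):
    n, d = x.shape
    W_enc = rng.normal(scale=0.1, size=(d, latent_dim))
    W_dec = rng.normal(scale=0.1, size=(latent_dim, d))
    losses = []
    for _ in range(steps):
        z = encode(x, W_enc)
        x_rec = decode(z, W_dec)
        err = x_rec - x
        losses.append(mse(x, x_rec))
        # Gradients of the MSE with respect to each weight matrix.
        grad_dec = z.T @ err * (2.0 / (n * d))
        grad_enc = x.T @ (err @ W_dec.T) * (2.0 / (n * d))
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
    return W_enc, W_dec, losses


W_enc, W_dec, losses = train(x)
print("first loss:", losses[0], "last loss:", losses[-1])
```

No label appears anywhere in the loop: the target of the loss is `x` itself. A real autoencoder would stack nonlinear layers and use an autodiff framework, but the structure is the same.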

The bottleneck, interactively

The most concrete way to understand the bottleneck is to watch reconstruction quality degrade as you make it tighter. Below is a synthetic "signal": a sum of sinusoids meant to stand in for anything you might log from a sensor, such as a current waveform, a vibration trace, or a price series. The red curve is the input. The green curve is its reconstruction from the top k principal components. This is what a linear autoencoder with a k-dimensional bottleneck and squared-error loss converges to: at its optimum it spans the same subspace as PCA.

Figure: the input signal and its reconstruction, with sliders for latent dimensions and signal complexity. The reconstruction gets visibly better as you add capacity. Note the sharp knee: early dimensions capture the dominant structure, later ones chase noise and fine detail. This is exactly the intuition behind choosing a latent size in practice.

Two things to notice. First, reconstruction quality is non-linear in latent size: the first few dimensions matter enormously, the rest give diminishing returns. Second, if you push the signal complexity up faster than the latent size, the reconstruction stays smooth but loses high-frequency content. The autoencoder is choosing what to keep, and under a squared-error loss it keeps the components that carry the most variance. It keeps the loud stuff.
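The experiment can be approximated offline. The sketch below builds a family of sum-of-sinusoid signals and reconstructs them from the top k principal components, which stands in for an optimal linear autoencoder. The signal generator, including the assumption that amplitude falls off with frequency, is a guess at the spirit of the demo rather than its actual code.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_signals(n_signals=300, n_points=128, complexity=5):
    """Each row is a sum of `complexity` sinusoids with random amplitudes.

    Assumption: amplitude scale falls off as 1/frequency, so the low
    frequencies are the "loud" ones.
    """
    t = np.linspace(0.0, 1.0, n_points, endpoint=False)
    freqs = np.arange(1, complexity + 1)
    amps = rng.normal(size=(n_signals, complexity)) / freqs
    return amps @ np.sin(2 * np.pi * freqs[:, None] * t[None, :])


def pca_reconstruct(x, k):
    """Encode to k principal-component scores, then decode back."""
    mean = x.mean(axis=0)
    centered = x - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]                 # shape (k, n_points)
    codes = centered @ components.T     # the "encoder"
    return codes @ components + mean    # the "decoder"


x = make_signals()
errors = {k: float(np.mean((x - pca_reconstruct(x, k)) ** 2)) for k in range(1, 7)}
for k, err in errors.items():
    print(f"k={k}  mse={err:.6f}")
```

With five sinusoids in the signal, the error drops steeply over the first few components and is numerically zero once k reaches 5, since nothing is left to capture. On real, noisy data the curve flattens instead of hitting zero, and the knee is what you look for.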

What lives inside the middle?

The latent code isn't just a compressed file. It's an organized space. When you train an autoencoder on, say, handwritten digits, similar digits tend to end up near each other in the latent space: most of the 7s cluster in one region, most of the 0s in another. The network has learned a coordinate system for "digit-ness" without ever being told what the labels are.

Figure: Three classes of points (Class A, Class B, Class C) in the original space, colored by their true label (the autoencoder never sees the colors). In the interactive version, pressing Train shows the 2-dimensional latent representation organizing itself: same-class points gravitate together purely because they look alike.

This organization is the reason autoencoders are useful for more than compression. Once you have a well-organized latent space you can: detect anomalies (points that don't fit any cluster reconstruct badly), visualize high-dimensional data (plot the 2-D latent codes), pre-train representations for downstream classifiers, and — crucially — generate new data by sampling from the latent space and decoding. But that last one has a subtle problem, which is where the VAE walks in.

Enter the Variational Autoencoder

A plain autoencoder maps each input to a single point in latent space. If you want to generate new data, you'd like to pick a random point and decode it — but where exactly? The points your network has seen cluster in some unknown region; in between and around them is empty territory where the decoder produces garbage. The latent space, for a vanilla autoencoder, is full of holes.

The Variational Autoencoder addresses this by encoding each input not as a point, but as a probability distribution: specifically, a Gaussian with a mean μ and a standard deviation σ. During training, the actual latent code is sampled from that Gaussian. The decoder therefore has to cope with every sample from each cloud, and neighbouring clouds overlap. Together these push the latent space toward being continuous, so a point you sample from it is far more likely to decode to something plausible.

Figure: two panels, "Autoencoder: point" (left) and "VAE: distribution" (right), with a slider for the VAE spread σ. On the left, the autoencoder drops each input at an exact point. On the right, the VAE encodes to a cloud; the shaded ellipse is one standard deviation of the encoded distribution. In the interactive version, each press of Draw sample draws a fresh code from the cloud. This is the core move.

The loss function has two jobs

VAEs add a second term to the loss. The first is still reconstruction quality. The second is a KL divergence that pushes each encoded distribution toward a standard Gaussian, a unit blob centered at the origin. These two forces are in tension: reconstruction wants distributions narrow and far apart (each input should decode crisply); the KL term wants them wide and centered. The equilibrium is a smooth, well-packed latent space where points sampled from the prior tend to decode to something coherent.
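For a diagonal Gaussian encoder and a standard Gaussian prior, the KL term has a closed form, so the two-part loss is short to write down. The sketch below assumes the encoder outputs a mean and a log-variance per latent dimension (a common parametrization, not something stated above) and uses summed squared error as the reconstruction term. The function names and the `beta` weight are illustrative.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims.

    Closed form: 0.5 * sum(sigma^2 + mu^2 - 1 - log(sigma^2)).
    """
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def vae_loss(x, x_rec, mu, log_var, beta=1.0):
    """Reconstruction term plus weighted KL term, averaged over the batch."""
    reconstruction = np.sum((x - x_rec) ** 2, axis=-1)
    kl = kl_to_standard_normal(mu, log_var)
    return float(np.mean(reconstruction + beta * kl))


# An encoding that already equals the prior pays no KL penalty...
print(kl_to_standard_normal(np.zeros(2), np.zeros(2)))
# ...while a narrow cloud far from the origin pays a large one.
print(kl_to_standard_normal(np.array([3.0, 3.0]), np.log(np.array([0.01, 0.01]))))
```

The second call shows the tension: a narrow, off-center distribution is exactly what crisp reconstruction favors, and exactly what the KL term penalizes.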

The reparameterization trick

Sampling breaks gradient flow: you can't backpropagate directly through a random draw. The fix: write z = μ + σ · ε, where ε is noise drawn from a standard Gaussian, ε ~ N(0, I), outside the network. Gradients flow through μ and σ cleanly; the randomness sits to the side. This is the trick that makes VAEs trainable with ordinary backpropagation.
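A small numerical check makes the trick less abstract. The sketch below uses a toy downstream loss, L = E[z²], chosen only because its true gradients are known: since E[z²] = μ² + σ², the gradient is 2μ with respect to μ and 2σ with respect to σ. The values of `mu` and `sigma` and the loss itself are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps, with eps drawn outside the differentiable path."""
    return mu + sigma * eps


mu, sigma = 1.5, 0.6
eps = rng.standard_normal(100_000)   # the randomness, sampled "to the side"
z = reparameterize(mu, sigma, eps)

# Chain rule through z: dz/dmu = 1 and dz/dsigma = eps.
# For L = mean(z^2), dL/dz = 2z for each sample.
grad_mu = float(np.mean(2 * z * 1.0))
grad_sigma = float(np.mean(2 * z * eps))

print("grad wrt mu:   ", grad_mu, "(analytic value 2*mu =", 2 * mu, ")")
print("grad wrt sigma:", grad_sigma, "(analytic value 2*sigma =", 2 * sigma, ")")
```

The Monte Carlo estimates land close to the analytic values. The noise `eps` is treated as a constant input, and μ and σ receive ordinary gradients.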

A walk through latent space

The real reward for all this machinery is that the trained VAE's latent space becomes a smooth manifold of possibilities. Move your cursor a little, the output morphs a little. There are no cliffs, no dead zones. Try it — drag anywhere in the left pane below and watch the decoded shape respond.

Figure: two panes, "Latent space" (draggable in the interactive version) and "Decoded output", with a readout of the current z₁ and z₂. A toy VAE decoder: four "prototype" shapes sit at the corners of the 2D latent space (circle, star, flower, cog). The decoder smoothly interpolates between them based on where you drag. This is exactly the behavior that makes VAEs useful for generating: pull the cursor, pull a new valid sample.
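A toy decoder of this kind can be sketched with bilinear interpolation. The code below is a guess at how such a demo might work, not its actual implementation: each prototype is represented as a radius profile r(θ), the four profiles sit at the corners of the square [-1, 1]², and the decoder blends them by position. The specific shape formulas are invented for illustration.

```python
import numpy as np

THETA = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)

# Hypothetical prototypes, one per corner of the latent square.
PROTOTYPES = {
    (-1, -1): np.ones_like(THETA),                       # circle
    (1, -1): 1.0 + 0.4 * np.cos(5 * THETA),              # star-like
    (-1, 1): 1.0 + 0.3 * np.abs(np.sin(3 * THETA)),      # flower-like
    (1, 1): 1.0 + 0.15 * np.sign(np.sin(12 * THETA)),    # cog-like
}


def decode(z1, z2):
    """Map a latent point to an outline, as an array of (x, y) pairs."""
    # Rescale each coordinate from [-1, 1] to a blend weight in [0, 1].
    u = (np.clip(z1, -1.0, 1.0) + 1.0) / 2.0
    v = (np.clip(z2, -1.0, 1.0) + 1.0) / 2.0
    radius = (
        (1 - u) * (1 - v) * PROTOTYPES[(-1, -1)]
        + u * (1 - v) * PROTOTYPES[(1, -1)]
        + (1 - u) * v * PROTOTYPES[(-1, 1)]
        + u * v * PROTOTYPES[(1, 1)]
    )
    return np.stack([radius * np.cos(THETA), radius * np.sin(THETA)], axis=1)


center = decode(0.0, 0.0)
nudged = decode(0.01, 0.0)
print("largest coordinate change for a 0.01 step:", float(np.max(np.abs(nudged - center))))
```

A small step in latent space produces a small change in the outline, which is the "no cliffs, no dead zones" property in miniature. A trained VAE decoder is a learned nonlinear function rather than a fixed blend, so this only illustrates the smoothness, not the learning.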

Where these things earn their keep

Autoencoders and VAEs aren't just pretty demos — they're workhorses in several practical domains. Here are the most common jobs they're hired for.

Anomaly detection

Train on normal data only. At inference, anomalies reconstruct poorly. Widely used for manufacturing defects, credit card fraud, network intrusion, and — for those of us in powertrain — sensor-level fault detection in inverters, bearings, and windings.
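Here is a minimal sketch of that recipe, using a linear autoencoder (PCA) as a stand-in for a trained network. All data is synthetic and the helper names are hypothetical. The threshold is set from a percentile of held-out normal scores, which is a simplification for when no labeled anomalies are available.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "healthy" readings: 6 channels driven by 2 hidden factors.
MIXING = rng.normal(size=(2, 6))


def make_normal(n):
    factors = rng.normal(size=(n, 2))
    return factors @ MIXING + 0.05 * rng.normal(size=(n, 6))


def fit_linear_ae(x, k):
    """Fit the top-k principal subspace; returns (mean, components)."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:k]


def score(x, mean, components):
    """Per-sample reconstruction error: higher means more anomalous."""
    centered = x - mean
    reconstructed = centered @ components.T @ components
    return np.mean((centered - reconstructed) ** 2, axis=1)


train = make_normal(500)       # normals only
held_out = make_normal(200)    # more normals, used to set the threshold
mean, comps = fit_linear_ae(train, k=2)
threshold = np.percentile(score(held_out, mean, comps), 99)

# Synthetic faults: readings pushed off the normal manifold.
faulty = make_normal(50) + rng.normal(scale=2.0, size=(50, 6))
flags = score(faulty, mean, comps) > threshold
print("fraction of faulty samples flagged:", float(flags.mean()))
```

The model never sees a fault during fitting. Faulty samples stand out only because the 2-D code cannot reproduce them.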

Denoising

Train with corrupted inputs and clean targets. The network learns to invert the noise process. Applied in medical imaging, audio restoration, and cleaning up noisy current/voltage measurements before they hit downstream control logic.
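The training setup (corrupted input, clean target) can be shown with a deliberately simple stand-in: a single linear map fitted by least squares. This is not a full denoising autoencoder, since it has no nonlinearity and no explicit bottleneck, but the objective is the same. The synthetic data and the noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Clean signals: 10 channels that really depend on only 2 hidden factors.
clean = rng.normal(size=(1000, 2)) @ rng.normal(size=(2, 10))
noisy = clean + 0.5 * rng.normal(size=clean.shape)   # corrupted inputs

train_in, train_target = noisy[:800], clean[:800]
test_in, test_target = noisy[800:], clean[800:]

# Fit W to minimize ||train_in @ W - train_target||^2:
# the input is the noisy version, the target is the clean one.
W, *_ = np.linalg.lstsq(train_in, train_target, rcond=None)
denoised = test_in @ W

err_before = float(np.mean((test_in - test_target) ** 2))
err_after = float(np.mean((denoised - test_target) ** 2))
print("mse of raw noisy input:", err_before)
print("mse after denoising:   ", err_after)
```

Because the clean data has low-dimensional structure, the learned map can discard most of the noise that falls outside it. A nonlinear denoising autoencoder applies the same idea to structure that a linear map cannot capture.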

Dimensionality reduction

A nonlinear cousin of PCA. Great for visualization, for pre-compressing inputs to heavy downstream models, and for reducing high-dimensional calibration surfaces (think flux maps, look-up tables) to a handful of meaningful knobs.

Pre-training & representation learning

Autoencoder-trained encoders provide general-purpose features that jump-start downstream supervised tasks. Particularly valuable when labels are scarce but unlabeled data is abundant.

Generative modeling (VAE)

Sample from the latent prior, decode, and you have new synthetic data. Used to augment training sets for rare classes — including rare fault conditions that never appear enough in real operation to train a classifier on.

Interpolation & design exploration

Because the latent space is smooth, you can interpolate between known examples to explore plausible intermediates. Useful for design tools, shape morphing, and for sweeping operating points in controller tuning.

Semantic compression

Beyond bits-saved, the latent code carries meaning. Cluster latent codes and you often find interpretable groupings — driving modes, fault types, customer segments — without having labeled any of them.

Conditional generation

Condition the encoder/decoder on a label (CVAE) and you can generate class-specific samples. Good for balanced synthetic datasets and for exploring "what a fault of type X might look like under operating condition Y."

Practical notes from the trenches

The gap between a textbook autoencoder and one that works on real data is usually a pile of small, unglamorous decisions. A non-exhaustive list:

Choose the latent size by the elbow, not by faith. Sweep dim ∈ {2, 4, 8, 16, 32, 64} and plot validation reconstruction error. There's usually a knee — past it, you're modeling noise. Below it, you're throwing away signal.
Normalize your inputs. Especially with MSE loss, one feature with huge variance will eat the reconstruction budget. Z-score or min-max every feature. Log-scale heavy-tailed quantities (currents, vibrations).
MSE is not always the right loss. For image-like data, MSE produces blurry reconstructions; try perceptual losses or binary cross-entropy on pixel intensities. For heavy-tailed signals, Huber loss is more forgiving of outliers.
VAE training often needs a KL warm-up. Start with the KL term weight at zero and ramp it up over the first few epochs. Otherwise the KL term can win early, every encoded distribution collapses to the prior, and the decoder learns to ignore the code entirely. This failure mode is called posterior collapse, and it is common.
β-VAE is your tuning knob. Scaling the KL term by a coefficient β trades reconstruction against disentanglement. β < 1 gives sharper reconstructions and a less regular latent space. β > 1 gives a smoother, often more disentangled latent space at the cost of blurrier reconstructions. There is no free lunch.
For anomaly detection, train only on normals and use reconstruction error as the score. Don't include any anomalies in training — they'll corrupt the "normal" manifold. Pick a threshold on a held-out validation set of both normals and anomalies.
Watch out for identity collapse. If the encoder and decoder are too powerful, and the bottleneck too wide, the network can learn a near-identity map that reconstructs everything perfectly — including anomalies. Narrow the bottleneck or add noise.
Denoising autoencoders are often better than plain ones. Corrupt the input (drop features, add noise) and ask the network to reconstruct the clean version. You get more robust, more meaningful latent codes and better downstream performance nearly for free.
Use convolutional architectures for spatial data, and 1D convolutions or LSTMs for time series. MLPs work but often underperform, because they don't exploit the structure you know is there.
For generation, prefer VAE or diffusion; for compression and anomaly detection, vanilla or denoising AE is often enough. Don't reach for the fanciest tool if a simpler one solves the problem.
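As one concrete example from the list, the KL warm-up can be a one-line schedule. The sketch below assumes a linear ramp over a fixed number of epochs and a final weight `beta`. Both the function name and the default of 10 warm-up epochs are illustrative, and other ramp shapes (cyclical, sigmoid) are also used in practice.

```python
def kl_weight(epoch, warmup_epochs=10, beta=1.0):
    """Weight on the KL term: ramps linearly from 0 to `beta`, then stays there."""
    if warmup_epochs <= 0:
        return beta
    return beta * min(1.0, epoch / warmup_epochs)


# Inside a training loop the total loss would be, per batch:
#   loss = reconstruction_loss + kl_weight(epoch) * kl_loss
for epoch in (0, 2, 5, 10, 20):
    print(epoch, kl_weight(epoch))
```

Setting `beta` above or below 1 after the ramp gives the β-VAE trade-off described in the list.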

References & further reading

Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv:1312.6114. The original VAE paper.
Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ICML. The parallel invention of VAEs.
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. ICML.
Higgins, I., et al. (2017). β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR. The β-VAE for disentangled representations.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 14 is the textbook treatment of autoencoders.
Doersch, C. (2016). Tutorial on Variational Autoencoders. arXiv:1606.05908. A clear, practical walkthrough of the math.
Kingma, D. P., & Welling, M. (2019). An Introduction to Variational Autoencoders. Foundations and Trends in Machine Learning. Comprehensive modern survey by the originators.
Chen, X., et al. (2017). Variational Lossy Autoencoder. ICLR. On posterior collapse and how to mitigate it.
Chalapathy, R., & Chawla, S. (2019). Deep Learning for Anomaly Detection: A Survey. arXiv:1901.03407.
Weng, L. (2018). From Autoencoder to Beta-VAE. lilianweng.github.io — a widely-read blog post with clear diagrams.