A Gaussian Process is one of those ideas that looks abstract on paper and clicks the moment you can actually play with one. The object of this post is that moment. Below is an interactive plot where the prior, the posterior, and all the kernel hyperparameters respond in real time. The prose around it tries to match what you’re seeing on the screen to the math underneath — and, in the last section, to the very un-glamorous reality of the $O(n^3)$ cost that determines when GPs earn their keep.
What is a GP, really?
The plain-English version. Imagine you’re trying to guess an unknown function from just a few measurements. A Gaussian Process is a principled way of saying: “here are all the functions I think are plausible” — and then updating that belief every time you see a new data point.
Three ideas to hold onto:
- It’s a distribution over functions, not over numbers. Instead of “$x = 3.2 \pm 0.5$”, a GP gives you “the function could be this shape, or this one, or this one” — an infinite family of curves, each with a probability.
- The kernel encodes your assumption about smoothness: points that are close in input should have similar output values. That one assumption is enough to turn a handful of measurements into a full curve with confidence bounds.
- The magic is calibrated uncertainty. Near your data the GP is confident, far from it the GP is humble and the error bars grow. That honesty is what makes GPs useful for control, optimization, and diagnostics — the model tells you when not to trust it.
A one-line mental model: a GP is linear regression with infinitely many features, where the kernel silently handles the infinite sum for you.
The one-sentence definition
A Gaussian Process is a distribution over functions such that any finite collection of function values has a joint Gaussian distribution. It is fully specified by a mean function $m(x)$ and a covariance (kernel) function $k(x, x')$:
\[f(x) \sim \mathcal{GP}\!\left( m(x),\; k(x, x') \right)\]

In practice we almost always set $m(x) = 0$ (after centering the data) and do all the modeling work through the kernel.
Interactive demo
How to read the plot
The prior
Click Prior and slide Sample paths up. Every colored curve is one function drawn from $\mathcal{GP}(0, k)$. The shaded band is the 95% credible region. Before seeing any data, the GP says “the function could be any of these.”
The length scale $\ell$ controls how wiggly these samples are:
- Small $\ell$ — nearby inputs are only weakly correlated, so samples look rough.
- Large $\ell$ — strong correlation, so samples look smooth.
The signal variance $\sigma_f^2$ scales the vertical amplitude of the prior.
The posterior
Click Posterior and then click anywhere on the plot to drop observations. Click an existing point to remove it. Two things happen:
- The mean curve threads through the observations (or very close, limited by the noise $\sigma_n$).
- The uncertainty band collapses near data points and re-opens in regions with no data.
Key takeaway. This is the main selling point for GPs: calibrated uncertainty that grows where you haven’t looked. That’s exactly why they shine in sparse-data regimes like active learning, Bayesian optimization, and safe control.
The math in three lines
Given training data $(X, y)$ with i.i.d. Gaussian noise of variance $\sigma_n^2$, the joint distribution of the observed targets and the function value $f_*$ at a test point $x_*$ is:
\[\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\; \begin{bmatrix} K(X,X) + \sigma_n^2 I & K(X, x_*) \\ K(x_*, X) & K(x_*, x_*) \end{bmatrix} \right)\]

Conditioning on $y$ using the standard Gaussian conditioning identity gives closed-form posterior mean and variance:
\[\mu_* = K(x_*, X) \left[ K(X,X) + \sigma_n^2 I \right]^{-1} y\]
\[\sigma_*^2 = K(x_*, x_*) - K(x_*, X) \left[ K(X,X) + \sigma_n^2 I \right]^{-1} K(X, x_*)\]

The widget implements exactly this. For numerical stability we use a Cholesky decomposition instead of an explicit matrix inverse:
```
K + σ_n²·I = L · Lᵀ       (Cholesky, O(n³))
α   = Lᵀ \ (L \ y)        (two triangular solves)
μ*  = k*ᵀ · α             (mean at test point)
v   = L \ k*              (triangular solve per test point)
σ*² = k(x*, x*) − vᵀv     (variance at test point)
```
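The same recipe in NumPy (a sketch, not the widget's actual JavaScript; `np.linalg.solve` stands in for dedicated triangular solves, and the RBF kernel is assumed):

```python
import numpy as np

def rbf(x1, x2, ell=1.0, sigma_f=1.0):
    d = x1[:, None] - x2[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / ell)**2)

def gp_posterior(X, y, Xs, ell=1.0, sigma_f=1.0, sigma_n=0.1):
    # Cholesky-based posterior mean and variance at test points Xs.
    K = rbf(X, X, ell, sigma_f) + sigma_n**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                      # K + sigma_n^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = rbf(X, Xs, ell, sigma_f)                  # cross-covariance, n x n_test
    mu = Ks.T @ alpha                              # posterior mean
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(Xs, Xs, ell, sigma_f)) - np.sum(v**2, axis=0)
    return mu, var
```

With small `sigma_n` the mean interpolates the observations, and the variance collapses near them and reverts to the prior far away, matching what the plot shows.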
Common kernels
- Squared Exponential (RBF). Infinitely differentiable. Produces very smooth functions. Default choice, used in the demo above.
- Matérn ($\nu = 3/2,\ 5/2$). Controllable smoothness. Often more realistic than RBF for physical signals that are continuous but not infinitely smooth.
- Periodic. Encodes periodicity with a fixed period. Good for oscillatory signals like motor ripple or seasonal effects.
- Linear. Recovers Bayesian linear regression as a special case.
- Sums and products. Kernels are closed under addition and multiplication, so you can compose them — e.g. Periodic · RBF for a slowly decaying oscillation.
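A quick sketch of composition in code (the kernel forms are standard; a product of valid kernels is itself a valid kernel, which the assertions below check numerically):

```python
import numpy as np

def rbf(x1, x2, ell=1.0):
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / ell)**2)

def periodic(x1, x2, period=1.0, ell=1.0):
    # Standard periodic kernel: exp(-2 sin^2(pi |x - x'| / p) / ell^2)
    d = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period)**2 / ell**2)

def quasi_periodic(x1, x2):
    # Periodic * RBF: an oscillation whose correlation decays over long distances.
    return periodic(x1, x2, period=1.0) * rbf(x1, x2, ell=3.0)
```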
Hyperparameter learning
The kernel hyperparameters $\theta = \{\ell, \sigma_f, \sigma_n\}$ are typically learned by maximizing the log marginal likelihood of the observed data:
\[\log p(y \mid X, \theta) = -\tfrac{1}{2}\, y^{\!\top} \!\left[K + \sigma_n^2 I\right]^{-1} y \;-\; \tfrac{1}{2} \log \!\left| K + \sigma_n^2 I \right| \;-\; \tfrac{n}{2} \log 2\pi\]

The three terms have a clean interpretation: data fit, complexity penalty, and a constant. This is the built-in Occam’s razor that makes GPs elegant — overly wiggly models are penalized by the log-determinant term.
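The whole expression can be evaluated with one Cholesky factorization, since $\log|K + \sigma_n^2 I| = 2\sum_i \log L_{ii}$ (a NumPy sketch for the RBF kernel):

```python
import numpy as np

def log_marginal_likelihood(X, y, ell, sigma_f, sigma_n):
    n = len(X)
    d = X[:, None] - X[None, :]
    K = sigma_f**2 * np.exp(-0.5 * (d / ell)**2) + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha                   # data fit
            - np.sum(np.log(np.diag(L)))      # complexity: 0.5 * log|K|
            - 0.5 * n * np.log(2 * np.pi))    # constant
```

In practice you would hand this (or its negative) to a generic optimizer, usually over log-parameters so that $\ell$, $\sigma_f$, $\sigma_n$ stay positive.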
Computational complexity
GPs are conceptually clean but computationally heavy. The cost is dominated by a single operation: inverting (or Cholesky-factorizing) the $n \times n$ kernel matrix $K + \sigma_n^2 I$, where $n$ is the number of training points.
Exact GP — the honest numbers
| Operation | Time | Memory | What’s happening |
|---|---|---|---|
| Training (one-time) | O(n³) | O(n²) | Build K, do Cholesky K = LLᵀ, solve for α = K⁻¹y |
| Predict mean (per test point) | O(n) | O(n) | Inner product k*ᵀα — cheap once α is cached |
| Predict variance (per test point) | O(n²) | O(n) | Triangular solve v = L \ k*, then σ*² = k** − vᵀv |
| Hyperparameter learning (per iter.) | O(n³) | O(n²) | New θ → rebuild K → redo Cholesky → gradient of log marginal likelihood |
Rule of thumb. Exact GP is comfortable up to a few thousand points on a laptop. At $n \approx 10{,}000$ you’re hitting the wall: memory for $K$ alone is roughly 800 MB in float64, and a single Cholesky takes on the order of seconds — repeated at every iteration of hyperparameter learning. Beyond that, you need approximations.
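The memory number is easy to sanity-check (dense float64 storage assumed):

```python
def gram_matrix_gb(n, bytes_per_entry=8):
    # Dense n x n kernel matrix in float64: n^2 entries, 8 bytes each.
    return n * n * bytes_per_entry / 1e9

# gram_matrix_gb(10_000) -> 0.8 (GB), before counting the Cholesky factor L
```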
Why it’s $O(n^3)$
The $O(n^3)$ cost comes from the Cholesky decomposition of the kernel matrix, which is the dominant numerical step. Every time hyperparameters change during training, the matrix changes, and the whole factorization has to be redone. Mean prediction at a new test point is cheap because it’s just a dot product against a precomputed vector. Variance prediction is more expensive because each test point needs its own triangular solve against $L$.
Scaling beyond exact GP
| Method | Training | Prediction | Idea |
|---|---|---|---|
| Exact GP | O(n³) | O(n²) | Full Cholesky — baseline |
| Sparse GP / FITC | O(n·m²) | O(m²) | m inducing points summarize n training points |
| SVGP (variational) | O(m³) per batch | O(m²) | Mini-batch training; scales to millions of points |
| KISS-GP / SKI | O(n + m log m) | O(1) amortized | Structured kernel interpolation on a grid |
| Local GPs | O(k³) per local model | O(k²) | Partition input space, fit a small GP per region |
Here $m$ is the number of inducing (pseudo-)points, usually $m \ll n$, and $k$ is the local neighborhood size. In practice, $m$ in the range 50–500 is common.
Embedded context. For real-time control (GP-augmented MPC, for instance), even the $O(n)$ or $O(m)$ prediction cost matters. Common tricks: freeze hyperparameters offline, precompute $\alpha$, use a small fixed inducing set, or switch to a parametric approximation once the GP has been learned.
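The "freeze hyperparameters, precompute $\alpha$" trick can be sketched as follows (an illustrative class, RBF kernel assumed — not a production controller):

```python
import numpy as np

class FrozenGP:
    """GP with fixed hyperparameters: factor once offline, predict fast online."""

    def __init__(self, X, y, ell=1.0, sigma_f=1.0, sigma_n=0.1):
        self.X, self.ell, self.sigma_f = X, ell, sigma_f
        K = self._k(X, X) + sigma_n**2 * np.eye(len(X))
        self.L = np.linalg.cholesky(K)  # O(n^3), paid once offline
        self.alpha = np.linalg.solve(self.L.T, np.linalg.solve(self.L, y))

    def _k(self, a, b):
        d = a[:, None] - b[None, :]
        return self.sigma_f**2 * np.exp(-0.5 * (d / self.ell)**2)

    def mean(self, xs):
        # O(n) dot product per test point -- the only cost left in the loop.
        return self._k(xs, self.X) @ self.alpha
```

Variance gating would additionally need a triangular solve per query (O(n²)), which is why some real-time setups precompute the variance on a grid or drop it entirely.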
When to reach for a GP
- Small-to-medium data where uncertainty quantification matters more than raw throughput (GP inference is $O(n^3)$, impractical beyond ~10k points without approximations).
- Bayesian optimization — GP surrogate + acquisition function (EI, UCB) for expensive black-box optimization.
- Active learning — pick the next query where posterior variance is highest.
- Safe / uncertainty-aware control — GP-augmented MPC uses the GP posterior mean to correct model mismatch and the variance to gate how aggressively the controller trusts that correction.
- System identification and calibration — non-parametric regression with confidence bounds that grow in extrapolation regions.
- Diagnostics and prognostics — detect when the current operating point is out-of-distribution relative to training data.
Limitations to be honest about
- Scaling. Naïve GP is $O(n^3)$ time / $O(n^2)$ memory. Sparse variational GPs, inducing points, and structured kernels (KISS-GP, SKI) trade a little exactness for much better scaling.
- High input dimensions. Stationary kernels suffer in high-$D$ without ARD or dimensionality reduction.
- Kernel choice matters. A misspecified kernel gives confidently wrong uncertainty — the variance is only calibrated given the model.
- Non-Gaussian likelihoods. Classification and count data need approximations (Laplace, EP, variational).
The interactive demo above is implemented with vanilla JavaScript and the Canvas 2D API — Cholesky solve, posterior sampling, and kernel evaluation are all in-browser, no external libraries. View source on the page if you want to read it.